[jira] [Resolved] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-2155.
--
Resolution: Fixed

Latest patch committed in r1713885

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2155_joyce_9Nov2015.patch
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl, similar to what domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains, or just to see how much of a 
> host/domain you've covered with your crawl so far.
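
To make the "fetched and unfetched counts per domain/host" idea concrete, here is a minimal in-memory sketch. It is not the code from the attached patch: the class name, method name, and the URL-to-boolean input map are all assumptions made for illustration, and a real tool would read the statuses from the crawl database rather than from a map.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class CrawlCompletenessSketch {

  /** Hypothetical helper: returns host -> {fetchedCount, unfetchedCount}, sorted by host. */
  static Map<String, int[]> countByHost(Map<String, Boolean> fetchedByUrl) {
    Map<String, int[]> counts = new TreeMap<>();
    for (Map.Entry<String, Boolean> entry : fetchedByUrl.entrySet()) {
      // Assumes well-formed absolute URLs, so getHost() is never null here.
      String host = URI.create(entry.getKey()).getHost();
      int[] c = counts.computeIfAbsent(host, h -> new int[2]);
      if (entry.getValue()) {
        c[0]++;   // fetched
      } else {
        c[1]++;   // unfetched
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // Made-up sample data: true = fetched, false = not yet fetched.
    Map<String, Boolean> crawl = new LinkedHashMap<>();
    crawl.put("http://foo.com/a", true);
    crawl.put("http://foo.com/b", false);
    crawl.put("http://bar.foo.com/", true);
    for (Map.Entry<String, int[]> e : countByHost(crawl).entrySet()) {
      System.out.println(e.getKey() + "\tfetched=" + e.getValue()[0]
          + "\tunfetched=" + e.getValue()[1]);
    }
  }
}
```

Grouping by host keeps the sketch short; grouping by domain would need an extra step to strip subdomains.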



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-11 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000658#comment-15000658
 ] 

Lewis John McGibbney commented on NUTCH-2165:
-

It means that the remaining data is not dumped.

> FileDumper Util hard codes part-# folder name
> -
>
> Key: NUTCH-2165
> URL: https://issues.apache.org/jira/browse/NUTCH-2165
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. It seems that 
> the part-# folders are hard coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167]
>  which could prove problematic.





[jira] [Resolved] (NUTCH-2150) Add ProtocolStatus Utility

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-2150.
--
Resolution: Fixed

Resolved in r1713892

> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
> Attachments: NUTCH-2015_joyce_9Nov2015.patch
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Commented] (NUTCH-2150) Add ProtocolStatus Utility

2015-11-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000787#comment-15000787
 ] 

Hudson commented on NUTCH-2150:
---

SUCCESS: Integrated in Nutch-trunk #3305 (See 
[https://builds.apache.org/job/Nutch-trunk/3305/])
NUTCH-2150 - Update help text and remove 'current' folder requirements (joyce: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1713892])
* trunk/src/java/org/apache/nutch/util/ProtocolStatusStatistics.java


> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
> Attachments: NUTCH-2015_joyce_9Nov2015.patch
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Commented] (NUTCH-1911) Improve DomainStatistics tool command line parsing

2015-11-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000788#comment-15000788
 ] 

Hudson commented on NUTCH-1911:
---

SUCCESS: Integrated in Nutch-trunk #3305 (See 
[https://builds.apache.org/job/Nutch-trunk/3305/])
NUTCH-1911 - Recommit help fixes and remove 'current' folder requirement 
(joyce: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1713890])
* trunk/CHANGES.txt
* trunk/src/java/org/apache/nutch/util/domain/DomainStatistics.java


> Improve DomainStatistics tool command line parsing
> --
>
> Key: NUTCH-1911
> URL: https://issues.apache.org/jira/browse/NUTCH-1911
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Michael Joyce
>Priority: Trivial
> Fix For: 1.10, 1.11
>
> Attachments: NUTCH-1911_joyce_9Nov2015.patch, 
> NUTCH-1911_joyce_9Nov2015.patch
>
>
> The DomainStatistics tool could be improved based on the comments raised 
> in [this mail 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html].
> For convenience, I've also pasted them below
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the 
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work





[jira] [Commented] (NUTCH-2167) Backport TableUtil from 2.x for URL reversing

2015-11-11 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000841#comment-15000841
 ] 

Michael Joyce commented on NUTCH-2167:
--

Hi folks,

All looks good and tests run fine after moving this over for testing. I'm going 
to svn cp them over if no one has any objections.

> Backport TableUtil from 2.x for URL reversing
> -
>
> Key: NUTCH-2167
> URL: https://issues.apache.org/jira/browse/NUTCH-2167
> Project: Nutch
>  Issue Type: Sub-task
>  Components: tool
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
>
> The 
> [TableUtil|https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/TableUtil.java]
>  file provides a number of helpful utility functions for URL reversing that 
> would be useful to have in 1.x





[jira] [Work started] (NUTCH-2167) Backport TableUtil from 2.x for URL reversing

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2167 started by Michael Joyce.

> Backport TableUtil from 2.x for URL reversing
> -
>
> Key: NUTCH-2167
> URL: https://issues.apache.org/jira/browse/NUTCH-2167
> Project: Nutch
>  Issue Type: Sub-task
>  Components: tool
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
>
> The 
> [TableUtil|https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/TableUtil.java]
>  file provides a number of helpful utility functions for URL reversing that 
> would be useful to have in 1.x





[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000667#comment-15000667
 ] 

Hudson commented on NUTCH-2155:
---

SUCCESS: Integrated in Nutch-trunk #3304 (See 
[https://builds.apache.org/job/Nutch-trunk/3304/])
NUTCH-2155 - Update crawlcomplete help and drop 'current' folder requirements 
(joyce: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1713885])
* trunk/src/java/org/apache/nutch/util/CrawlCompletionStats.java


> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2155_joyce_9Nov2015.patch
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl, similar to what domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains, or just to see how much of a 
> host/domain you've covered with your crawl so far.





[jira] [Work started] (NUTCH-1911) Improve DomainStatistics tool command line parsing

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1911 started by Michael Joyce.

> Improve DomainStatistics tool command line parsing
> --
>
> Key: NUTCH-1911
> URL: https://issues.apache.org/jira/browse/NUTCH-1911
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Michael Joyce
>Priority: Trivial
> Fix For: 1.10, 1.11
>
> Attachments: NUTCH-1911_joyce_9Nov2015.patch, 
> NUTCH-1911_joyce_9Nov2015.patch
>
>
> The DomainStatistics tool could be improved based on the comments raised 
> in [this mail 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html].
> For convenience, I've also pasted them below
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the 
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work





[jira] [Resolved] (NUTCH-1911) Improve DomainStatistics tool command line parsing

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-1911.
--
Resolution: Fixed

Resolved in r1713890

> Improve DomainStatistics tool command line parsing
> --
>
> Key: NUTCH-1911
> URL: https://issues.apache.org/jira/browse/NUTCH-1911
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Michael Joyce
>Priority: Trivial
> Fix For: 1.11, 1.10
>
> Attachments: NUTCH-1911_joyce_9Nov2015.patch, 
> NUTCH-1911_joyce_9Nov2015.patch
>
>
> The DomainStatistics tool could be improved based on the comments raised 
> in [this mail 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html].
> For convenience, I've also pasted them below
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the 
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work





[jira] [Work started] (NUTCH-2150) Add ProtocolStatus Utility

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2150 started by Michael Joyce.

> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
> Attachments: NUTCH-2015_joyce_9Nov2015.patch
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Work started] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2155 started by Michael Joyce.

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2155_joyce_9Nov2015.patch
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl, similar to what domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains, or just to see how much of a 
> host/domain you've covered with your crawl so far.





[jira] [Created] (NUTCH-2166) Add reverse URL format to dump tool

2015-11-11 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2166:


 Summary: Add reverse URL format to dump tool
 Key: NUTCH-2166
 URL: https://issues.apache.org/jira/browse/NUTCH-2166
 Project: Nutch
  Issue Type: Improvement
  Components: tool
Affects Versions: 1.10, 2.3
Reporter: Michael Joyce
Assignee: Michael Joyce
 Fix For: 2.4, 1.11


Update the FileDumper tool with an option for dumping files to the output 
directory in reverse URL format.

So the file for 
http://bar.foo.com:8983/to/index.html?a=b

Would dump to
/com/foo/bar/8983/http/to/index.html?a=b
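
As a rough illustration of the mapping above, the following sketch turns a URL into the reversed layout using only `java.net.URI`. The class and method names are hypothetical, and this is not the FileDumper or TableUtil code. Omitting the port segment when the URL has no explicit port is an assumption; the issue text only shows the explicit-port case.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ReverseUrlPathSketch {

  /** Hypothetical helper: maps a URL to /reversed/host/port/scheme/path?query. */
  static String toReversePath(String url) {
    URI uri = URI.create(url);
    // Assumes an absolute URL with a host; getHost() would be null otherwise.
    List<String> hostParts = new ArrayList<>(Arrays.asList(uri.getHost().split("\\.")));
    Collections.reverse(hostParts);            // bar.foo.com -> com, foo, bar

    StringBuilder out = new StringBuilder();
    for (String part : hostParts) {
      out.append('/').append(part);
    }
    if (uri.getPort() != -1) {                 // -1 means the URL carries no explicit port
      out.append('/').append(uri.getPort());
    }
    out.append('/').append(uri.getScheme());
    if (uri.getRawPath() != null) {
      out.append(uri.getRawPath());            // raw path already starts with '/'
    }
    if (uri.getRawQuery() != null) {
      out.append('?').append(uri.getRawQuery());
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(toReversePath("http://bar.foo.com:8983/to/index.html?a=b"));
  }
}
```

Run against the URL from the issue, this prints the path given above.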





[jira] [Work started] (NUTCH-2166) Add reverse URL format to dump tool

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2166 started by Michael Joyce.

> Add reverse URL format to dump tool
> ---
>
> Key: NUTCH-2166
> URL: https://issues.apache.org/jira/browse/NUTCH-2166
> Project: Nutch
>  Issue Type: Improvement
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> Update the FileDumper tool with an option for dumping files to the output 
> directory in reverse URL format.
> So the file for 
> http://bar.foo.com:8983/to/index.html?a=b
> Would dump to
> /com/foo/bar/8983/http/to/index.html?a=b





[jira] [Created] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-11 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2165:


 Summary: FileDumper Util hard codes part-# folder name
 Key: NUTCH-2165
 URL: https://issues.apache.org/jira/browse/NUTCH-2165
 Project: Nutch
  Issue Type: Bug
  Components: tool
Affects Versions: 1.10, 2.3
Reporter: Michael Joyce
 Fix For: 2.4, 1.11


Hi folks, [~lewismc] and I were just discussing this off list. It seems that 
the part-# folders are hard coded to part-0 in the [FileDumper 
utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167]
 which could prove problematic.





Updates to CHANGES.txt on commit

2015-11-11 Thread Michael Joyce
Hi folks,

It seems like our usual workflow is to update CHANGES on commit (correct me
if I'm wrong here). What do we think about pulling the CHANGES updates from
JIRA as part of our release prep instead? Seems like it would be a bit less
error prone, although I do understand people's desire to have CHANGES up
to date all the time.

Thoughts?

-- Jimmy


[jira] [Created] (NUTCH-2167) Backport TableUtil from 2.x for URL reversing

2015-11-11 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2167:


 Summary: Backport TableUtil from 2.x for URL reversing
 Key: NUTCH-2167
 URL: https://issues.apache.org/jira/browse/NUTCH-2167
 Project: Nutch
  Issue Type: Sub-task
  Components: tool
Affects Versions: 1.10
Reporter: Michael Joyce
Assignee: Michael Joyce
 Fix For: 1.11


The 
[TableUtil|https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/TableUtil.java]
 file provides a number of helpful utility functions for URL reversing that 
would be useful to have in 1.x





Re: Updates to CHANGES.txt on commit

2015-11-11 Thread Mattmann, Chris A (3980)
Mike, I honestly prefer just having it as a text file. If you search
way back in the logs, Doug talked about this long ago, but I generally
agree. JIRA would be nice, but I just like to keep it up to date in text
and in JIRA.

Sorry for the dupe work but it pays off.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: Mike Joyce  on behalf of Michael Joyce

Reply-To: "dev@nutch.apache.org" 
Date: Wednesday, November 11, 2015 at 11:21 AM
To: "dev@nutch.apache.org" 
Subject: Updates to CHANGES.txt on commit

>Hi folks,
>
>
>It seems like our usual workflow is to update CHANGES on commit (correct
>me if I'm wrong here). What do we think about pulling the CHANGES updates
>from JIRA as part of our release prep instead? Seems like it would be a
>bit less error prone, although I do understand people's desire to have
>CHANGES up to date all the time.
>
>
>Thoughts?
>
>
>-- Jimmy



[jira] [Commented] (NUTCH-2167) Backport TableUtil from 2.x for URL reversing

2015-11-11 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000912#comment-15000912
 ] 

Lewis John McGibbney commented on NUTCH-2167:
-

Yes, an example of this being useful is within the filedumper. For example, if 
we can reverse URLs, then raw content can be sent to S3 for archived storage but 
also retrieved with minimal effort, as we can then just re-reverse the URL.
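
A sketch of what "re-reversing" could look like, assuming the /reversed/host/port/scheme/path layout proposed in NUTCH-2166. The class and method names are hypothetical and this is not TableUtil's API. It also assumes that the scheme is `http` or `https`, that no host label is itself named "http" or "https", and that the leftmost host label is not purely numeric (otherwise it would be mistaken for a port).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class UnreverseUrlPathSketch {

  /** Hypothetical helper: rebuilds the URL from a reversed-host storage path. */
  static String toUrl(String reversedPath) {
    // Drop the leading '/' and keep trailing empty segments (limit -1) so a
    // trailing slash in the original URL path survives the round trip.
    String[] segs = reversedPath.substring(1).split("/", -1);

    // The first http/https segment marks where the host (and optional port) ends.
    int schemeIdx = -1;
    for (int i = 0; i < segs.length; i++) {
      if (segs[i].equals("http") || segs[i].equals("https")) {
        schemeIdx = i;
        break;
      }
    }
    if (schemeIdx < 1) {
      throw new IllegalArgumentException("no host/scheme found in: " + reversedPath);
    }

    // An all-digit segment right before the scheme is treated as the port.
    int hostEnd = schemeIdx;
    String port = null;
    if (schemeIdx >= 2 && segs[schemeIdx - 1].matches("\\d+")) {
      port = segs[schemeIdx - 1];
      hostEnd = schemeIdx - 1;
    }

    List<String> hostParts = new ArrayList<>(Arrays.asList(segs).subList(0, hostEnd));
    Collections.reverse(hostParts);            // com, foo, bar -> bar.foo.com

    StringBuilder url = new StringBuilder(segs[schemeIdx])
        .append("://")
        .append(String.join(".", hostParts));
    if (port != null) {
      url.append(':').append(port);
    }
    for (int i = schemeIdx + 1; i < segs.length; i++) {
      url.append('/').append(segs[i]);         // path and query are passed through untouched
    }
    return url.toString();
  }

  public static void main(String[] args) {
    System.out.println(toUrl("/com/foo/bar/8983/http/to/index.html?a=b"));
  }
}
```

Because the stored key already encodes host, port, scheme, and path, retrieval needs no separate lookup table, which is the "minimal effort" point made above.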

> Backport TableUtil from 2.x for URL reversing
> -
>
> Key: NUTCH-2167
> URL: https://issues.apache.org/jira/browse/NUTCH-2167
> Project: Nutch
>  Issue Type: Sub-task
>  Components: tool
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
>
> The 
> [TableUtil|https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/TableUtil.java]
>  file provides a number of helpful utilities functions for URL reversing that 
> would be useful to have in 1.x





[jira] [Work started] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2165 started by Michael Joyce.

> FileDumper Util hard codes part-# folder name
> -
>
> Key: NUTCH-2165
> URL: https://issues.apache.org/jira/browse/NUTCH-2165
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. It seems that 
> the part-# folders are hard coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167]
>  which could prove problematic.





[jira] [Assigned] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce reassigned NUTCH-2165:


Assignee: Michael Joyce

> FileDumper Util hard codes part-# folder name
> -
>
> Key: NUTCH-2165
> URL: https://issues.apache.org/jira/browse/NUTCH-2165
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. It seems that 
> the part-# folders are hard coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167]
>  which could prove problematic.





[jira] [Commented] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-11 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000910#comment-15000910
 ] 

Michael Joyce commented on NUTCH-2165:
--

Oh aye

> FileDumper Util hard codes part-# folder name
> -
>
> Key: NUTCH-2165
> URL: https://issues.apache.org/jira/browse/NUTCH-2165
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. It seems that 
> the part-# folders are hard coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167]
>  which could prove problematic.





[jira] [Updated] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2165:
-
Attachment: NUTCH-2165_joyce_11Nov2015.patch

Patch attached

> FileDumper Util hard codes part-# folder name
> -
>
> Key: NUTCH-2165
> URL: https://issues.apache.org/jira/browse/NUTCH-2165
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
> Attachments: NUTCH-2165_joyce_11Nov2015.patch
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. It seems that 
> the part-# folders are hard coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167]
>  which could prove problematic.





[jira] [Commented] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-11 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000923#comment-15000923
 ] 

Michael Joyce commented on NUTCH-2165:
--

Note, the diff looks massive here. This is really just adding an extra loop 
over the parts directories in each segment directory. The tool could probably 
use a bit of cleanup love, but we can address that in a later patch.
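
For readers who have not opened the patch, the "extra loop over the parts directories" can be pictured roughly as below. This is a sketch rather than the patch itself. It uses `java.nio.file` instead of the Hadoop file system API to stay self-contained, the class and method names are hypothetical, and the `content/part-*/data` layout inside a segment is an assumption about the on-disk structure.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class PartDirsSketch {

  /** Hypothetical helper: returns every part-* directory under segment/content, sorted by name. */
  static List<Path> listPartDirs(Path segmentDir) throws IOException {
    List<Path> parts = new ArrayList<>();
    // The glob picks up part-00000, part-00001, ... instead of one fixed name.
    try (DirectoryStream<Path> stream =
             Files.newDirectoryStream(segmentDir.resolve("content"), "part-*")) {
      for (Path candidate : stream) {
        if (Files.isDirectory(candidate)) {
          parts.add(candidate);
        }
      }
    }
    Collections.sort(parts);
    return parts;
  }

  public static void main(String[] args) throws IOException {
    // Build a throwaway segment with two part directories to exercise the loop.
    Path segment = Files.createTempDirectory("segment");
    Files.createDirectories(segment.resolve("content").resolve("part-00000"));
    Files.createDirectories(segment.resolve("content").resolve("part-00001"));
    for (Path part : listPartDirs(segment)) {
      System.out.println("would dump " + part.resolve("data"));
    }
  }
}
```

With a single hard-coded part name, only the first directory would be visited and the data in the remaining parts would never be dumped.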

> FileDumper Util hard codes part-# folder name
> -
>
> Key: NUTCH-2165
> URL: https://issues.apache.org/jira/browse/NUTCH-2165
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
> Attachments: NUTCH-2165_joyce_11Nov2015.patch
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. It seems that 
> the part-# folders are hard coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167]
>  which could prove problematic.





[jira] [Updated] (NUTCH-2120) Remove MapWritable from trunk codebase

2015-11-11 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2120:

Issue Type: Task  (was: Bug)

> Remove MapWritable from trunk codebase
> --
>
> Key: NUTCH-2120
> URL: https://issues.apache.org/jira/browse/NUTCH-2120
> Project: Nutch
>  Issue Type: Task
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.12
>
> Attachments: NUTCH-2120.patch
>
>
> [MapWritable|http://nutch.apache.org/apidocs/apidocs-1.10/index.html?org/apache/nutch/crawl/MapWritable.htm]
>  has been deprecated for a good while.
> We should remove it from the codebase and make sure we are not using it 
> anywhere (I don't think we are).





[jira] [Updated] (NUTCH-2120) Remove MapWritable from trunk codebase

2015-11-11 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2120:

 Flags: Patch
Patch Info: Patch Available

> Remove MapWritable from trunk codebase
> --
>
> Key: NUTCH-2120
> URL: https://issues.apache.org/jira/browse/NUTCH-2120
> Project: Nutch
>  Issue Type: Task
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.12
>
> Attachments: NUTCH-2120.patch
>
>
> [MapWritable|http://nutch.apache.org/apidocs/apidocs-1.10/index.html?org/apache/nutch/crawl/MapWritable.htm]
>  has been deprecated for a good while.
> We should remove it from the codebase and make sure we are not using it 
> anywhere (I don't think we are).





[jira] [Commented] (NUTCH-2160) Upgrade Selenium Java to 2.48.2

2015-11-11 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15001105#comment-15001105
 ] 

Lewis John McGibbney commented on NUTCH-2160:
-

Will commit by EoB today unless there are objections

> Upgrade Selenium Java to 2.48.2
> ---
>
> Key: NUTCH-2160
> URL: https://issues.apache.org/jira/browse/NUTCH-2160
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-2160.patch
>
>
> Current Selenium support is pegged at a very old version of Firefox. The 
> attached patch, running with the most recent version of Selenium Java, works 
> with Firefox 38.4.0 very well. The remainder of the lib-selenium dependencies 
> have also been updated.
> Thanks
> [~kwhitehall] can you please scope if you get a wee minute?





[jira] [Updated] (NUTCH-2120) Remove MapWritable from trunk codebase

2015-11-11 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2120:

Attachment: NUTCH-2120.patch

Patch which removes this class from Trunk.

> Remove MapWritable from trunk codebase
> --
>
> Key: NUTCH-2120
> URL: https://issues.apache.org/jira/browse/NUTCH-2120
> Project: Nutch
>  Issue Type: Bug
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.12
>
> Attachments: NUTCH-2120.patch
>
>
> [MapWritable|http://nutch.apache.org/apidocs/apidocs-1.10/index.html?org/apache/nutch/crawl/MapWritable.htm]
>  has been deprecated for a good while.
> We should remove it from the codebase and make sure we are not using it 
> anywhere (I don't think we are).


