[jira] [Resolved] (NUTCH-2218) Switch CrawlCompletion arg parsing to Commons CLI

2016-02-18 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-2218.
--
Resolution: Fixed

[~lewismc], This got merged. I added an example to the option you raised as 
well. If that doesn't address your concerns let me know and I'll update in 
another ticket.

{code}
| -> ./bin/nutch crawlcomplete
usage: CrawlCompletionStats [-h] -inputDirs  -mode 
   [-numReducers ] -outputDir 
 -h,--help        Show this message
 -inputDirs       Comma separated list of crawl directories
                  (e.g., "./crawl1,./crawl2")
 -mode            Set statistics gathering mode (by 'host' or
                  by 'domain')
 -numReducers     Optional number of reduce jobs to use.
                  Defaults to 1
 -outputDir       Output directory where results should be
                  dumped
{code}
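
The usage text above implies a few value-level rules that hold whichever parser 
hands over the raw strings: -inputDirs is split on commas, -mode accepts only 
'host' or 'domain', and -numReducers falls back to 1. A minimal sketch of those 
rules in plain Java follows. The class and method names are hypothetical; this 
is not the CrawlCompletionStats source and it deliberately does not use Commons 
CLI, so it stays free of dependencies.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Nutch. */
class CrawlCompleteOptions {

    /** Splits a comma separated -inputDirs value, e.g. "./crawl1,./crawl2". */
    static List<String> splitInputDirs(String value) {
        List<String> dirs = new ArrayList<String>();
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                dirs.add(trimmed);
            }
        }
        return dirs;
    }

    /** Accepts only the two modes named in the usage text. */
    static String checkMode(String value) {
        if ("host".equals(value) || "domain".equals(value)) {
            return value;
        }
        throw new IllegalArgumentException(
            "mode must be 'host' or 'domain': " + value);
    }

    /** Falls back to 1 when -numReducers was not supplied (null). */
    static int reducersOrDefault(String value) {
        return value == null ? 1 : Integer.parseInt(value);
    }
}
```

With a CLI library in place, checks like these would run on the values it 
returns rather than on the raw argument array.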

> Switch CrawlCompletion arg parsing to Commons CLI
> -
>
> Key: NUTCH-2218
> URL: https://issues.apache.org/jira/browse/NUTCH-2218
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.11
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>Priority: Minor
> Fix For: 1.12
>
>
> The current CrawlCompletion utility should be updated to use commons CLI 
> instead of doing manual arg parsing and checking.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2218) Switch CrawlCompletion arg parsing to Commons CLI

2016-02-18 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152731#comment-15152731
 ] 

Michael Joyce commented on NUTCH-2218:
--

Sorry for any confusion here folks. Changes were merged in r1731102. The README 
was updated in r1731103 since I forgot to update it. PR should be closed in r1731103.

> Switch CrawlCompletion arg parsing to Commons CLI
> -
>
> Key: NUTCH-2218
> URL: https://issues.apache.org/jira/browse/NUTCH-2218
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.11
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>Priority: Minor
> Fix For: 1.12
>
>
> The current CrawlCompletion utility should be updated to use commons CLI 
> instead of doing manual arg parsing and checking.





[jira] [Updated] (NUTCH-2218) Switch CrawlCompletion arg parsing to Commons CLI

2016-02-12 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2218:
-
Issue Type: Improvement  (was: Bug)

> Switch CrawlCompletion arg parsing to Commons CLI
> -
>
> Key: NUTCH-2218
> URL: https://issues.apache.org/jira/browse/NUTCH-2218
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.11
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>Priority: Minor
> Fix For: 1.12
>
>
> The current CrawlCompletion utility should be updated to use commons CLI 
> instead of doing manual arg parsing and checking.





[jira] [Created] (NUTCH-2187) Change FileDumper SHAs to all uppercase

2015-12-16 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2187:


 Summary: Change FileDumper SHAs to all uppercase
 Key: NUTCH-2187
 URL: https://issues.apache.org/jira/browse/NUTCH-2187
 Project: Nutch
  Issue Type: Improvement
  Components: tool
Affects Versions: 1.11
Reporter: Michael Joyce
Assignee: Michael Joyce
 Fix For: 1.12


It would be nice to have the reverseUrlDirs option dump SHAs in all uppercase 
for consistency





[jira] [Resolved] (NUTCH-2187) Change FileDumper SHAs to all uppercase

2015-12-16 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-2187.
--
Resolution: Duplicate

Going to just resolve this in NUTCH-2182. Thought that patch had already been 
committed.

> Change FileDumper SHAs to all uppercase
> ---
>
> Key: NUTCH-2187
> URL: https://issues.apache.org/jira/browse/NUTCH-2187
> Project: Nutch
>  Issue Type: Improvement
>  Components: tool
>Affects Versions: 1.11
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.12
>
>
> It would be nice to have the reverseUrlDirs option dump SHAs in all 
> uppercase for consistency





[jira] [Resolved] (NUTCH-2182) Make reverseUrlDirs file dumper option hash the URL for consistency

2015-12-16 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-2182.
--
Resolution: Fixed

Resolved in r1720466

> Make reverseUrlDirs file dumper option hash the URL for consistency
> ---
>
> Key: NUTCH-2182
> URL: https://issues.apache.org/jira/browse/NUTCH-2182
> Project: Nutch
>  Issue Type: Improvement
>  Components: tool
>Affects Versions: 1.11
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.12
>
> Attachments: NUTCH-2182_joyce_8Dec2015.patch
>
>
> At the moment the "reverseUrlDirs" option for FileDumper is terribly brittle 
> and fails on a fair number of edge cases. A more robust way to handle the 
> reverse URL approach to dumping a file is to reverse the server part and hash 
> the URL to use as the file name. This gives us a nice split of files while 
> avoiding a number of likely classes that cause dumps to fail.





[jira] [Commented] (NUTCH-2180) FileDumper dumps data, but breaks midway on corrupt segments

2015-12-09 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048889#comment-15048889
 ] 

Michael Joyce commented on NUTCH-2180:
--

Thanks for the patch [~hmanjuna], will scope shortly

> FileDumper dumps data, but breaks midway on corrupt segments
> 
>
> Key: NUTCH-2180
> URL: https://issues.apache.org/jira/browse/NUTCH-2180
> Project: Nutch
>  Issue Type: Bug
>  Components: bin, dumpers
>Affects Versions: 1.11
> Environment: Ubuntu 14.04.3 x64 
>Reporter: Harshavardhan Manjunatha
>Assignee: Michael Joyce
> Fix For: 1.11
>
>
> FileDumper should ignore corrupt segments, and continue dumping data instead 
> of throwing NullPointerException
> $ bin/nutch dump -segment ../../../segments/ -outputDir ./firstDump/ -flatdir
> java.lang.NullPointerException
>   at org.apache.nutch.tools.FileDumper.dump(FileDumper.java:175)
>   at org.apache.nutch.tools.FileDumper.main(FileDumper.java:417)
> $
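
One way to get the requested behaviour, sketched under assumptions: run the 
per-segment work inside a try/catch so that a failing segment is reported and 
skipped while the loop carries on. The SegmentDumper interface and the dumpAll 
method are hypothetical stand-ins, not FileDumper code, and what actually makes 
a segment "corrupt" is not modelled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the FileDumper source. */
class SegmentLoop {

    /** Stand-in for whatever dumps a single segment. */
    interface SegmentDumper {
        void dump(String segmentPath) throws Exception;
    }

    /**
     * Dumps every segment, skipping those that fail, and returns the
     * paths that were skipped so the caller can report them.
     */
    static List<String> dumpAll(List<String> segments, SegmentDumper dumper) {
        List<String> skipped = new ArrayList<String>();
        for (String segment : segments) {
            try {
                dumper.dump(segment);
            } catch (Exception e) {
                // A bad segment should not abort the whole run.
                System.err.println("Skipping segment " + segment + ": " + e);
                skipped.add(segment);
            }
        }
        return skipped;
    }
}
```

Catching the exception per segment is the bluntest option. Checking up front 
for the data that turned out to be null would be cleaner, but that needs 
knowledge of the segment layout that this ticket does not spell out.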





[jira] [Assigned] (NUTCH-2180) FileDumper dumps data, but breaks midway on corrupt segments

2015-12-09 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce reassigned NUTCH-2180:


Assignee: Michael Joyce

> FileDumper dumps data, but breaks midway on corrupt segments
> 
>
> Key: NUTCH-2180
> URL: https://issues.apache.org/jira/browse/NUTCH-2180
> Project: Nutch
>  Issue Type: Bug
>  Components: bin, dumpers
>Affects Versions: 1.11
> Environment: Ubuntu 14.04.3 x64 
>Reporter: Harshavardhan Manjunatha
>Assignee: Michael Joyce
> Fix For: 1.11
>
>
> FileDumper should ignore corrupt segments, and continue dumping data instead 
> of throwing NullPointerException
> $ bin/nutch dump -segment ../../../segments/ -outputDir ./firstDump/ -flatdir
> java.lang.NullPointerException
>   at org.apache.nutch.tools.FileDumper.dump(FileDumper.java:175)
>   at org.apache.nutch.tools.FileDumper.main(FileDumper.java:417)
> $





[jira] [Updated] (NUTCH-2182) Make reverseUrlDirs file dumper option hash the URL for consistency

2015-12-08 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2182:
-
Attachment: NUTCH-2182_joyce_8Dec2015.patch

Patch Attached

> Make reverseUrlDirs file dumper option hash the URL for consistency
> ---
>
> Key: NUTCH-2182
> URL: https://issues.apache.org/jira/browse/NUTCH-2182
> Project: Nutch
>  Issue Type: Improvement
>  Components: tool
>Affects Versions: 1.11
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.12
>
> Attachments: NUTCH-2182_joyce_8Dec2015.patch
>
>
> At the moment the "reverseUrlDirs" option for FileDumper is terribly brittle 
> and fails on a fair number of edge cases. A more robust way to handle the 
> reverse URL approach to dumping a file is to reverse the server part and hash 
> the URL to use as the file name. This gives us a nice split of files while 
> avoiding a number of likely classes that cause dumps to fail.





[jira] [Created] (NUTCH-2182) Make reverseUrlDirs file dumper option hash the URL for consistency

2015-12-08 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2182:


 Summary: Make reverseUrlDirs file dumper option hash the URL for 
consistency
 Key: NUTCH-2182
 URL: https://issues.apache.org/jira/browse/NUTCH-2182
 Project: Nutch
  Issue Type: Improvement
  Components: tool
Affects Versions: 1.11
Reporter: Michael Joyce
Assignee: Michael Joyce
 Fix For: 1.12


At the moment the "reverseUrlDirs" option for FileDumper is terribly brittle 
and fails on a fair number of edge cases. A more robust way to handle the 
reverse URL approach to dumping a file is to reverse the server part and hash 
the URL to use as the file name. This gives us a nice split of files while 
avoiding a number of likely classes that cause dumps to fail.
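
A rough sketch of the idea, assuming SHA-256 as the digest (the ticket only 
says "hash") and uppercase hex output as asked for in NUTCH-2187. The class and 
method names are hypothetical and this is not the attached patch.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch; not the FileDumper source. */
class HashedDumpPath {

    /** Reverses the dot separated host: "bar.foo.com" becomes "com/foo/bar". */
    static String reverseHost(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) {
                sb.append('/');
            }
        }
        return sb.toString();
    }

    /** Uppercase hex SHA-256 of the full URL (digest choice is an assumption). */
    static String sha256Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02X", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Directory from the reversed host, file name from the URL hash. */
    static String dumpPath(String url) {
        String host = URI.create(url).getHost();
        return reverseHost(host) + "/" + sha256Hex(url);
    }
}
```

Because the file name is a fixed-length hex string, query strings, long paths 
and odd characters in the URL no longer influence what ends up on disk.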





[jira] [Commented] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-23 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15023388#comment-15023388
 ] 

Michael Joyce commented on NUTCH-2158:
--

+1 on this. Looks good to me

> Upgrade to Tika 1.11
> 
>
> Key: NUTCH-2158
> URL: https://issues.apache.org/jira/browse/NUTCH-2158
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2158-test-protocol-http.patch, NUTCH-2158.patch
>
>
> Upgrade parse-tika to 1.11 release for Tika.





[jira] [Created] (NUTCH-2173) String.join in FileDumper breaks the build

2015-11-18 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2173:


 Summary: String.join in FileDumper breaks the build
 Key: NUTCH-2173
 URL: https://issues.apache.org/jira/browse/NUTCH-2173
 Project: Nutch
  Issue Type: Bug
  Components: tool
Affects Versions: 1.10
Reporter: Michael Joyce
Assignee: Michael Joyce
 Fix For: 1.11


The new FileDumper changes use a Java 1.8 String method that breaks the build 
on Java 1.7

{code}
String.join
{code}

Thanks [~kwhitehall] for finding this.
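
For reference, a join that compiles on Java 1.7 can be written with a 
StringBuilder. This is only an illustration of the kind of replacement that 
avoids String.join; the helper name is hypothetical and it is not necessarily 
what was committed for this issue.

```java
import java.util.Iterator;

/** Hypothetical helper; not the change committed for NUTCH-2173. */
class Java7Join {

    /** Joins parts with a separator without relying on String.join (Java 8+). */
    static String join(String separator, Iterable<String> parts) {
        StringBuilder sb = new StringBuilder();
        Iterator<String> it = parts.iterator();
        while (it.hasNext()) {
            sb.append(it.next());
            if (it.hasNext()) {
                sb.append(separator);
            }
        }
        return sb.toString();
    }
}
```

Calling Java7Join.join("/", Arrays.asList("com", "foo", "bar")) gives 
"com/foo/bar", the same result String.join would produce on 1.8.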





[jira] [Work started] (NUTCH-2173) String.join in FileDumper breaks the build

2015-11-18 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2173 started by Michael Joyce.

> String.join in FileDumper breaks the build
> --
>
> Key: NUTCH-2173
> URL: https://issues.apache.org/jira/browse/NUTCH-2173
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
>
> The new FileDumper changes use a Java 1.8 String method that breaks the build 
> on Java 1.7
> {code}
> String.join
> {code}
> Thanks [~kwhitehall] for finding this.





[jira] [Resolved] (NUTCH-2173) String.join in FileDumper breaks the build

2015-11-18 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-2173.
--
Resolution: Fixed

Resolved in r1715046

> String.join in FileDumper breaks the build
> --
>
> Key: NUTCH-2173
> URL: https://issues.apache.org/jira/browse/NUTCH-2173
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
>
> The new FileDumper changes use a Java 1.8 String method that breaks the build 
> on Java 1.7
> {code}
> String.join
> {code}
> Thanks [~kwhitehall] for finding this.





[jira] [Resolved] (NUTCH-2166) Add reverse URL format to dump tool

2015-11-17 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-2166.
--
Resolution: Fixed

Committed in r1714908

> Add reverse URL format to dump tool
> ---
>
> Key: NUTCH-2166
> URL: https://issues.apache.org/jira/browse/NUTCH-2166
> Project: Nutch
>  Issue Type: Improvement
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
> Attachments: NUTCH-2166_joyce_13Nov2015.patch
>
>
> Update the FileDumper tool with an option for dumping files to the output 
> directory in reverse URL format.
> So the file for 
> http://bar.foo.com:8983/to/index.html?a=b
> Would dump to
> /com/foo/bar/8983/http/to/index.html?a=b





[jira] [Commented] (NUTCH-2166) Add reverse URL format to dump tool

2015-11-13 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004191#comment-15004191
 ] 

Michael Joyce commented on NUTCH-2166:
--

Output from a small example run. I don't know that I'm terribly happy with the 
_file solution. Open to ideas on that.

{code}
dumpoutputtest/
├── edu
│   └── caltech
│       └── www
│           └── http
│               └── _file
└── gov
    └── nasa
        ├── eyes
        │   └── http
        │       ├── _file
        │       ├── earth
        │       │   └── _file
        │       └── exoplanets
        │           └── _file
        ├── jpl
        │   ├── blogs
        │   │   └── http
        │   │       └── _file
        │   ├── http
        │   │   └── _file
        │   ├── mars
        │   │   └── http
        │   │       └── _file
        │   ├── photojournal
        │   │   └── http
        │   │       └── _file
        │   ├── planetquest
        │   │   └── http
        │   │       └── _file
        │   └── www
        │       └── http
        │           ├── _file
        │           ├── about
        │           │   ├── _file
        │           │   ├── exec.php
        │           │   ├── history.php
        │           │   └── reports.php
        │           ├── apps
        │           │   └── _file
        │           ├── asteroidwatch
        │           │   └── _file
        │           ├── contact_JPL.php
        │           ├── edu
        │           │   ├── _file
        │           │   ├── events
        │           │   │   ├── 2015
        │           │   │   │   └── 11
        │           │   │   │       └── 1
        │           │   │   │           └── see-the-phases-of-the-moon-by-day-and-by-night
        │           │   │   │               └── _file
        │           │   │   └── _file
        │           │   ├── intern
        │           │   │   └── _file
        │           │   ├── learn
        │           │   │   └── _file
        │           │   ├── news
        │           │   │   └── _file
        │           │   └── teach
        │           │       └── _file
        │           ├── events
        │           │   ├── _file
        │           │   ├── lectures.php
        │           │   ├── open-house.php
        │           │   ├── speakers-bureau.php
        │           │   ├── team-competitions.php
        │           │   └── tours
        │           │       └── views
        │           │           └── _file
        │           ├── infographics
        │           │   └── _file
        │           ├── missions
        │           │   └── _file
        │           ├── multimedia
        │           │   └── audio.php
        │           ├── news
        │           │   ├── _file
        │           │   ├── factsheets.php
        │           │   ├── mediaroom.php
        │           │   └── presskits.php
        │           ├── opportunities
        │           │   └── _file
        │           ├── social
        │           │   └── _file
        │           ├── spaceimages
        │           │   └── _file
        │           └── videos
        │               └── _file
        ├── solarsystem
        │   └── http
        │       └── _file
        └── www
            └── http
                ├── _file
                └── earthrightnow
                    └── _file

51 directories, 44 files

{code}

> Add reverse URL format to dump tool
> ---
>
> Key: NUTCH-2166
> URL: https://issues.apache.org/jira/browse/NUTCH-2166
> Project: Nutch
>  Issue Type: Improvement
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
> Attachments: NUTCH-2166_joyce_13Nov2015.patch
>
>
> Update the FileDumper tool with an option for dumping files to the output 
> directory in reverse URL format.
> So the file for 
> http://bar.foo.com:8983/to/index.html?a=b
> Would dump to
> /com/foo/bar/8983/http/to/index.html?a=b





[jira] [Commented] (NUTCH-2166) Add reverse URL format to dump tool

2015-11-12 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002328#comment-15002328
 ] 

Michael Joyce commented on NUTCH-2166:
--

Small change in dump format. Instead of making a bajillion nested folders it 
seems like it might be nicer to simply use the reverse URL as the file name.

So the file for 
http://bar.foo.com:8983/to/index.htm
Would dump to the encoded
/com%2Ffoo%2Fbar%2F8983%2Fhttp%2Fto%2Findex.htm

Of course, we may then run into file name length issues this way. Perhaps 
having both eventually will be useful?
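
To make the trade-off concrete, here is a small sketch that percent-encodes a 
reversed path into a single file name and checks it against a length limit. 
The 255 character limit is an assumption (a common per-name cap on many 
filesystems, not something Nutch defines), the leading "/" from the example 
above is left out of the input, and the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical sketch; not the FileDumper source. */
class EncodedDumpName {

    /** Assumed per-name limit; many filesystems cap a name at 255. */
    static final int MAX_NAME_LENGTH = 255;

    /** Percent-encodes the reversed path so that "/" becomes "%2F". */
    static String encode(String reversedPath) {
        try {
            return URLEncoder.encode(reversedPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    /** True when the encoded name stays within the assumed limit. */
    static boolean fitsInFileName(String encoded) {
        return encoded.length() <= MAX_NAME_LENGTH;
    }
}
```

Each "/" costs three characters once encoded, so deep paths reach the limit 
sooner than their raw length suggests, which is the file name length concern 
raised above.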

> Add reverse URL format to dump tool
> ---
>
> Key: NUTCH-2166
> URL: https://issues.apache.org/jira/browse/NUTCH-2166
> Project: Nutch
>  Issue Type: Improvement
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> Update the FileDumper tool with an option for dumping files to the output 
> directory in reverse URL format.
> So the file for 
> http://bar.foo.com:8983/to/index.html?a=b
> Would dump to
> /com/foo/bar/8983/http/to/index.html?a=b





[jira] [Resolved] (NUTCH-2167) Backport TableUtil from 2.x for URL reversing

2015-11-12 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-2167.
--
Resolution: Fixed

TableUtil copied over in r1714078 and tests copied over in r1714079

> Backport TableUtil from 2.x for URL reversing
> -
>
> Key: NUTCH-2167
> URL: https://issues.apache.org/jira/browse/NUTCH-2167
> Project: Nutch
>  Issue Type: Sub-task
>  Components: tool
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
>
> The 
> [TableUtil|https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/TableUtil.java]
>  file provides a number of helpful utility functions for URL reversing that 
> would be useful to have in 1.x





[jira] [Commented] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-12 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002604#comment-15002604
 ] 

Michael Joyce commented on NUTCH-2165:
--

Thanks [~lewismc], I'll merge shortly

> FileDumper Util hard codes part-# folder name
> -
>
> Key: NUTCH-2165
> URL: https://issues.apache.org/jira/browse/NUTCH-2165
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
> Attachments: NUTCH-2165_joyce_11Nov2015.patch
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. It seems that 
> the part-# folders are hard coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167]
>  which could prove problematic.





[jira] [Resolved] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-12 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-2165.
--
Resolution: Fixed

Committed in r1714104

> FileDumper Util hard codes part-# folder name
> -
>
> Key: NUTCH-2165
> URL: https://issues.apache.org/jira/browse/NUTCH-2165
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
> Attachments: NUTCH-2165_joyce_11Nov2015.patch
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. It seems that 
> the part-# folders are hard coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167]
>  which could prove problematic.





[jira] [Resolved] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-2155.
--
Resolution: Fixed

Latest patch committed in r1713885

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2155_joyce_9Nov2015.patch
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[jira] [Resolved] (NUTCH-2150) Add ProtocolStatus Utility

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-2150.
--
Resolution: Fixed

Resolved in r1713892

> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
> Attachments: NUTCH-2015_joyce_9Nov2015.patch
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Commented] (NUTCH-2167) Backport TableUtil from 2.x for URL reversing

2015-11-11 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000841#comment-15000841
 ] 

Michael Joyce commented on NUTCH-2167:
--

Hi folks,

All looks good and tests run fine after moving this over for testing. I'm going 
to svn cp them over if no one has any objections.

> Backport TableUtil from 2.x for URL reversing
> -
>
> Key: NUTCH-2167
> URL: https://issues.apache.org/jira/browse/NUTCH-2167
> Project: Nutch
>  Issue Type: Sub-task
>  Components: tool
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
>
> The 
> [TableUtil|https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/TableUtil.java]
>  file provides a number of helpful utility functions for URL reversing that 
> would be useful to have in 1.x





[jira] [Work started] (NUTCH-2167) Backport TableUtil from 2.x for URL reversing

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2167 started by Michael Joyce.

> Backport TableUtil from 2.x for URL reversing
> -
>
> Key: NUTCH-2167
> URL: https://issues.apache.org/jira/browse/NUTCH-2167
> Project: Nutch
>  Issue Type: Sub-task
>  Components: tool
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
>
> The 
> [TableUtil|https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/TableUtil.java]
>  file provides a number of helpful utility functions for URL reversing that 
> would be useful to have in 1.x





[jira] [Work started] (NUTCH-1911) Improve DomainStatistics tool command line parsing

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1911 started by Michael Joyce.

> Improve DomainStatistics tool command line parsing
> --
>
> Key: NUTCH-1911
> URL: https://issues.apache.org/jira/browse/NUTCH-1911
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Michael Joyce
>Priority: Trivial
> Fix For: 1.10, 1.11
>
> Attachments: NUTCH-1911_joyce_9Nov2015.patch, 
> NUTCH-1911_joyce_9Nov2015.patch
>
>
> The DomainStatistics tool could be improved based on the comments raised 
> in [this mail 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html]
> For convenience, I've also pasted them below
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the 
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work





[jira] [Resolved] (NUTCH-1911) Improve DomainStatistics tool command line parsing

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-1911.
--
Resolution: Fixed

Resolved in r1713890

> Improve DomainStatistics tool command line parsing
> --
>
> Key: NUTCH-1911
> URL: https://issues.apache.org/jira/browse/NUTCH-1911
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Michael Joyce
>Priority: Trivial
> Fix For: 1.11, 1.10
>
> Attachments: NUTCH-1911_joyce_9Nov2015.patch, 
> NUTCH-1911_joyce_9Nov2015.patch
>
>
> The DomainStatistics tool could be improved based on the comments raised 
> in [this mail 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html]
> For convenience, I've also pasted them below
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the 
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work





[jira] [Work started] (NUTCH-2150) Add ProtocolStatus Utility

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2150 started by Michael Joyce.

> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
> Attachments: NUTCH-2015_joyce_9Nov2015.patch
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Work started] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2155 started by Michael Joyce.

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2155_joyce_9Nov2015.patch
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[jira] [Created] (NUTCH-2166) Add reverse URL format to dump tool

2015-11-11 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2166:


 Summary: Add reverse URL format to dump tool
 Key: NUTCH-2166
 URL: https://issues.apache.org/jira/browse/NUTCH-2166
 Project: Nutch
  Issue Type: Improvement
  Components: tool
Affects Versions: 1.10, 2.3
Reporter: Michael Joyce
Assignee: Michael Joyce
 Fix For: 2.4, 1.11


Update the FileDumper tool with an option for dumping files to the output 
directory in reverse URL format.

So the file for 
http://bar.foo.com:8983/to/index.html?a=b

Would dump to
/com/foo/bar/8983/http/to/index.html?a=b
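
The mapping above can be sketched with java.net.URI as follows. This is a 
minimal illustration, not the patch: the class and method names are 
hypothetical, and leaving the port segment out when the URL has no explicit 
port is an assumption.

```java
import java.net.URI;

/** Hypothetical sketch; not the FileDumper source. */
class ReverseUrlPath {

    /** Builds /reversed/host/port/scheme/path?query from a URL string. */
    static String toDumpPath(String url) {
        URI uri = URI.create(url);
        StringBuilder sb = new StringBuilder();

        // Host labels in reverse order: bar.foo.com -> /com/foo/bar
        String[] hostParts = uri.getHost().split("\\.");
        for (int i = hostParts.length - 1; i >= 0; i--) {
            sb.append('/').append(hostParts[i]);
        }

        // Port only when the URL names one explicitly (assumption).
        if (uri.getPort() != -1) {
            sb.append('/').append(uri.getPort());
        }

        sb.append('/').append(uri.getScheme());
        sb.append(uri.getRawPath());
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        return sb.toString();
    }
}
```

Calling toDumpPath("http://bar.foo.com:8983/to/index.html?a=b") returns 
"/com/foo/bar/8983/http/to/index.html?a=b", matching the example in this issue.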





[jira] [Work started] (NUTCH-2166) Add reverse URL format to dump tool

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2166 started by Michael Joyce.

> Add reverse URL format to dump tool
> ---
>
> Key: NUTCH-2166
> URL: https://issues.apache.org/jira/browse/NUTCH-2166
> Project: Nutch
>  Issue Type: Improvement
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> Update the FileDumper tool with an option for dumping files to the output 
> directory in reverse URL format.
> So the file for 
> http://bar.foo.com:8983/to/index.html?a=b
> Would dump to
> /com/foo/bar/8983/http/to/index.html?a=b





[jira] [Created] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-11 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2165:


 Summary: FileDumper Util hard codes part-# folder name
 Key: NUTCH-2165
 URL: https://issues.apache.org/jira/browse/NUTCH-2165
 Project: Nutch
  Issue Type: Bug
  Components: tool
Affects Versions: 1.10, 2.3
Reporter: Michael Joyce
 Fix For: 2.4, 1.11


Hi folks, [~lewismc] and I were just discussing this off list. It seems that 
the part-# folders are hard coded to part-0 in the [FileDumper 
utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167]
 which could prove problematic.
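
As a sketch of the alternative, the part-# folders could be discovered at run 
time instead of assumed. The snippet below lists every entry whose name starts 
with "part-" under a given directory. The class and method names are 
hypothetical and this is not the committed fix.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;

/** Hypothetical sketch; not the FileDumper source. */
class PartDirs {

    /** Returns the part-* entries under dir, sorted by name; empty if none. */
    static File[] listParts(File dir) {
        File[] parts = dir.listFiles(new FilenameFilter() {
            public boolean accept(File parent, String name) {
                return name.startsWith("part-");
            }
        });
        if (parts == null) {
            // listFiles returns null when dir is missing or not a directory.
            return new File[0];
        }
        Arrays.sort(parts);
        return parts;
    }
}
```

Iterating over the returned array handles segments written with more than one 
part, and the null check avoids a NullPointerException when the directory does 
not exist.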





[jira] [Created] (NUTCH-2167) Backport TableUtil from 2.x for URL reversing

2015-11-11 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2167:


 Summary: Backport TableUtil from 2.x for URL reversing
 Key: NUTCH-2167
 URL: https://issues.apache.org/jira/browse/NUTCH-2167
 Project: Nutch
  Issue Type: Sub-task
  Components: tool
Affects Versions: 1.10
Reporter: Michael Joyce
Assignee: Michael Joyce
 Fix For: 1.11


The 
[TableUtil|https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/TableUtil.java]
 file provides a number of helpful utility functions for URL reversing that 
would be useful to have in 1.x.





[jira] [Work started] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2165 started by Michael Joyce.

> FileDumper Util hard codes part-# folder name
> -
>
> Key: NUTCH-2165
> URL: https://issues.apache.org/jira/browse/NUTCH-2165
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. The part-# 
> folder name appears to be hard-coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167],
>  which could prove problematic.





[jira] [Assigned] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce reassigned NUTCH-2165:


Assignee: Michael Joyce

> FileDumper Util hard codes part-# folder name
> -
>
> Key: NUTCH-2165
> URL: https://issues.apache.org/jira/browse/NUTCH-2165
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. The part-# 
> folder name appears to be hard-coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167],
>  which could prove problematic.





[jira] [Commented] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-11 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000910#comment-15000910
 ] 

Michael Joyce commented on NUTCH-2165:
--

Oh aye

> FileDumper Util hard codes part-# folder name
> -
>
> Key: NUTCH-2165
> URL: https://issues.apache.org/jira/browse/NUTCH-2165
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. The part-# 
> folder name appears to be hard-coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167],
>  which could prove problematic.





[jira] [Updated] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-11 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2165:
-
Attachment: NUTCH-2165_joyce_11Nov2015.patch

Patch attached

> FileDumper Util hard codes part-# folder name
> -
>
> Key: NUTCH-2165
> URL: https://issues.apache.org/jira/browse/NUTCH-2165
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
> Attachments: NUTCH-2165_joyce_11Nov2015.patch
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. The part-# 
> folder name appears to be hard-coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167],
>  which could prove problematic.





[jira] [Commented] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-11 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000923#comment-15000923
 ] 

Michael Joyce commented on NUTCH-2165:
--

Note, the diff looks massive here. This is really just adding an extra loop 
over the parts directories in each segment directory. The tool could probably 
use a bit of cleanup love, but we can address that in a later patch.
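
The extra loop described above could look roughly like the following. This is a sketch under assumptions, not the attached patch: it uses java.io.File instead of the Hadoop file system API, the class and method names are hypothetical, and it assumes the part directories are simply the subdirectories whose names start with "part-".

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PartDirs {
    /** Returns every part-* subdirectory of dir, sorted by name. */
    public static List<File> listPartDirs(File dir) {
        // listFiles returns null when dir does not exist or is not a directory.
        File[] found = dir.listFiles(
            f -> f.isDirectory() && f.getName().startsWith("part-"));
        List<File> parts = new ArrayList<>();
        if (found != null) {
            parts.addAll(Arrays.asList(found));
        }
        parts.sort(Comparator.comparing(File::getName));
        return parts;
    }

    public static void main(String[] args) {
        // Instead of opening one fixed part-0 folder, visit each part in turn.
        for (File part : listPartDirs(new File(args[0]))) {
            System.out.println("would dump from " + part.getPath());
        }
    }
}
```

Enumerating the directories, rather than building one fixed path, means a segment written with several part folders is dumped in full.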

> FileDumper Util hard codes part-# folder name
> -
>
> Key: NUTCH-2165
> URL: https://issues.apache.org/jira/browse/NUTCH-2165
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
> Attachments: NUTCH-2165_joyce_11Nov2015.patch
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. The part-# 
> folder name appears to be hard-coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167],
>  which could prove problematic.





[jira] [Assigned] (NUTCH-2150) Add ProtocolStatus Utility

2015-11-09 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce reassigned NUTCH-2150:


Assignee: Michael Joyce  (was: Chris A. Mattmann)

> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Assigned] (NUTCH-1911) Improve DomainStatistics tool command line parsing

2015-11-09 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce reassigned NUTCH-1911:


Assignee: Michael Joyce  (was: Chris A. Mattmann)

> Improve DomainStatistics tool command line parsing
> --
>
> Key: NUTCH-1911
> URL: https://issues.apache.org/jira/browse/NUTCH-1911
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Michael Joyce
>Priority: Trivial
> Fix For: 1.10
>
>
> The DomainStatistics tool could be improved based on the comments raised 
> in [this mail 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html].
> For convenience, I've also pasted them below:
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the 
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work





[jira] [Updated] (NUTCH-1911) Improve DomainStatistics tool command line parsing

2015-11-09 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-1911:
-
Summary: Improve DomainStatistics tool command line parsing  (was: Imeprove 
DomainStatistics tool command line parsing)

> Improve DomainStatistics tool command line parsing
> --
>
> Key: NUTCH-1911
> URL: https://issues.apache.org/jira/browse/NUTCH-1911
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.10
>
>
> The DomainStatistics tool could be improved based on the comments raised 
> in [this mail 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html].
> For convenience, I've also pasted them below:
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the 
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work





[jira] [Assigned] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-09 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce reassigned NUTCH-2155:


Assignee: Michael Joyce  (was: Chris A. Mattmann)

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>  Labels: memex
> Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[jira] [Updated] (NUTCH-1911) Improve DomainStatistics tool command line parsing

2015-11-09 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-1911:
-
Fix Version/s: 1.10

> Improve DomainStatistics tool command line parsing
> --
>
> Key: NUTCH-1911
> URL: https://issues.apache.org/jira/browse/NUTCH-1911
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Michael Joyce
>Priority: Trivial
> Fix For: 1.10, 1.11
>
>
> The DomainStatistics tool could be improved based on the comments raised 
> in [this mail 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html].
> For convenience, I've also pasted them below:
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the 
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work





[jira] [Updated] (NUTCH-2150) Add ProtocolStatus Utility

2015-11-09 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2150:
-
Attachment: NUTCH-2015_joyce_9Nov2015.patch

Patch attached to clean up help formatting and drop need for "current" in 
crawldb path
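
One way to drop the need for "current" is to accept either form of the path and normalize it. The sketch below is an assumption about the approach, not the contents of the attached patch; CrawlDbPath and resolveCurrent are hypothetical names, and it uses java.io.File rather than the Hadoop Path type.

```java
import java.io.File;

public class CrawlDbPath {
    /**
     * Accepts either the crawldb directory or its "current" subdirectory
     * and returns the "current" subdirectory in both cases.
     */
    public static File resolveCurrent(File crawlDb) {
        // File drops a trailing separator, so "crawldb/current/" also matches.
        if ("current".equals(crawlDb.getName())) {
            return crawlDb;
        }
        return new File(crawlDb, "current");
    }

    public static void main(String[] args) {
        System.out.println(resolveCurrent(new File("crawl/crawldb")));
        System.out.println(resolveCurrent(new File("crawl/crawldb/current/")));
    }
}
```

With this kind of normalization, both crawl/crawldb and crawl/crawldb/current/ resolve to the same directory, so existing invocations keep working.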

> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
> Attachments: NUTCH-2015_joyce_9Nov2015.patch
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Updated] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-09 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2155:
-
Attachment: NUTCH-2155_joyce_9Nov2015.patch

Patch attached to address "current" requirements in crawldb path and add more 
helpful "help" info.

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2155_joyce_9Nov2015.patch
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[jira] [Updated] (NUTCH-1911) Improve DomainStatistics tool command line parsing

2015-11-09 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-1911:
-
Attachment: NUTCH-1911_joyce_9Nov2015.patch

Attached a more recent patch that removes the requirement for the "current" folder.

> Improve DomainStatistics tool command line parsing
> --
>
> Key: NUTCH-1911
> URL: https://issues.apache.org/jira/browse/NUTCH-1911
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Michael Joyce
>Priority: Trivial
> Fix For: 1.10, 1.11
>
> Attachments: NUTCH-1911_joyce_9Nov2015.patch, 
> NUTCH-1911_joyce_9Nov2015.patch
>
>
> The DomainStatistics tool could be improved based on the comments raised 
> in [this mail 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html].
> For convenience, I've also pasted them below:
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the 
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work





[jira] [Updated] (NUTCH-1911) Improve DomainStatistics tool command line parsing

2015-11-09 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-1911:
-
Fix Version/s: (was: 1.10)
   1.11

> Improve DomainStatistics tool command line parsing
> --
>
> Key: NUTCH-1911
> URL: https://issues.apache.org/jira/browse/NUTCH-1911
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Michael Joyce
>Priority: Trivial
> Fix For: 1.11
>
>
> The DomainStatistics tool could be improved based on the comments raised 
> in [this mail 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html].
> For convenience, I've also pasted them below:
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the 
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work





[jira] [Updated] (NUTCH-1911) Improve DomainStatistics tool command line parsing

2015-11-09 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-1911:
-
Attachment: NUTCH-1911_joyce_9Nov2015.patch

Going to resubmit the attached patch to get these changes back in the code base.

> Improve DomainStatistics tool command line parsing
> --
>
> Key: NUTCH-1911
> URL: https://issues.apache.org/jira/browse/NUTCH-1911
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Michael Joyce
>Priority: Trivial
> Fix For: 1.10, 1.11
>
> Attachments: NUTCH-1911_joyce_9Nov2015.patch
>
>
> The DomainStatistics tool could be improved based on the comments raised 
> in [this mail 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html].
> For convenience, I've also pasted them below:
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the 
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work





[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-11-02 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985431#comment-14985431
 ] 

Michael Joyce commented on NUTCH-2155:
--

+1 sounds good to me [~sebastien0], I will update it in a patch shortly

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[jira] [Commented] (NUTCH-2150) Add ProtocolStatus Utility

2015-11-02 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985427#comment-14985427
 ] 

Michael Joyce commented on NUTCH-2150:
--

Yes, will address in a patch shortly.

> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Commented] (NUTCH-1911) Imeprove DomainStatistics tool command line parsing

2015-11-02 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985436#comment-14985436
 ] 

Michael Joyce commented on NUTCH-1911:
--

Hrm odd, I want to throw some commons-cli at a few of the utilities anyway so 
I'll just address this there.

> Imeprove DomainStatistics tool command line parsing
> ---
>
> Key: NUTCH-1911
> URL: https://issues.apache.org/jira/browse/NUTCH-1911
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.10
>
>
> The DomainStatistics tool could be improved based on the comments raised 
> in [this mail 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html].
> For convenience, I've also pasted them below:
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the 
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work





[jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2155:


 Summary: Create a "crawl completeness" utility
 Key: NUTCH-2155
 URL: https://issues.apache.org/jira/browse/NUTCH-2155
 Project: Nutch
  Issue Type: Improvement
  Components: util
Affects Versions: 1.10
Reporter: Michael Joyce
 Fix For: 1.12


I've found it useful to have a tool for dumping some "completeness" information 
from a crawl similar to how domainstats does but including fetched and 
unfetched counts per domain/host. This is especially nice when doing vertical 
crawls over a few domains or just to see how much of a host/domain you've 
covered with your crawl so far.
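
To make the idea concrete, here is a minimal sketch of the per-host tally. It is an illustration under assumptions, not the utility itself: the input is a plain map from URL to a fetched flag rather than a real crawldb, grouping is by host only (not by domain), and the class and method names are hypothetical.

```java
import java.net.URI;
import java.util.Map;
import java.util.TreeMap;

public class HostCompleteness {
    /**
     * Groups URLs by host. In each returned array, index 0 holds the
     * fetched count and index 1 holds the unfetched count.
     */
    public static Map<String, int[]> countByHost(Map<String, Boolean> urlFetched) {
        Map<String, int[]> byHost = new TreeMap<>();
        for (Map.Entry<String, Boolean> e : urlFetched.entrySet()) {
            String host = URI.create(e.getKey()).getHost();
            if (host == null) {
                continue; // skip URLs without a parsable host
            }
            int[] counts = byHost.computeIfAbsent(host, h -> new int[2]);
            counts[e.getValue() ? 0 : 1]++;
        }
        return byHost;
    }

    public static void main(String[] args) {
        Map<String, Boolean> urls = new TreeMap<>();
        urls.put("http://a.example.com/1", true);
        urls.put("http://a.example.com/2", false);
        countByHost(urls).forEach((host, c) ->
            System.out.println(host + "\tfetched=" + c[0] + "\tunfetched=" + c[1]));
    }
}
```

Comparing the two counts for a host gives a rough picture of how much of that host the crawl has covered so far.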





[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility

2015-10-28 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14979196#comment-14979196
 ] 

Michael Joyce commented on NUTCH-2155:
--

Should have a first patch up shortly for review folks

> Create a "crawl completeness" utility
> -
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
> Fix For: 1.12
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl similar to how domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains or just to see how much of a 
> host/domain you've covered with your crawl so far.





[jira] [Created] (NUTCH-2150) Add ProtocolStatus Utility

2015-10-27 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2150:


 Summary: Add ProtocolStatus Utility
 Key: NUTCH-2150
 URL: https://issues.apache.org/jira/browse/NUTCH-2150
 Project: Nutch
  Issue Type: Improvement
  Components: util
Affects Versions: 1.10
Reporter: Michael Joyce
 Fix For: 1.12


It would be nice to have a utility for dumping protocol status code information 
for a crawl database. This will be a utility for getting a dump of the protocol 
status codes that builds off of NUTCH-2129





[jira] [Commented] (NUTCH-2150) Add ProtocolStatus Utility

2015-10-27 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977036#comment-14977036
 ] 

Michael Joyce commented on NUTCH-2150:
--

Hi folks,

PR is up for this. You can run the util with something similar to the following

{code}
./bin/nutch protocolstats crawl/crawldb/current/ ./output
{code}

And that will get you something along the lines of

{code}
38  200
19  301
2   302
665 UNFETCHED
{code}

Let me know if you have any questions!
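
For readers who want to see how counts like the ones above come about, here is a minimal sketch of the tally. It is not the utility's code: it works on an in-memory list of status strings rather than a crawldb, the names are hypothetical, and treating a missing status as UNFETCHED is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StatusTally {
    /** Counts how often each status occurs; null or empty counts as UNFETCHED. */
    public static Map<String, Integer> tally(List<String> statuses) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String s : statuses) {
            String key = (s == null || s.isEmpty()) ? "UNFETCHED" : s;
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = tally(Arrays.asList("200", "301", "200", null));
        // Print one "count<TAB>status" line per status, as in the sample output.
        counts.forEach((status, n) -> System.out.println(n + "\t" + status));
    }
}
```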

> Add ProtocolStatus Utility
> --
>
> Key: NUTCH-2150
> URL: https://issues.apache.org/jira/browse/NUTCH-2150
> Project: Nutch
>  Issue Type: Improvement
>  Components: util
>Affects Versions: 1.10
>Reporter: Michael Joyce
> Fix For: 1.12
>
>
> It would be nice to have a utility for dumping protocol status code 
> information for a crawl database. This will be a utility for getting a dump 
> of the protocol status codes that builds off of NUTCH-2129





[jira] [Commented] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content

2015-10-15 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959659#comment-14959659
 ] 

Michael Joyce commented on NUTCH-2141:
--

Cool, makes sense. Do you have any examples? I'd like to poke at it too. You're 
going to need to handle the screenshot functionality differently as well. 
getHTMLContent does more than just return the body content. We probably don't 
really need the DefalultMultiInteractionHandler example either if this 
basically replaces that. [~asitang] might have some ideas as well.

> Change the InteractiveSelenium plugin handler Interface to return page content
> --
>
> Key: NUTCH-2141
> URL: https://issues.apache.org/jira/browse/NUTCH-2141
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Balaji Gurumurthy
>  Labels: selenium
>
> The handler interface in the protocol-interactiveselenium plugin currently 
> provides methods to manipulate the page content, and the HTTPResponse class 
> reads the page content from the driver. This limits the amount of HTML 
> content that could be returned to Nutch.
> The processDriver method could return a String object instead. This is 
> particularly helpful  in cases such as handling pagination when multiple 
> pages' content can be appended and returned from the handler. 





[jira] [Commented] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content

2015-10-15 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959345#comment-14959345
 ] 

Michael Joyce commented on NUTCH-2141:
--

This was actually brought up in NUTCH-2108. There's also an [example handler | 
https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefalultMultiInteractionHandler.java]
 that was added to illustrate that as well. The handler won't actually be run 
multiple times, so if you need to return concatenated content you need to do it 
in the handler and make sure it's returned appropriately.

> Change the InteractiveSelenium plugin handler Interface to return page content
> --
>
> Key: NUTCH-2141
> URL: https://issues.apache.org/jira/browse/NUTCH-2141
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Balaji Gurumurthy
>  Labels: selenium
>
> The handler interface in the protocol-interactiveselenium plugin currently 
> provides methods to manipulate the page content, and the HTTPResponse class 
> reads the page content from the driver. This limits the amount of HTML 
> content that could be returned to Nutch.
> The processDriver method could return a String object instead. This is 
> particularly helpful  in cases such as handling pagination when multiple 
> pages' content can be appended and returned from the handler. 





[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-07 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947002#comment-14947002
 ] 

Michael Joyce commented on NUTCH-2129:
--

Fixed the unnecessary init that [~jnioche] caught. Thanks much for the reviews.

> Track Protocol Status in Crawl Datum
> 
>
> Key: NUTCH-2129
> URL: https://issues.apache.org/jira/browse/NUTCH-2129
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> It's become necessary on a few crawls that I run to get protocol status code 
> stats. After speaking with [~lewismc] it seemed that there might not be a 
> super convenient way of doing this as is, but it would be great to be able to 
> add the functionality necessary to pull this information out.





[jira] [Created] (NUTCH-2133) Transfer Selenium Documentation to WIki

2015-10-06 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2133:


 Summary: Transfer Selenium Documentation to WIki
 Key: NUTCH-2133
 URL: https://issues.apache.org/jira/browse/NUTCH-2133
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.10, 2.3
Reporter: Michael Joyce
 Fix For: 2.4, 1.11


There's a decent chunk of Selenium-related documentation stuck in READMEs for 
various plugins. It would be nice to get this stuff pushed to the wiki.

E.G.: 
https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-selenium/README.md





[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-06 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945766#comment-14945766
 ] 

Michael Joyce commented on NUTCH-2129:
--

Hey folks, updated PR with the metadata approach for HTTP and FTP. Let me know 
if you have any concerns.

> Track Protocol Status in Crawl Datum
> 
>
> Key: NUTCH-2129
> URL: https://issues.apache.org/jira/browse/NUTCH-2129
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> It's become necessary on a few crawls that I run to get protocol status code 
> stats. After speaking with [~lewismc] it seemed that there might not be a 
> super convenient way of doing this as is, but it would be great to be able to 
> add the functionality necessary to pull this information out.





[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-01 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939939#comment-14939939
 ] 

Michael Joyce commented on NUTCH-2129:
--

Thanks Julien. I figured there would probably be a few thoughts on this, so I 
appreciate the feedback. I'll checkout the stuff you mentioned. Thanks for the 
ideas.

> Track Protocol Status in Crawl Datum
> 
>
> Key: NUTCH-2129
> URL: https://issues.apache.org/jira/browse/NUTCH-2129
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> It's become necessary on a few crawls that I run to get protocol status code 
> stats. After speaking with [~lewismc] it seemed that there might not be a 
> super convenient way of doing this as is, but it would be great to be able to 
> add the functionality necessary to pull this information out.





[jira] [Commented] (NUTCH-2108) Add a function to the selenium interactive plugin interface to do multiple manipulation of driver and then return the data

2015-10-01 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940036#comment-14940036
 ] 

Michael Joyce commented on NUTCH-2108:
--

Good stuff [~asitang], glad to see the workaround proved fruitful and great 
example handlers!

> Add a function to the selenium interactive plugin interface to do multiple 
> manipulation of driver and then return the data
> --
>
> Key: NUTCH-2108
> URL: https://issues.apache.org/jira/browse/NUTCH-2108
> Project: Nutch
>  Issue Type: Sub-task
>  Components: fetcher
>Affects Versions: 1.10
>Reporter: Asitang Mishra
>  Labels: memex
>
> In the interactive selenium plugin we have to create handler classes for each 
> manipulation of a page. Sometimes we need to manipulate a page in many ways 
> and keep track of those manipulations, for example clicking on each link in a 
> table and then refreshing to get the original page back, since even one click 
> can make all other links go away. This can be done in a single loop, whereas 
> doing it with multiple handlers would be more work and far more complicated. 
> So, I am proposing a new function 
> "String multiProcessDriver(WebDriver driver)" that takes the driver and 
> returns a concatenated String, alongside the already present 
> "void processDriver(WebDriver driver)".





[jira] [Created] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-09-30 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2129:


 Summary: Track Protocol Status in Crawl Datum
 Key: NUTCH-2129
 URL: https://issues.apache.org/jira/browse/NUTCH-2129
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10, 2.3
Reporter: Michael Joyce
 Fix For: 2.4, 1.11


It's become necessary on a few crawls that I run to get protocol status code 
stats. After speaking with [~lewismc] it seemed that there might not be a super 
convenient way of doing this as is, but it would be great to be able to add the 
functionality necessary to pull this information out.
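As a rough illustration of the kind of stats meant here, the sketch below tallies status codes over a list of fetched records. It assumes the code is already available per URL; the `FetchRecord` type and its fields are hypothetical and are not part of Nutch's CrawlDatum.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical record: a URL plus the protocol status code seen when fetching it. */
final class FetchRecord {
  final String url;
  final int statusCode;

  FetchRecord(String url, int statusCode) {
    this.url = url;
    this.statusCode = statusCode;
  }
}

final class StatusCodeStats {
  /** Counts how many records carry each status code, ordered by code. */
  static Map<Integer, Integer> tally(List<FetchRecord> records) {
    Map<Integer, Integer> counts = new TreeMap<>();
    for (FetchRecord r : records) {
      counts.merge(r.statusCode, 1, Integer::sum);
    }
    return counts;
  }
}
```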




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-09-30 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939124#comment-14939124
 ] 

Michael Joyce commented on NUTCH-2129:
--

Hi folks,

Initial pull request up to address this. Note that at the moment this only 
includes updates such that HTTP status codes are saved. I figured it would be 
best to get a conversation started on this before I dive into it too much since 
it's a rather core data structure.

Thoughts or ideas on this folks?

> Track Protocol Status in Crawl Datum
> 
>
> Key: NUTCH-2129
> URL: https://issues.apache.org/jira/browse/NUTCH-2129
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> It's become necessary on a few crawls that I run to get protocol status code 
> stats. After speaking with [~lewismc] it seemed that there might not be a 
> super convenient way of doing this as is, but it would be great to be able to 
> add the functionality necessary to pull this information out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2115) Add total counts to dump stats

2015-09-23 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2115:


 Summary: Add total counts to dump stats
 Key: NUTCH-2115
 URL: https://issues.apache.org/jira/browse/NUTCH-2115
 Project: Nutch
  Issue Type: Improvement
  Components: dumpers, util
Affects Versions: 1.10
Reporter: Michael Joyce
Priority: Minor
 Fix For: 1.11


It would be nice if the "dump" tool included total counts for the mimetype 
stats that it gives. Something along the lines of the following would be great 
when you have to deal with some larger crawls and don't want to bother doing 
the math yourself.

{code}
Dumper File Stats: 
TOTAL Stats:
[
{"mimeType":"application/xhtml+xml","count":"2"}
{"mimeType":"application/octet-stream","count":"1"}
{"mimeType":"text/html","count":"23"}
]
Total count: 26

FILTERED Stats:
[
{"mimeType":"text/html","count":"23"}
]
Total filtered count: 23
{code}
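A minimal sketch of how such totals could be produced, assuming the per-mimetype counts are already held in a map. The class and method names are hypothetical and this is not the dump tool's actual code.

```java
import java.util.Map;

final class DumpStatsTotals {
  /** Sums the per-mimetype counts to produce the total. */
  static int total(Map<String, Integer> countsByMimeType) {
    int sum = 0;
    for (int c : countsByMimeType.values()) {
      sum += c;
    }
    return sum;
  }

  /** Renders one stats block in the shape shown in the ticket, ending with the total line. */
  static String render(String heading, String totalLabel, Map<String, Integer> counts) {
    StringBuilder sb = new StringBuilder();
    sb.append(heading).append(" Stats:\n[\n");
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      sb.append("{\"mimeType\":\"").append(e.getKey())
        .append("\",\"count\":\"").append(e.getValue()).append("\"}\n");
    }
    sb.append("]\n").append(totalLabel).append(": ").append(total(counts)).append('\n');
    return sb.toString();
  }
}
```

Passing an insertion-ordered map (for example a `LinkedHashMap`) keeps the mimetype lines in the order they were counted.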



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2115) Add total counts to dump stats

2015-09-23 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905156#comment-14905156
 ] 

Michael Joyce commented on NUTCH-2115:
--

Cheers [~lewismc], thanks for the quick merge!

> Add total counts to dump stats
> --
>
> Key: NUTCH-2115
> URL: https://issues.apache.org/jira/browse/NUTCH-2115
> Project: Nutch
>  Issue Type: Improvement
>  Components: dumpers, util
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
>Priority: Minor
> Fix For: 1.11
>
>
> It would be nice if the "dump" tool included total counts for the mimetype 
> stats that it gives. Something along the lines of the following would be 
> great when you have to deal with some larger crawls and don't want to bother 
> doing the math yourself.
> {code}
> Dumper File Stats: 
> TOTAL Stats:
> [
> {"mimeType":"application/xhtml+xml","count":"2"}
> {"mimeType":"application/octet-stream","count":"1"}
> {"mimeType":"text/html","count":"23"}
> ]
> Total count: 26
> FILTERED Stats:
> [
> {"mimeType":"text/html","count":"23"}
> ]
> Total filtered count: 23
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2077) Upgrade to Tika 1.10

2015-08-28 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720279#comment-14720279
 ] 

Michael Joyce commented on NUTCH-2077:
--

Hey folks, updated tika to 1.10. If there was other stuff this ticket was 
hoping to address let me know and I'll update the patch.

 Upgrade to Tika 1.10
 

 Key: NUTCH-2077
 URL: https://issues.apache.org/jira/browse/NUTCH-2077
 Project: Nutch
  Issue Type: Improvement
Reporter: Tyler Palsulich





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2088) Add Optional Execution to Interactive Selenium Handlers

2015-08-28 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2088:


 Summary: Add Optional Execution to Interactive Selenium Handlers
 Key: NUTCH-2088
 URL: https://issues.apache.org/jira/browse/NUTCH-2088
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.10
Reporter: Michael Joyce
 Fix For: 1.11


At the moment, all the Handlers run for every URL when using the 
interactive-selenium plugin. Often times when trying to do a deep crawl of a 
site you'll want to handle various subdomains and paths/files differently. You 
can effectively filter in the handlers at the moment, but only once you've 
loaded the WebDriver and incurred the associated overhead. It would be much 
nicer if the handler interface allowed for this check to occur prior to the 
request to retrieve page content.
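One way the check described above could be shaped is sketched below. The interface, method, and class names are assumptions made for illustration and are not taken from the plugin; the page interaction is reduced to a plain string so the example runs without Selenium.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical handler contract with a cheap pre-check. */
interface OptionalHandler {
  boolean shouldProcessUrl(String url);     // decided before any WebDriver is loaded
  String processPage(String pageSource);    // stands in for the WebDriver interaction
}

/** Example handler that only cares about URLs under a given path prefix. */
class PathPrefixHandler implements OptionalHandler {
  private final String pathPrefix;

  PathPrefixHandler(String pathPrefix) {
    this.pathPrefix = pathPrefix;
  }

  @Override
  public boolean shouldProcessUrl(String url) {
    String path = URI.create(url).getPath();
    return path != null && path.startsWith(pathPrefix);
  }

  @Override
  public String processPage(String pageSource) {
    return pageSource; // no-op in this sketch
  }
}

final class HandlerSelector {
  /**
   * Returns the handlers that want this URL. An empty result means the
   * caller can skip loading the driver for this URL altogether.
   */
  static List<OptionalHandler> applicable(String url, List<OptionalHandler> handlers) {
    List<OptionalHandler> selected = new ArrayList<>();
    for (OptionalHandler h : handlers) {
      if (h.shouldProcessUrl(url)) {
        selected.add(h);
      }
    }
    return selected;
  }
}
```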



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2082) Upgrade to Apache Tika 1.10

2015-08-19 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703153#comment-14703153
 ] 

Michael Joyce commented on NUTCH-2082:
--

FYI, this is a duplicate of NUTCH-2077 I think [~lewismc].

 Upgrade to Apache Tika 1.10
 ---

 Key: NUTCH-2082
 URL: https://issues.apache.org/jira/browse/NUTCH-2082
 Project: Nutch
  Issue Type: Improvement
  Components: build, plugin
Affects Versions: 2.3, 1.10
Reporter: Lewis John McGibbney
 Fix For: 1.11, 2.3.1


 Tika 1.10 is hot
 http://search.maven.org/#artifactdetails|org.apache.tika|tika|1.10|pom
 Let's upgrade



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop 2.4 stable

2015-08-18 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701707#comment-14701707
 ] 

Michael Joyce commented on NUTCH-2049:
--

Great stuff Lewis. Builds and runs cleanly locally for me. I also scoped a test 
that was run on EMR with 2.4.0 and all looks good.

 Upgrade Trunk to Hadoop  2.4 stable
 

 Key: NUTCH-2049
 URL: https://issues.apache.org/jira/browse/NUTCH-2049
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
  Labels: memex
 Fix For: 1.11

 Attachments: NUTCH-2049.patch, NUTCH-2049v2.patch


 Convo here - http://www.mail-archive.com/dev%40nutch.apache.org/msg18225.html
 I am +1 for taking trunk (or a branch of trunk) to explicit dependency on  
 Hadoop 2.6.
 We can run our tests, we can validate, we can fix.
 I will be doing validation on 2.X in parallel as this is what I use on my 
 own projects. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop 2.4 stable

2015-08-12 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694210#comment-14694210
 ] 

Michael Joyce commented on NUTCH-2049:
--

Hey [~lewismc],

Tried your patch here. Seems I have to add the following to the ivy.xml file to 
get this to work at all

{code}
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" 
rev="2.4.0" conf="*->default"/>
{code}

Otherwise, I end up getting the following when I try to run a test crawl

{code}
Injector: starting at 2015-08-12 15:04:42
Injector: crawlDb: crawl/crawldb
Injector: urlDir: ../../urls_test
Injector: Converting injected urls to crawl db entries.
Injector: java.io.IOException: Cannot initialize Cluster. Please check your 
configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:75)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:470)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:449)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:832)
at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
at org.apache.nutch.crawl.Injector.run(Injector.java:379)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
{code}

However, after addressing that concern I end up running into the following on 
the test crawl

{code}
java.lang.Exception: java.lang.ClassCastException: 
org.apache.hadoop.io.SequenceFile$Writer$KeyClassOption cannot be cast to 
org.apache.hadoop.io.MapFile$Writer$Option
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.ClassCastException: 
org.apache.hadoop.io.SequenceFile$Writer$KeyClassOption cannot be cast to 
org.apache.hadoop.io.MapFile$Writer$Option
at 
org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(FetcherOutputFormat.java:70)
at 
org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.init(ReduceTask.java:484)
at 
org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-08-12 14:24:39,906 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: 
Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:496)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:532)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:505)
{code}

 Upgrade Trunk to Hadoop  2.4 stable
 

 Key: NUTCH-2049
 URL: https://issues.apache.org/jira/browse/NUTCH-2049
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11

 Attachments: NUTCH-2049.patch


 Convo here - http://www.mail-archive.com/dev%40nutch.apache.org/msg18225.html
 I am +1 for taking trunk (or a branch of trunk) to explicit dependency on  
 Hadoop 2.6.
 We can run our tests, we can validate, we can fix.
 I will be doing validation on 2.X in parallel as this is what I use on my 
 own projects. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-07-29 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646423#comment-14646423
 ] 

Michael Joyce commented on NUTCH-2062:
--

Hi folks,

Is there something I need to do to get this merged? Anything missing from the 
updated pull request? I'm happy to update as needed!

 Add Plugin for interacting with Selenium WebDriver
 --

 Key: NUTCH-2062
 URL: https://issues.apache.org/jira/browse/NUTCH-2062
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.10
Reporter: Michael Joyce
Assignee: Michael Joyce
  Labels: memex
 Fix For: 1.11

 Attachments: NUTCH-2062v2.patch


 The protocol-selenium plugin is great for pulling webpages that dynamically 
 load content. However, I've run into use cases where I need to actively 
 interact with a page in Selenium before it becomes useful. For instance, I 
 may need to paginate through a table to get all results that I'm interested 
 in. This plugin will handle that use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-07-29 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646488#comment-14646488
 ] 

Michael Joyce edited comment on NUTCH-2062 at 7/29/15 5:50 PM:
---

Cheers Chris, responded on the PR.

Also, where are you getting a direct link for the comment from??

Edit:

Never mind, I see where you got it from. I never noticed that before.

https://github.com/apache/nutch/pull/46#issuecomment-126030222



was (Author: mjoyce):
Cheers Chris, responded on the PR.

Also, where are you getting a direct link for the comment from??


 Add Plugin for interacting with Selenium WebDriver
 --

 Key: NUTCH-2062
 URL: https://issues.apache.org/jira/browse/NUTCH-2062
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.10
Reporter: Michael Joyce
Assignee: Chris A. Mattmann
  Labels: memex
 Fix For: 1.11

 Attachments: NUTCH-2062v2.patch


 The protocol-selenium plugin is great for pulling webpages that dynamically 
 load content. However, I've run into use cases where I need to actively 
 interact with a page in Selenium before it becomes useful. For instance, I 
 may need to paginate through a table to get all results that I'm interested 
 in. This plugin will handle that use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-07-29 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646488#comment-14646488
 ] 

Michael Joyce commented on NUTCH-2062:
--

Cheers Chris, responded on the PR.

Also, where are you getting a direct link for the comment from??


 Add Plugin for interacting with Selenium WebDriver
 --

 Key: NUTCH-2062
 URL: https://issues.apache.org/jira/browse/NUTCH-2062
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.10
Reporter: Michael Joyce
Assignee: Chris A. Mattmann
  Labels: memex
 Fix For: 1.11

 Attachments: NUTCH-2062v2.patch


 The protocol-selenium plugin is great for pulling webpages that dynamically 
 load content. However, I've run into use cases where I need to actively 
 interact with a page in Selenium before it becomes useful. For instance, I 
 may need to paginate through a table to get all results that I'm interested 
 in. This plugin will handle that use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml

2015-07-27 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2048:
-
Attachment: NUTCH-2048_Joyce_20150727.patch

Updated the patch to set the sync attribute on retrieve so the lib directory 
should stay clean now.

 parse-tika: fix dependencies in plugin.xml
 --

 Key: NUTCH-2048
 URL: https://issues.apache.org/jira/browse/NUTCH-2048
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.11

 Attachments: NUTCH-2048_Joyce_20150723.patch, 
 NUTCH-2048_Joyce_20150723_2.patch, NUTCH-2048_Joyce_20150727.patch


 Duplicate library dependencies listed in parse-tika's plugin.xml should be 
 cleaned up. There are duplicates, only the version differs, e.g.:
 {noformat}
 tika-parsers-1.7.jar
 tika-parsers-1.8.jar
 {noformat}
 Not critical because libs which are not present should be just ignored.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1936) GSoC 2015 - Move Nutch to Hadoop 2.X

2015-07-24 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641210#comment-14641210
 ] 

Michael Joyce commented on NUTCH-1936:
--

Ah this is absolutely awesome Lewis. Great job on this.

 GSoC 2015 - Move Nutch to Hadoop 2.X
 

 Key: NUTCH-1936
 URL: https://issues.apache.org/jira/browse/NUTCH-1936
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
  Labels: gsoc2015
 Fix For: 2.4, 1.11

 Attachments: NUTCH-1939.patch


 The Nutch PMC 
 [discussed|http://www.mail-archive.com/dev%40nutch.apache.org/msg16250.html] 
 ideas for a good 2015 GSoC project. It appears that porting the (trunk) 
 codebase to [Hadoop 2.X|http://hadoop.apache.org/docs/stable/] seems to be an 
 attractive option and one which would present an excellent learning 
 experience for a summer student.
 A more comprehensive description of this issue should be included within 
 either a mentor-defined project description or a successful student 
 application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml

2015-07-23 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639462#comment-14639462
 ] 

Michael Joyce commented on NUTCH-2048:
--

Alright, hopefully this one is a bit more on track =D

As for plugin dependencies docs, here's a huge +1 from me. I don't know that I'm 
necessarily versed enough in the build to do it myself but it would be great 
to get up on the wiki.

As for Tika upgrades, there's actually a how-to in the parse-tika folder. I 
went through that and ended up with the current patch which seems to have 
addressed the duplicate dependency issues. Given the instructions I'm not 
really certain how we ended up with the duplicates in the first place though. 
Maybe the doc is a recent addition.

{code}
1. Upgrade Tika dependency in trunk/ivy/ivy.xml

2. Upgrade Tika dependency in src/plugin/parse-tika/ivy.xml

3. Upgrade Tika's own dependencies in src/plugin/parse-tika/plugin.xml
   To get the list of dependencies and their versions execute:
   $ ant -f ./build-ivy.xml
   $ ls lib | sed 's/^/  <library name="/g' | sed 's/$/"\/>/g'
{code}

 parse-tika: fix dependencies in plugin.xml
 --

 Key: NUTCH-2048
 URL: https://issues.apache.org/jira/browse/NUTCH-2048
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.11

 Attachments: NUTCH-2048_Joyce_20150723.patch, 
 NUTCH-2048_Joyce_20150723_2.patch


 Duplicate library dependencies listed in parse-tika's plugin.xml should be 
 cleaned up. There are duplicates, only the version differs, e.g.:
 {noformat}
 tika-parsers-1.7.jar
 tika-parsers-1.8.jar
 {noformat}
 Not critical because libs which are not present should be just ignored.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml

2015-07-23 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2048:
-
Attachment: NUTCH-2048_Joyce_20150723_2.patch

Patch #2 up. Explanation to follow shortly

 parse-tika: fix dependencies in plugin.xml
 --

 Key: NUTCH-2048
 URL: https://issues.apache.org/jira/browse/NUTCH-2048
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.11

 Attachments: NUTCH-2048_Joyce_20150723.patch, 
 NUTCH-2048_Joyce_20150723_2.patch


 Duplicate library dependencies listed in parse-tika's plugin.xml should be 
 cleaned up. There are duplicates, only the version differs, e.g.:
 {noformat}
 tika-parsers-1.7.jar
 tika-parsers-1.8.jar
 {noformat}
 Not critical because libs which are not present should be just ignored.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml

2015-07-23 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14639396#comment-14639396
 ] 

Michael Joyce commented on NUTCH-2048:
--

Ah I clearly didn't pay enough attention to this [~wastl-nagel]. I was 
wondering why the heck you didn't just fix it yourself when you opened it ;)

I'll see what I can do

 parse-tika: fix dependencies in plugin.xml
 --

 Key: NUTCH-2048
 URL: https://issues.apache.org/jira/browse/NUTCH-2048
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.11

 Attachments: NUTCH-2048_Joyce_20150723.patch


 Duplicate library dependencies listed in parse-tika's plugin.xml should be 
 cleaned up. There are duplicates, only the version differs, e.g.:
 {noformat}
 tika-parsers-1.7.jar
 tika-parsers-1.8.jar
 {noformat}
 Not critical because libs which are not present should be just ignored.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml

2015-07-23 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2048:
-
Attachment: NUTCH-2048_Joyce_20150723.patch

Quick patch up for this.

 parse-tika: fix dependencies in plugin.xml
 --

 Key: NUTCH-2048
 URL: https://issues.apache.org/jira/browse/NUTCH-2048
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.11

 Attachments: NUTCH-2048_Joyce_20150723.patch


 Duplicate library dependencies listed in parse-tika's plugin.xml should be 
 cleaned up. There are duplicates, only the version differs, e.g.:
 {noformat}
 tika-parsers-1.7.jar
 tika-parsers-1.8.jar
 {noformat}
 Not critical because libs which are not present should be just ignored.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2063) Add -mimeStats flag to FileDumper tool

2015-07-22 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2063:
-
Labels: memex  (was: )

 Add -mimeStats flag to FileDumper tool
 --

 Key: NUTCH-2063
 URL: https://issues.apache.org/jira/browse/NUTCH-2063
 Project: Nutch
  Issue Type: Bug
  Components: dumpers
Affects Versions: 1.10
Reporter: Lewis John McGibbney
Assignee: Michael Joyce
  Labels: memex
 Fix For: 1.11

 Attachments: nutch-2063-joyce-21July2015.patch


 Right now in order to get a MimeType distribution for any given number of 
 segments, one is required to dump some data. This is a waste if one just 
 wishes to see the mime type distribution across a number of segments.
 An improvement to the FileDumper tool would be the addition of a -mimeStats 
 flag which would not attempt to dump any data but instead merely provide the 
 total stats message providing insight into how the FileDumper should be best 
 used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2004) ParseChecker does not handle redirects

2015-07-22 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2004:
-
Labels: memex  (was: )

 ParseChecker does not handle redirects
 --

 Key: NUTCH-2004
 URL: https://issues.apache.org/jira/browse/NUTCH-2004
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Michael Joyce
Assignee: Michael Joyce
Priority: Minor
  Labels: memex
 Fix For: 1.11


 At the moment ParseChecker doesn't handle redirects. If it gets anything but 
 a success status it errors out. It would be nice if it handled redirects a 
 bit more gracefully based on the http.redirects config setting.
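The behaviour asked for here could be sketched as below. It is a minimal illustration, assuming a configurable hop limit; `Fetcher`, `Response`, and `maxRedirects` are hypothetical names and are not ParseChecker's actual code or the exact Nutch property.

```java
final class RedirectFollower {
  /** Hypothetical fetch result: a status code plus an optional redirect target. */
  static final class Response {
    final int status;
    final String location;

    Response(int status, String location) {
      this.status = status;
      this.location = location;
    }
  }

  /** Minimal fetcher abstraction so the sketch runs without network access. */
  interface Fetcher {
    Response fetch(String url);
  }

  /**
   * Follows 3xx responses for at most maxRedirects hops. Returns the URL that
   * finally answered with a 2xx status, or null when the limit is exceeded or
   * a status that is neither success nor redirect is hit.
   */
  static String resolve(String url, Fetcher fetcher, int maxRedirects) {
    String current = url;
    for (int hops = 0; hops <= maxRedirects; hops++) {
      Response r = fetcher.fetch(current);
      if (r.status >= 200 && r.status < 300) {
        return current;
      }
      if (r.status >= 300 && r.status < 400 && r.location != null) {
        current = r.location; // follow the redirect on the next iteration
        continue;
      }
      return null; // error status: give up
    }
    return null; // too many redirects
  }
}
```

With a limit of 0 the first redirect already counts as a failure, which matches the "errors out on anything but success" behaviour the ticket describes as the current state.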



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-07-22 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636958#comment-14636958
 ] 

Michael Joyce commented on NUTCH-2062:
--

Cheers [~lewismc], let me see what I can do with regards to updating the PR 
with these updates.

 Add Plugin for interacting with Selenium WebDriver
 --

 Key: NUTCH-2062
 URL: https://issues.apache.org/jira/browse/NUTCH-2062
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.10
Reporter: Michael Joyce
Assignee: Michael Joyce
 Fix For: 1.11

 Attachments: NUTCH-2062v2.patch


 The protocol-selenium plugin is great for pulling webpages that dynamically 
 load content. However, I've run into use cases where I need to actively 
 interact with a page in Selenium before it becomes useful. For instance, I 
 may need to paginate through a table to get all results that I'm interested 
 in. This plugin will handle that use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-07-21 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635389#comment-14635389
 ] 

Michael Joyce commented on NUTCH-2062:
--

Hi folks,

Just wanted to elaborate a bit on what this does at the moment and what the 
point of it is. This plugin is effectively the protocol-selenium plugin but it 
allows for a handler to interact with the WebDriver before returning the page 
content. Handlers require a simple interface to be implemented. Which 
handler(s) are run is determined by setting the class name of the handler in a 
comma separated list in the config. For each URL, all the handlers are run in 
config-specified order. The resulting content from each driver is appended 
together and returned as the content.
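The flow described above (class names from a comma separated config value, handlers run in order, outputs appended) can be sketched as follows. This is an illustration only: `PageHandler` and the two example handlers are hypothetical, and a plain string stands in for the WebDriver the real plugin passes around.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical handler contract; the real plugin hands over a Selenium WebDriver. */
interface PageHandler {
  String handle(String pageSource);
}

class UpperCaseHandler implements PageHandler {
  @Override
  public String handle(String pageSource) {
    return pageSource.toUpperCase();
  }
}

class LengthHandler implements PageHandler {
  @Override
  public String handle(String pageSource) {
    return "[len=" + pageSource.length() + "]";
  }
}

final class HandlerChain {
  /** Instantiates handlers by class name, preserving the order given in the config value. */
  static List<PageHandler> load(String commaSeparatedClassNames) {
    List<PageHandler> handlers = new ArrayList<>();
    for (String name : commaSeparatedClassNames.split(",")) {
      String trimmed = name.trim();
      if (trimmed.isEmpty()) {
        continue; // tolerate stray commas
      }
      try {
        Class<?> cls = Class.forName(trimmed);
        handlers.add((PageHandler) cls.getDeclaredConstructor().newInstance());
      } catch (ReflectiveOperationException | ClassCastException e) {
        throw new IllegalArgumentException("Cannot load handler: " + trimmed, e);
      }
    }
    return handlers;
  }

  /** Runs every handler in order and appends their outputs. */
  static String run(List<PageHandler> handlers, String pageSource) {
    StringBuilder content = new StringBuilder();
    for (PageHandler h : handlers) {
      content.append(h.handle(pageSource));
    }
    return content.toString();
  }
}
```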

 Add Plugin for interacting with Selenium WebDriver
 --

 Key: NUTCH-2062
 URL: https://issues.apache.org/jira/browse/NUTCH-2062
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.10
Reporter: Michael Joyce
 Fix For: 1.11


 The protocol-selenium plugin is great for pulling webpages that dynamically 
 load content. However, I've run into use cases where I need to actively 
 interact with a page in Selenium before it becomes useful. For instance, I 
 may need to paginate through a table to get all results that I'm interested 
 in. This plugin will handle that use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2063) Add -mimeStats flag to FileDumper tool

2015-07-21 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635706#comment-14635706
 ] 

Michael Joyce commented on NUTCH-2063:
--

Hey [~lewismc], threw a patch up for this. Let me know if you want to change 
something.

 Add -mimeStats flag to FileDumper tool
 --

 Key: NUTCH-2063
 URL: https://issues.apache.org/jira/browse/NUTCH-2063
 Project: Nutch
  Issue Type: Bug
  Components: dumpers
Affects Versions: 1.10
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11

 Attachments: nutch-2063-joyce-21July2015.patch


 Right now in order to get a MimeType distribution for any given number of 
 segments, one is required to dump some data. This is a waste if one just 
 wishes to see the mime type distribution across a number of segments.
 An improvement to the FileDumper tool would be the addition of a -mimeStats 
 flag which would not attempt to dump any data but instead merely provide the 
 total stats message providing insight into how the FileDumper should be best 
 used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2063) Add -mimeStats flag to FileDumper tool

2015-07-21 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-2063:
-
Attachment: nutch-2063-joyce-21July2015.patch

 Add -mimeStats flag to FileDumper tool
 --

 Key: NUTCH-2063
 URL: https://issues.apache.org/jira/browse/NUTCH-2063
 Project: Nutch
  Issue Type: Bug
  Components: dumpers
Affects Versions: 1.10
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11

 Attachments: nutch-2063-joyce-21July2015.patch


 Right now in order to get a MimeType distribution for any given number of 
 segments, one is required to dump some data. This is a waste if one just 
 wishes to see the mime type distribution across a number of segments.
 An improvement to the FileDumper tool would be the addition of a -mimeStats 
 flag which would not attempt to dump any data but instead merely provide the 
 total stats message providing insight into how the FileDumper should be best 
 used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-07-20 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2062:


 Summary: Add Plugin for interacting with Selenium WebDriver
 Key: NUTCH-2062
 URL: https://issues.apache.org/jira/browse/NUTCH-2062
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.10
Reporter: Michael Joyce
 Fix For: 1.11


The protocol-selenium plugin is great for pulling webpages that dynamically 
load content. However, I've run into use cases where I need to actively 
interact with a page in Selenium before it becomes useful. For instance, I may 
need to paginate through a table to get all results that I'm interested in. 
This plugin will handle that use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-07-20 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633731#comment-14633731
 ] 

Michael Joyce commented on NUTCH-2062:
--

Hi folks,

I have a work in progress locally for this. I'm working on making some more 
changes and should hopefully have something useful up soon for feedback.

 Add Plugin for interacting with Selenium WebDriver
 --

 Key: NUTCH-2062
 URL: https://issues.apache.org/jira/browse/NUTCH-2062
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.10
Reporter: Michael Joyce
 Fix For: 1.11


 The protocol-selenium plugin is great for pulling webpages that dynamically 
 load content. However, I've run into use cases where I need to actively 
 interact with a page in Selenium before it becomes useful. For instance, I 
 may need to paginate through a table to get all results that I'm interested 
 in. This plugin will handle that use case.





[jira] [Commented] (NUTCH-1504) Pluggable url partitioner

2015-06-24 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599958#comment-14599958
 ] 

Michael Joyce commented on NUTCH-1504:
--

This is great stuff [~lewismc], we definitely need to get this in there. Would 
help us out a great deal.

 Pluggable url partitioner
 -

 Key: NUTCH-1504
 URL: https://issues.apache.org/jira/browse/NUTCH-1504
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.6
Reporter: Sourajit Basak
Assignee: Lewis John McGibbney
 Fix For: 1.11

 Attachments: custom.partitioner.patch


 At present, the URL partitioning logic is hard-wired inside Nutch core. It 
 should be pluggable, like FetchSchedule, and customizable via nutch-site.xml.
 There might be use cases where a single domain needs to be partitioned by some 
 custom logic. The existing UrlPartitioner cannot handle such cases, hence the 
 requirement.
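
 A rough sketch of what "pluggable" could mean here, assuming the partitioner is looked up by name from configuration. The property name `partitioner.impl`, the registry, and both partitioner functions are hypothetical and are not taken from the attached patch.

```python
import hashlib
from urllib.parse import urlsplit


def _stable_bucket(key, num_partitions):
    # Use a digest rather than hash(), which is randomized per process.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def partition_by_host(url, num_partitions):
    """Default behaviour: every URL of a host lands in the same partition."""
    return _stable_bucket(urlsplit(url).hostname or "", num_partitions)


def partition_by_first_path_segment(url, num_partitions):
    """Custom logic: split a single domain by its first path segment."""
    parts = urlsplit(url)
    segment = parts.path.strip("/").split("/")[0]
    return _stable_bucket(f"{parts.hostname}/{segment}", num_partitions)


# Hypothetical registry standing in for a plugin lookup.
PARTITIONERS = {
    "host": partition_by_host,
    "path-segment": partition_by_first_path_segment,
}


def load_partitioner(config):
    # "partitioner.impl" is an assumed property name for illustration only.
    return PARTITIONERS[config.get("partitioner.impl", "host")]
```

 With something along these lines, the generator would call whichever function the configuration selects instead of a fixed implementation.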





[jira] [Commented] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time

2015-06-22 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596832#comment-14596832
 ] 

Michael Joyce commented on NUTCH-2045:
--

+1 this is great

 index-basic incorrect assignment of next fetch time (page.getFetchTime()) as 
 page fetch time
 

 Key: NUTCH-2045
 URL: https://issues.apache.org/jira/browse/NUTCH-2045
 Project: Nutch
  Issue Type: Bug
  Components: plugin
Affects Versions: 2.3, 1.10
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.11, 2.3.1

 Attachments: NUTCH-2045.patch


 The issue here was flagged up when using the indexer-elastic plugin, where the 
 page fetch time is incorrectly assigned as the NEXT fetch time as opposed to 
 the time at which the page was actually fetched (prevFetchTime).
 The ML thread for this issue can be found below
 http://www.mail-archive.com/user%40nutch.apache.org/msg13661.html
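
 To make the distinction concrete, here is a small sketch assuming a page record that carries both timestamps. `PageRecord` and `index_timestamp` are hypothetical names, and the field names only mirror the ones mentioned above.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    prev_fetch_time: int  # when the page was actually fetched (epoch millis)
    fetch_time: int       # when the page is next scheduled to be fetched


def index_timestamp(page):
    """Timestamp to store in the index: the actual fetch, not the next one."""
    return page.prev_fetch_time
```

 Indexing `fetch_time` instead would put a future date on every document, which is the symptom described in the issue.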





[jira] [Created] (NUTCH-2004) ParseChecker does not handle redirects

2015-04-29 Thread Michael Joyce (JIRA)
Michael Joyce created NUTCH-2004:


 Summary: ParseChecker does not handle redirects
 Key: NUTCH-2004
 URL: https://issues.apache.org/jira/browse/NUTCH-2004
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Michael Joyce
Priority: Minor


At the moment ParseChecker doesn't handle redirects. If it gets anything but a 
success status it errors out. It would be nice if it handled redirects a bit 
more gracefully based on the http.redirects config setting.
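
A minimal sketch of the requested behaviour, assuming a fetch callable that returns a status code and an optional redirect target. The function name, the `fetch` signature, and the `max_redirects` parameter (standing in for the http.redirects setting mentioned above) are assumptions for illustration, not ParseChecker's actual code.

```python
REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def resolve_final_url(url, fetch, max_redirects):
    """Follow redirects up to max_redirects and return the URL that succeeded.

    fetch(url) is assumed to return (status_code, location_or_None).
    """
    seen = {url}
    for _ in range(max_redirects + 1):
        status, location = fetch(url)
        if status == 200:
            return url
        if status in REDIRECT_STATUSES and location:
            if location in seen:
                raise RuntimeError(f"redirect loop at {location}")
            seen.add(location)
            url = location
            continue
        raise RuntimeError(f"unexpected status {status} for {url}")
    raise RuntimeError(f"more than {max_redirects} redirects")
```

With `max_redirects` set to 0 this reproduces the current behaviour: anything other than a success status is an error.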





[jira] [Commented] (NUTCH-2004) ParseChecker does not handle redirects

2015-04-29 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520028#comment-14520028
 ] 

Michael Joyce commented on NUTCH-2004:
--

Hi folks, will try to get a patch thrown up shortly for this.

 ParseChecker does not handle redirects
 --

 Key: NUTCH-2004
 URL: https://issues.apache.org/jira/browse/NUTCH-2004
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Michael Joyce
Priority: Minor

 At the moment ParseChecker doesn't handle redirects. If it gets anything but 
 a success status it errors out. It would be nice if it handled redirects a 
 bit more gracefully based on the http.redirects config setting.





[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk

2015-04-20 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503746#comment-14503746
 ] 

Michael Joyce commented on NUTCH-1934:
--

Hey [~lewismc], 

Patch applied cleanly to trunk for me, and a simple crawl over one site worked just 
fine. Couldn't run the tests unfortunately since I seem to have some config 
problem locally, but hopefully that's a start at least.

 Refactor Fetcher in trunk
 -

 Key: NUTCH-1934
 URL: https://issues.apache.org/jira/browse/NUTCH-1934
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.10
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
  Labels: memex
 Fix For: 1.11

 Attachments: NUTCH-1934-trunkv2.patch, NUTCH-1934.patch


 Put simply 
 [Fetcher|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java]
  is too big.
 This is kinda strange, as the size of this file is (I think) out of line with 
 every other class within Nutch. The others are reasonably well modularized 
 and split into constituent classes which make sense.





[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk

2015-04-20 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503727#comment-14503727
 ] 

Michael Joyce commented on NUTCH-1934:
--

One sec, Lewis, and I'll take a quick look.

 Refactor Fetcher in trunk
 -

 Key: NUTCH-1934
 URL: https://issues.apache.org/jira/browse/NUTCH-1934
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.10
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
  Labels: memex
 Fix For: 1.11

 Attachments: NUTCH-1934-trunkv2.patch, NUTCH-1934.patch


 Put simply 
 [Fetcher|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java]
  is too big.
 This is kinda strange, as the size of this file is (I think) out of line with 
 every other class within Nutch. The others are reasonably well modularized 
 and split into constituent classes which make sense.





[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-20 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503446#comment-14503446
 ] 

Michael Joyce commented on NUTCH-1987:
--

Hi folks, PR has been updated with the requested changes. If you have any 
questions or think anything else needs changing let me know.

 Make bin/crawl indexer agnostic
 ---

 Key: NUTCH-1987
 URL: https://issues.apache.org/jira/browse/NUTCH-1987
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Michael Joyce
  Labels: memex
 Fix For: 1.10


 The crawl script makes it a bit challenging to use an indexer that isn't 
 Solr. For instance, when I want to use the indexer-elastic plugin I still 
 need to call the crawler script with a fake Solr URL otherwise it will skip 
 the indexing step altogether.
 {code}
 bin/crawl urls/ crawl/ http://fakeurl.com:9200 1
 {code}
 It would be nice to keep configuration for the Solr indexer in the conf files 
 (to mirror the elastic search indexer conf and others) and to make the 
 indexing parameter simply toggle whether indexing does or doesn't occur 
 instead of also trying to configure the indexer at the same time.
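
 A sketch of the proposed split, assuming the script only needs a yes/no flag for indexing while the choice of indexer lives in the conf files. `plan_crawl_round` and the step names are illustrative labels, not the actual bin/crawl implementation or exact bin/nutch invocations.

```python
def plan_crawl_round(index_enabled):
    """Return the ordered step names for one crawl round.

    The names are illustrative labels only.
    """
    steps = ["generate", "fetch", "parse", "updatedb"]
    if index_enabled:
        # Which indexer runs (Solr, Elasticsearch, ...) is assumed to be
        # decided by the conf files, not by this flag.
        steps.append("index")
    return steps
```

 The flag then answers a single question (index or not), and no placeholder URL is needed to trigger the indexing step.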





[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-18 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501674#comment-14501674
 ] 

Michael Joyce commented on NUTCH-1987:
--

Hey Chris,

Will do. I'll try to take a poke at updating this tomorrow/Monday when I have a 
bit of free time.

 Make bin/crawl indexer agnostic
 ---

 Key: NUTCH-1987
 URL: https://issues.apache.org/jira/browse/NUTCH-1987
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Michael Joyce
  Labels: memex
 Fix For: 1.10


 The crawl script makes it a bit challenging to use an indexer that isn't 
 Solr. For instance, when I want to use the indexer-elastic plugin I still 
 need to call the crawler script with a fake Solr URL otherwise it will skip 
 the indexing step altogether.
 {code}
 bin/crawl urls/ crawl/ http://fakeurl.com:9200 1
 {code}
 It would be nice to keep configuration for the Solr indexer in the conf files 
 (to mirror the elastic search indexer conf and others) and to make the 
 indexing parameter simply toggle whether indexing does or doesn't occur 
 instead of also trying to configure the indexer at the same time.





[jira] [Updated] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings

2015-04-16 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-1986:
-
Labels: memex  (was: )

 Clarify Elastic Search Indexer Plugin Settings
 --

 Key: NUTCH-1986
 URL: https://issues.apache.org/jira/browse/NUTCH-1986
 Project: Nutch
  Issue Type: Improvement
  Components: documentation, indexer, plugin
Affects Versions: 1.9
Reporter: Michael Joyce
  Labels: memex
 Fix For: 1.10


 Was working on getting indexing into elastic search working and realized that 
 the majority of my difficulties were simply me misunderstanding what the 
 config needed. Patch incoming to hopefully clarify what is needed by default, 
 what each option does, and add any helpful defaults.





[jira] [Commented] (NUTCH-1911) Imeprove DomainStatistics tool command line parsing

2015-04-16 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498689#comment-14498689
 ] 

Michael Joyce commented on NUTCH-1911:
--

Hey folks,

Here's what the output from this looks like:

{code}
Usage: DomainStatistics inputDirs outDir mode [numOfReducer]
    inputDirs        Comma separated list of crawldb input directories
                     E.g.: crawl/crawldb/current/
    outDir           Output directory where results should be dumped
    mode             Set statistics gathering mode
                         host    Gather statistics by host
                         domain  Gather statistics by domain
                         suffix  Gather statistics by suffix
                         tld     Gather statistics by top level directory
    [numOfReducers]  Optional number of reduce jobs to use. Defaults to 1.
{code}
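
The argument layout in that usage message can be sketched as follows. This is a Python illustration of the same positional scheme (three required arguments plus an optional reducer count), not the tool's Java code; `build_parser` and `split_dirs` are hypothetical names.

```python
import argparse

MODES = ("host", "domain", "suffix", "tld")


def split_dirs(value):
    """Turn a comma separated list of directories into a Python list."""
    return [d for d in value.split(",") if d]


def build_parser():
    parser = argparse.ArgumentParser(prog="DomainStatistics")
    parser.add_argument("inputDirs", type=split_dirs,
                        help="comma separated list of crawldb input directories")
    parser.add_argument("outDir",
                        help="output directory where results should be dumped")
    parser.add_argument("mode", choices=MODES,
                        help="statistics gathering mode")
    # Optional trailing positional, defaulting to a single reduce job.
    parser.add_argument("numOfReducers", nargs="?", type=int, default=1,
                        help="optional number of reduce jobs to use")
    return parser
```

For example, `build_parser().parse_args(["crawl/crawldb/current/", "out", "host"])` yields a reducer count of 1, while an unknown mode is rejected with a usage error.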

 Imeprove DomainStatistics tool command line parsing
 ---

 Key: NUTCH-1911
 URL: https://issues.apache.org/jira/browse/NUTCH-1911
 Project: Nutch
  Issue Type: Bug
  Components: util
Affects Versions: 1.9, 2.2.1
Reporter: Lewis John McGibbney
Priority: Trivial
 Fix For: 1.11


 The DomainStatistics tool could be improved based on the comments raised 
 in [this mail thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html].
 For convenience, I've also pasted them below:
 {quote}
 You cannot just tell it where the crawldb is, you need to tell it where the 
 directory is, so specifying current is ok, but not part-*
 {quote}
 Patch should be trivial work





[jira] [Updated] (NUTCH-1988) Make nested output directory dump optional

2015-04-16 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-1988:
-
Labels: memex  (was: )

 Make nested output directory dump optional
 --

 Key: NUTCH-1988
 URL: https://issues.apache.org/jira/browse/NUTCH-1988
 Project: Nutch
  Issue Type: Improvement
  Components: dumpers
Affects Versions: 1.9
Reporter: Michael Joyce
Priority: Minor
  Labels: memex
 Fix For: 1.10


 NUTCH-1957 added nested directories to the bin/nutch dump output to help 
 avoid naming conflicts in output files. It would be nice to be able to 
 specify that you want the older flat directory output as an optional 
 parameter.
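
 A small sketch of the two layouts, assuming the nested variant groups files by host. The exact directory scheme introduced by NUTCH-1957 is not described here, so the per-host nesting, the `dump_path` name, and the `flat` parameter are all assumptions for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def dump_path(out_dir, url, flat=False):
    """Compute where a dumped file would be written.

    flat=True  -> everything directly under out_dir (older behaviour)
    flat=False -> one sub-directory per host (assumed nesting scheme)
    """
    parts = urlsplit(url)
    filename = posixpath.basename(parts.path) or "index.html"
    if flat:
        return posixpath.join(out_dir, filename)
    return posixpath.join(out_dir, parts.hostname or "unknown-host", filename)
```

 The trade-off is visible directly: with `flat=True`, `logo.png` from two different hosts maps to the same output path, which is the naming conflict the nested layout avoids.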





[jira] [Commented] (NUTCH-1906) Typo in CrawlDbReader command line help

2015-04-16 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498573#comment-14498573
 ] 

Michael Joyce commented on NUTCH-1906:
--

Hi folks,

I'll throw a patch up shortly for this.

 Typo in CrawlDbReader command line help
 ---

 Key: NUTCH-1906
 URL: https://issues.apache.org/jira/browse/NUTCH-1906
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 1.9
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Trivial
 Fix For: 1.11


 Currently the CrawlDbReader tool, when invoked without any command line 
 arguments, prints the following help:
 {code}
 [mdeploy@crawl local]$ ./bin/nutch readdb
 Usage: CrawlDbReader crawldb (-stats | -dump out_dir | -topN  
 out_dir [min] | -url url)
   crawldb   directory name where crawldb is located
   -stats [-sort]  print overall statistics to System.out
   [-sort] list status sorted by host
   -dump out_dir [-format normal|csv|crawldb]dump the whole db to a 
 text file in out_dir
   [-format csv]   dump in Csv format
   [-format normal]dump in standard format (default option)
   [-format crawldb]   dump as CrawlDB
   [-regex expr] filter records with expression
   [-retry num]  minimum retry count
   [-status status]  filter records by CrawlDatum status
   -url url  print information on url to System.out
   -topN  out_dir [min]  dump top  urls sorted by score to 
 out_dir
   [min] skip records with scores below this value.
   This can significantly improve performance.
 {code}
 The code that bothers me is
 {code}
   -stats [-sort]  print overall statistics to System.out
   [-sort] list status sorted by host
 {code}
 The duplicated -sort entry is unnecessary.
 Having looked through the code, I found no other optional flag that could 
 replace the second one (I had thought it might be a placeholder for 
 something else), so we can just remove it.





[jira] [Updated] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-16 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-1987:
-
Labels: memex  (was: )

 Make bin/crawl indexer agnostic
 ---

 Key: NUTCH-1987
 URL: https://issues.apache.org/jira/browse/NUTCH-1987
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Michael Joyce
  Labels: memex
 Fix For: 1.10


 The crawl script makes it a bit challenging to use an indexer that isn't 
 Solr. For instance, when I want to use the indexer-elastic plugin I still 
 need to call the crawler script with a fake Solr URL otherwise it will skip 
 the indexing step altogether.
 {code}
 bin/crawl urls/ crawl/ http://fakeurl.com:9200 1
 {code}
 It would be nice to keep configuration for the Solr indexer in the conf files 
 (to mirror the elastic search indexer conf and others) and to make the 
 indexing parameter simply toggle whether indexing does or doesn't occur 
 instead of also trying to configure the indexer at the same time.




