[jira] [Resolved] (NUTCH-2218) Switch CrawlCompletion arg parsing to Commons CLI
[ https://issues.apache.org/jira/browse/NUTCH-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Joyce resolved NUTCH-2218.
----------------------------------
Resolution: Fixed

[~lewismc], this got merged. I added an example to the option you raised as well. If that doesn't address your concerns, let me know and I'll update in another ticket.

{code}
| -> ./bin/nutch crawlcomplete
usage: CrawlCompletionStats [-h] -inputDirs -mode [-numReducers ] -outputDir
 -h,--help       Show this message
 -inputDirs      Comma separated list of crawl directories (e.g., "./crawl1,./crawl2")
 -mode           Set statistics gathering mode (by 'host' or by 'domain')
 -numReducers    Optional number of reduce jobs to use. Defaults to 1
 -outputDir      Output directory where results should be dumped
{code}

> Switch CrawlCompletion arg parsing to Commons CLI
> -------------------------------------------------
>
> Key: NUTCH-2218
> URL: https://issues.apache.org/jira/browse/NUTCH-2218
> Project: Nutch
> Issue Type: Improvement
> Components: util
> Affects Versions: 1.11
> Reporter: Michael Joyce
> Assignee: Michael Joyce
> Priority: Minor
> Fix For: 1.12
>
> The current CrawlCompletion utility should be updated to use commons CLI
> instead of doing manual arg parsing and checking.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2218) Switch CrawlCompletion arg parsing to Commons CLI
[ https://issues.apache.org/jira/browse/NUTCH-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15152731#comment-15152731 ] Michael Joyce commented on NUTCH-2218: -- Sorry for any confusion here, folks. Changes were merged in r1731102. The README was updated in r1731103 since I forgot to update it. The PR should be closed in r1731103. > Switch CrawlCompletion arg parsing to Commons CLI > - > > Key: NUTCH-2218 > URL: https://issues.apache.org/jira/browse/NUTCH-2218 > Project: Nutch > Issue Type: Improvement > Components: util >Affects Versions: 1.11 >Reporter: Michael Joyce >Assignee: Michael Joyce >Priority: Minor > Fix For: 1.12 > > > The current CrawlCompletion utility should be updated to use commons CLI > instead of doing manual arg parsing and checking. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2218) Switch CrawlCompletion arg parsing to Commons CLI
[ https://issues.apache.org/jira/browse/NUTCH-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2218: - Issue Type: Improvement (was: Bug) > Switch CrawlCompletion arg parsing to Commons CLI > - > > Key: NUTCH-2218 > URL: https://issues.apache.org/jira/browse/NUTCH-2218 > Project: Nutch > Issue Type: Improvement > Components: util >Affects Versions: 1.11 >Reporter: Michael Joyce >Assignee: Michael Joyce >Priority: Minor > Fix For: 1.12 > > > The current CrawlCompletion utility should be updated to use commons CLI > instead of doing manual arg parsing and checking. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2187) Change FileDumper SHAs to all uppercase
Michael Joyce created NUTCH-2187: Summary: Change FileDumper SHAs to all uppercase Key: NUTCH-2187 URL: https://issues.apache.org/jira/browse/NUTCH-2187 Project: Nutch Issue Type: Improvement Components: tool Affects Versions: 1.11 Reporter: Michael Joyce Assignee: Michael Joyce Fix For: 1.12 It would be nice to have the reverseUrlDirs options dump SHAs in all uppercase for consistency -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-2187) Change FileDumper SHAs to all uppercase
[ https://issues.apache.org/jira/browse/NUTCH-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce resolved NUTCH-2187. -- Resolution: Duplicate Going to just resolve this in NUTCH-2182. Thought that patch had already been committed. > Change FileDumper SHAs to all uppercase > --- > > Key: NUTCH-2187 > URL: https://issues.apache.org/jira/browse/NUTCH-2187 > Project: Nutch > Issue Type: Improvement > Components: tool >Affects Versions: 1.11 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.12 > > > It would be nice to have the reverseUrlDirs options dump SHAs in all > uppercase for consistency -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-2182) Make reverseUrlDirs file dumper option hash the URL for consistency
[ https://issues.apache.org/jira/browse/NUTCH-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce resolved NUTCH-2182. -- Resolution: Fixed Resolved in r1720466 > Make reverseUrlDirs file dumper option hash the URL for consistency > --- > > Key: NUTCH-2182 > URL: https://issues.apache.org/jira/browse/NUTCH-2182 > Project: Nutch > Issue Type: Improvement > Components: tool >Affects Versions: 1.11 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.12 > > Attachments: NUTCH-2182_joyce_8Dec2015.patch > > > At the moment the "reverseUrlDirs" option for FileDumper is terribly brittle > and fails on a fair number of edge cases. A more robust way to handle the > reverse URL approach to dumping a file is to reverse the server part and hash > the URL to use as the file name. This gives us a nice split of files while > avoiding a number of likely classes of errors that cause dumps to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2180) FileDumper dumps data, but breaks midway on corrupt segments
[ https://issues.apache.org/jira/browse/NUTCH-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048889#comment-15048889 ] Michael Joyce commented on NUTCH-2180: -- Thanks for the patch [~hmanjuna], will scope shortly > FileDumper dumps data, but breaks midway on corrupt segments > > > Key: NUTCH-2180 > URL: https://issues.apache.org/jira/browse/NUTCH-2180 > Project: Nutch > Issue Type: Bug > Components: bin, dumpers >Affects Versions: 1.11 > Environment: Ubuntu 14.04.3 x64 >Reporter: Harshavardhan Manjunatha >Assignee: Michael Joyce > Fix For: 1.11 > > > FileDumper should ignore corrupt segments, and continue dumping data instead > of throwing NullPointerException > $ bin/nutch dump -segment ../../../segments/ -outputDir ./firstDump/ -flatdir > java.lang.NullPointerException > at org.apache.nutch.tools.FileDumper.dump(FileDumper.java:175) > at org.apache.nutch.tools.FileDumper.main(FileDumper.java:417) > $ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
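The behaviour requested in NUTCH-2180 (skip a corrupt segment and keep dumping, rather than aborting on a NullPointerException) can be sketched as a per-segment try/catch. This is only an illustration of the idea, not the attached patch: `TolerantDump`, `SegmentDumper` and `dumpAll` are hypothetical names, and the real FileDumper works on Hadoop segment data rather than plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: process every segment, remembering the ones that failed.
final class TolerantDump {

    /** Stand-in for whatever dumps a single segment; an assumption, not Nutch API. */
    interface SegmentDumper {
        void dump(String segment) throws Exception;
    }

    /** Dumps each segment in turn and returns the segments that had to be skipped. */
    static List<String> dumpAll(List<String> segments, SegmentDumper dumper) {
        List<String> skipped = new ArrayList<String>();
        for (String segment : segments) {
            try {
                dumper.dump(segment);
            } catch (Exception e) {
                // A corrupt segment is reported and skipped; the loop carries on.
                System.err.println("Skipping segment " + segment + ": " + e);
                skipped.add(segment);
            }
        }
        return skipped;
    }
}
```

Returning the skipped segments lets the caller report them at the end of the run instead of losing the failure silently.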
[jira] [Assigned] (NUTCH-2180) FileDumper dumps data, but breaks midway on corrupt segments
[ https://issues.apache.org/jira/browse/NUTCH-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce reassigned NUTCH-2180: Assignee: Michael Joyce > FileDumper dumps data, but breaks midway on corrupt segments > > > Key: NUTCH-2180 > URL: https://issues.apache.org/jira/browse/NUTCH-2180 > Project: Nutch > Issue Type: Bug > Components: bin, dumpers >Affects Versions: 1.11 > Environment: Ubuntu 14.04.3 x64 >Reporter: Harshavardhan Manjunatha >Assignee: Michael Joyce > Fix For: 1.11 > > > FileDumper should ignore corrupt segments, and continue dumping data instead > of throwing NullPointerException > $ bin/nutch dump -segment ../../../segments/ -outputDir ./firstDump/ -flatdir > java.lang.NullPointerException > at org.apache.nutch.tools.FileDumper.dump(FileDumper.java:175) > at org.apache.nutch.tools.FileDumper.main(FileDumper.java:417) > $ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2182) Make reverseUrlDirs file dumper option hash the URL for consistency
[ https://issues.apache.org/jira/browse/NUTCH-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2182: - Attachment: NUTCH-2182_joyce_8Dec2015.patch Patch Attached > Make reverseUrlDirs file dumper option hash the URL for consistency > --- > > Key: NUTCH-2182 > URL: https://issues.apache.org/jira/browse/NUTCH-2182 > Project: Nutch > Issue Type: Improvement > Components: tool >Affects Versions: 1.11 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.12 > > Attachments: NUTCH-2182_joyce_8Dec2015.patch > > > At the moment the "reverseUrlDirs" option for FileDumper is terribly brittle > and fails on a fair number of edge cases. A more robust way to handle the > reverse URL approach to dumping a file is to reverse the server part and hash > the URL to use as the file name. This gives us a nice split of files while > avoiding a number of likely classes of errors that cause dumps to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2182) Make reverseUrlDirs file dumper option hash the URL for consistency
Michael Joyce created NUTCH-2182: Summary: Make reverseUrlDirs file dumper option hash the URL for consistency Key: NUTCH-2182 URL: https://issues.apache.org/jira/browse/NUTCH-2182 Project: Nutch Issue Type: Improvement Components: tool Affects Versions: 1.11 Reporter: Michael Joyce Assignee: Michael Joyce Fix For: 1.12 At the moment the "reverseUrlDirs" option for FileDumper is terribly brittle and fails on a fair number of edge cases. A more robust way to handle the reverse URL approach to dumping a file is to reverse the server part and hash the URL to use as the file name. This gives us a nice split of files while avoiding a number of likely classes of errors that cause dumps to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
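The approach described in NUTCH-2182 (reverse the server part, hash the URL for the file name) might look roughly like the sketch below. It is written under stated assumptions: the ticket does not name the hash algorithm, so SHA-256 is used here purely for illustration, the uppercase hex output follows the wish in NUTCH-2187, and `HashedDumpPath` and its methods are hypothetical names rather than FileDumper code.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: reversed host as directories, hash of the URL as file name.
final class HashedDumpPath {

    /** e.g. http://bar.foo.com/x -> com/foo/bar/<64 uppercase hex digits> */
    static String toHashedPath(String url) {
        String[] labels = URI.create(url).getHost().split("\\.");
        StringBuilder path = new StringBuilder();
        // Walk the host labels right to left: bar.foo.com -> com/foo/bar/
        for (int i = labels.length - 1; i >= 0; i--) {
            path.append(labels[i]).append('/');
        }
        return path.append(sha256Hex(url)).toString();
    }

    /** Uppercase hex SHA-256 of the text (the algorithm choice is an assumption). */
    static String sha256Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02X", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

Because the file name is a fixed-length hex string, characters such as `?`, `&` or very long paths in the URL never reach the file system, which is the kind of edge case the ticket is trying to avoid.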
[jira] [Commented] (NUTCH-2158) Upgrade to Tika 1.11
[ https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15023388#comment-15023388 ] Michael Joyce commented on NUTCH-2158: -- +1 on this. Looks good to me > Upgrade to Tika 1.11 > > > Key: NUTCH-2158 > URL: https://issues.apache.org/jira/browse/NUTCH-2158 > Project: Nutch > Issue Type: Task > Components: parser >Reporter: Chris A. Mattmann >Assignee: Julien Nioche > Fix For: 1.11 > > Attachments: NUTCH-2158-test-protocol-http.patch, NUTCH-2158.patch > > > Upgrade parse-tika to 1.11 release for Tika. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2173) String.join in FileDumper breaks the build
Michael Joyce created NUTCH-2173: Summary: String.join in FileDumper breaks the build Key: NUTCH-2173 URL: https://issues.apache.org/jira/browse/NUTCH-2173 Project: Nutch Issue Type: Bug Components: tool Affects Versions: 1.10 Reporter: Michael Joyce Assignee: Michael Joyce Fix For: 1.11 The new FileDumper changes use a 1.8 String function that breaks the build on 1.7 {code} String.join {code} Thanks [~kwhitehall] for finding this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
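`String.join` was added in Java 8, which is why it cannot compile against a 1.7 target. The ticket does not say how the build was fixed; one way to get the same result on Java 7 is a small hand-written helper such as the hypothetical `Java7Join.join` below.

```java
// Hypothetical Java 7-compatible replacement for String.join(delimiter, parts).
final class Java7Join {

    /** Concatenates the parts, placing the delimiter between neighbours. */
    static String join(String delimiter, Iterable<String> parts) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (String part : parts) {
            if (!first) {
                sb.append(delimiter);
            }
            sb.append(part);
            first = false;
        }
        return sb.toString();
    }
}
```

For example, joining the list `com`, `foo`, `bar` with `/` gives `com/foo/bar`, and an empty list gives the empty string, matching what `String.join` would return.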
[jira] [Work started] (NUTCH-2173) String.join in FileDumper breaks the build
[ https://issues.apache.org/jira/browse/NUTCH-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2173 started by Michael Joyce. > String.join in FileDumper breaks the build > -- > > Key: NUTCH-2173 > URL: https://issues.apache.org/jira/browse/NUTCH-2173 > Project: Nutch > Issue Type: Bug > Components: tool >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.11 > > > The new FileDumper changes use a 1.8 String function that breaks the build on > 1.7 > {code} > String.join > {code} > Thanks [~kwhitehall] for finding this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-2173) String.join in FileDumper breaks the build
[ https://issues.apache.org/jira/browse/NUTCH-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce resolved NUTCH-2173. -- Resolution: Fixed Resolved in r1715046 > String.join in FileDumper breaks the build > -- > > Key: NUTCH-2173 > URL: https://issues.apache.org/jira/browse/NUTCH-2173 > Project: Nutch > Issue Type: Bug > Components: tool >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.11 > > > The new FileDumper changes use a 1.8 String function that breaks the build on > 1.7 > {code} > String.join > {code} > Thanks [~kwhitehall] for finding this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-2166) Add reverse URL format to dump tool
[ https://issues.apache.org/jira/browse/NUTCH-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce resolved NUTCH-2166. -- Resolution: Fixed Committed in r1714908 > Add reverse URL format to dump tool > --- > > Key: NUTCH-2166 > URL: https://issues.apache.org/jira/browse/NUTCH-2166 > Project: Nutch > Issue Type: Improvement > Components: tool >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 2.4, 1.11 > > Attachments: NUTCH-2166_joyce_13Nov2015.patch > > > Update the FileDumper tool with an option for dumping files to the output > directory in reverse URL format. > So the file for > http://bar.foo.com:8983/to/index.html?a=b > Would dump to > /com/foo/bar/8983/http/to/index.html?a=b -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2166) Add reverse URL format to dump tool
[ https://issues.apache.org/jira/browse/NUTCH-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15004191#comment-15004191 ]

Michael Joyce commented on NUTCH-2166:
--------------------------------------

Output from a small example run. I don't know that I'm terribly happy with the _file solution. Open to ideas on that.

{code}
dumpoutputtest/
├── edu
│   └── caltech
│       └── www
│           └── http
│               └── _file
└── gov
    └── nasa
        ├── eyes
        │   └── http
        │       ├── _file
        │       ├── earth
        │       │   └── _file
        │       └── exoplanets
        │           └── _file
        ├── jpl
        │   ├── blogs
        │   │   └── http
        │   │       └── _file
        │   ├── http
        │   │   └── _file
        │   ├── mars
        │   │   └── http
        │   │       └── _file
        │   ├── photojournal
        │   │   └── http
        │   │       └── _file
        │   ├── planetquest
        │   │   └── http
        │   │       └── _file
        │   └── www
        │       └── http
        │           ├── _file
        │           ├── about
        │           │   ├── _file
        │           │   ├── exec.php
        │           │   ├── history.php
        │           │   └── reports.php
        │           ├── apps
        │           │   └── _file
        │           ├── asteroidwatch
        │           │   └── _file
        │           ├── contact_JPL.php
        │           ├── edu
        │           │   ├── _file
        │           │   ├── events
        │           │   │   ├── 2015
        │           │   │   │   └── 11
        │           │   │   │       └── 1
        │           │   │   │           └── see-the-phases-of-the-moon-by-day-and-by-night
        │           │   │   │               └── _file
        │           │   │   └── _file
        │           │   ├── intern
        │           │   │   └── _file
        │           │   ├── learn
        │           │   │   └── _file
        │           │   ├── news
        │           │   │   └── _file
        │           │   └── teach
        │           │       └── _file
        │           ├── events
        │           │   ├── _file
        │           │   ├── lectures.php
        │           │   ├── open-house.php
        │           │   ├── speakers-bureau.php
        │           │   ├── team-competitions.php
        │           │   └── tours
        │           │       └── views
        │           │           └── _file
        │           ├── infographics
        │           │   └── _file
        │           ├── missions
        │           │   └── _file
        │           ├── multimedia
        │           │   └── audio.php
        │           ├── news
        │           │   ├── _file
        │           │   ├── factsheets.php
        │           │   ├── mediaroom.php
        │           │   └── presskits.php
        │           ├── opportunities
        │           │   └── _file
        │           ├── social
        │           │   └── _file
        │           ├── spaceimages
        │           │   └── _file
        │           └── videos
        │               └── _file
        ├── solarsystem
        │   └── http
        │       └── _file
        └── www
            └── http
                ├── _file
                └── earthrightnow
                    └── _file

51 directories, 44 files
{code}

> Add reverse URL format to dump tool
> -----------------------------------
>
> Key: NUTCH-2166
> URL: https://issues.apache.org/jira/browse/NUTCH-2166
> Project: Nutch
> Issue Type: Improvement
> Components: tool
> Affects Versions: 2.3, 1.10
> Reporter: Michael Joyce
> Assignee: Michael Joyce
> Fix For: 2.4, 1.11
> Attachments: NUTCH-2166_joyce_13Nov2015.patch
>
> Update the FileDumper tool with an option for dumping files to the output
> directory in reverse URL format.
> So the file for
> http://bar.foo.com:8983/to/index.html?a=b
> Would dump to
> /com/foo/bar/8983/http/to/index.html?a=b

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2166) Add reverse URL format to dump tool
[ https://issues.apache.org/jira/browse/NUTCH-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002328#comment-15002328 ]

Michael Joyce commented on NUTCH-2166:
--------------------------------------

Small change in dump format. Instead of making a bajillion nested folders, it seems like it might be nicer to simply use the reverse URL as the file name.

So the file for
http://bar.foo.com:8983/to/index.htm
would dump to the encoded
/com%2Ffoo%2Fbar%2F8983%2Fhttp%2Fto%2Findex.htm

Of course, we may then run into file name length issues this way. Perhaps having both eventually will be useful?

> Add reverse URL format to dump tool
> -----------------------------------
>
> Key: NUTCH-2166
> URL: https://issues.apache.org/jira/browse/NUTCH-2166
> Project: Nutch
> Issue Type: Improvement
> Components: tool
> Affects Versions: 2.3, 1.10
> Reporter: Michael Joyce
> Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
> Update the FileDumper tool with an option for dumping files to the output
> directory in reverse URL format.
> So the file for
> http://bar.foo.com:8983/to/index.html?a=b
> Would dump to
> /com/foo/bar/8983/http/to/index.html?a=b

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
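The flat, encoded file name proposed in this comment can be reproduced with the JDK alone. The sketch below is an assumption about how such a name could be built, not the committed FileDumper code: `EncodedDumpName.toEncodedName` is a hypothetical helper that builds the reversed path and percent-encodes it with `java.net.URLEncoder`, leaving only the leading slash unencoded as in the example above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

// Hypothetical sketch: reversed URL, percent-encoded into a single file name.
final class EncodedDumpName {

    static String toEncodedName(String url) {
        URI uri = URI.create(url);
        String[] labels = uri.getHost().split("\\.");
        StringBuilder reversed = new StringBuilder();
        // Host labels right to left: bar.foo.com -> com/foo/bar
        for (int i = labels.length - 1; i >= 0; i--) {
            if (reversed.length() > 0) {
                reversed.append('/');
            }
            reversed.append(labels[i]);
        }
        // getPort() is -1 when the URL does not name a port explicitly.
        if (uri.getPort() != -1) {
            reversed.append('/').append(uri.getPort());
        }
        reversed.append('/').append(uri.getScheme()).append(uri.getRawPath());
        try {
            // URLEncoder turns every '/' into %2F, so the result has no separators.
            return "/" + URLEncoder.encode(reversed.toString(), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

The length concern raised in the comment is real for this scheme: each `/` grows to three characters, and many file systems limit a single name to roughly 255 bytes, so long URLs would need truncation or a fallback.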
[jira] [Resolved] (NUTCH-2167) Backport TableUtil from 2.x for URL reversing
[ https://issues.apache.org/jira/browse/NUTCH-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce resolved NUTCH-2167. -- Resolution: Fixed TableUtil copied over in r1714078 and tests copied over in r1714079 > Backport TableUtil from 2.x for URL reversing > - > > Key: NUTCH-2167 > URL: https://issues.apache.org/jira/browse/NUTCH-2167 > Project: Nutch > Issue Type: Sub-task > Components: tool >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.11 > > > The > [TableUtil|https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/TableUtil.java] > file provides a number of helpful utility functions for URL reversing that > would be useful to have in 1.x -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2165) FileDumper Util hard codes part-# folder name
[ https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002604#comment-15002604 ] Michael Joyce commented on NUTCH-2165: -- Thanks [~lewismc], I'll merge shortly > FileDumper Util hard codes part-# folder name > - > > Key: NUTCH-2165 > URL: https://issues.apache.org/jira/browse/NUTCH-2165 > Project: Nutch > Issue Type: Bug > Components: tool >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 2.4, 1.11 > > Attachments: NUTCH-2165_joyce_11Nov2015.patch > > > Hi folks, [~lewismc] and I were just discussing this off list. It seems that > the part-# folders seem to be hard coded to part-0 in the [FileDumper > utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167] > which could prove problematic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-2165) FileDumper Util hard codes part-# folder name
[ https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce resolved NUTCH-2165. -- Resolution: Fixed Committed in r1714104 > FileDumper Util hard codes part-# folder name > - > > Key: NUTCH-2165 > URL: https://issues.apache.org/jira/browse/NUTCH-2165 > Project: Nutch > Issue Type: Bug > Components: tool >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 2.4, 1.11 > > Attachments: NUTCH-2165_joyce_11Nov2015.patch > > > Hi folks, [~lewismc] and I were just discussing this off list. It seems that > the part-# folders seem to be hard coded to part-0 in the [FileDumper > utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167] > which could prove problematic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce resolved NUTCH-2155. -- Resolution: Fixed Latest patch committed in r1713885 > Create a "crawl completeness" utility > - > > Key: NUTCH-2155 > URL: https://issues.apache.org/jira/browse/NUTCH-2155 > Project: Nutch > Issue Type: Improvement > Components: util >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Labels: memex > Fix For: 1.11 > > Attachments: NUTCH-2155_joyce_9Nov2015.patch > > > I've found it useful to have a tool for dumping some "completeness" > information from a crawl similar to how domainstats does but including > fetched and unfetched counts per domain/host. This is especially nice when > doing vertical crawls over a few domains or just to see how much of a > host/domain you've covered with your crawl so far. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
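To make the "completeness" idea concrete, here is a small in-memory sketch of fetched and unfetched counts per host. It is not the committed utility, which per the usage text runs as a Hadoop job with a configurable number of reducers; `CompletenessCounts.countByHost` and its input map are hypothetical stand-ins, and only the by-host mode is shown.

```java
import java.net.URI;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for per-host fetched/unfetched statistics.
final class CompletenessCounts {

    /**
     * Input: URL -> whether it has been fetched.
     * Output: host -> {fetched count, unfetched count}, sorted by host.
     */
    static Map<String, int[]> countByHost(Map<String, Boolean> fetchedByUrl) {
        Map<String, int[]> counts = new TreeMap<String, int[]>();
        for (Map.Entry<String, Boolean> entry : fetchedByUrl.entrySet()) {
            String host = URI.create(entry.getKey()).getHost();
            int[] c = counts.get(host);
            if (c == null) {
                c = new int[2];
                counts.put(host, c);
            }
            // Index 0 counts fetched URLs, index 1 counts unfetched ones.
            c[entry.getValue() ? 0 : 1]++;
        }
        return counts;
    }
}
```

Grouping by domain instead of host would need a rule for finding the registered domain of each host, which this sketch leaves out.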
[jira] [Resolved] (NUTCH-2150) Add ProtocolStatus Utility
[ https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce resolved NUTCH-2150. -- Resolution: Fixed Resolved in r1713892 > Add ProtocolStatus Utility > -- > > Key: NUTCH-2150 > URL: https://issues.apache.org/jira/browse/NUTCH-2150 > Project: Nutch > Issue Type: Improvement > Components: util >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.11 > > Attachments: NUTCH-2015_joyce_9Nov2015.patch > > > It would be nice to have a utility for dumping protocol status code > information for a crawl database. This will be a utility for getting a dump > of the protocol status codes that builds off of NUTCH-2129 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2167) Backport TableUtil from 2.x for URL reversing
[ https://issues.apache.org/jira/browse/NUTCH-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000841#comment-15000841 ] Michael Joyce commented on NUTCH-2167: -- Hi folks, All looks good and tests run fine after moving this over for testing. I'm going to svn cp them over if no one has any objections. > Backport TableUtil from 2.x for URL reversing > - > > Key: NUTCH-2167 > URL: https://issues.apache.org/jira/browse/NUTCH-2167 > Project: Nutch > Issue Type: Sub-task > Components: tool >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.11 > > > The > [TableUtil|https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/TableUtil.java] > file provides a number of helpful utility functions for URL reversing that > would be useful to have in 1.x -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (NUTCH-2167) Backport TableUtil from 2.x for URL reversing
[ https://issues.apache.org/jira/browse/NUTCH-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2167 started by Michael Joyce. > Backport TableUtil from 2.x for URL reversing > - > > Key: NUTCH-2167 > URL: https://issues.apache.org/jira/browse/NUTCH-2167 > Project: Nutch > Issue Type: Sub-task > Components: tool >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.11 > > > The > [TableUtil|https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/TableUtil.java] > file provides a number of helpful utility functions for URL reversing that > would be useful to have in 1.x -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (NUTCH-1911) Improve DomainStatistics tool command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-1911 started by Michael Joyce. > Improve DomainStatistics tool command line parsing > -- > > Key: NUTCH-1911 > URL: https://issues.apache.org/jira/browse/NUTCH-1911 > Project: Nutch > Issue Type: Bug > Components: util >Affects Versions: 1.9, 2.2.1 >Reporter: Lewis John McGibbney >Assignee: Michael Joyce >Priority: Trivial > Fix For: 1.10, 1.11 > > Attachments: NUTCH-1911_joyce_9Nov2015.patch, > NUTCH-1911_joyce_9Nov2015.patch > > > The DomainStatistics tool could be improved based on the comments addressed > in [this mail > thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html] > For convenience, I've also pasted them below > {quote} > You cannot just tell it where the crawldb is, you need to tell it where the > directory is, so specifying current is ok, but not part-* > {quote} > Patch should be trivial work -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-1911) Improve DomainStatistics tool command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce resolved NUTCH-1911. -- Resolution: Fixed Resolved in r1713890 > Improve DomainStatistics tool command line parsing > -- > > Key: NUTCH-1911 > URL: https://issues.apache.org/jira/browse/NUTCH-1911 > Project: Nutch > Issue Type: Bug > Components: util >Affects Versions: 1.9, 2.2.1 >Reporter: Lewis John McGibbney >Assignee: Michael Joyce >Priority: Trivial > Fix For: 1.11, 1.10 > > Attachments: NUTCH-1911_joyce_9Nov2015.patch, > NUTCH-1911_joyce_9Nov2015.patch > > > The DomainStatistics tool could be improved based on the comments addressed > in [this mail > thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html] > For convenience, I've also pasted them below > {quote} > You cannot just tell it where the crawldb is, you need to tell it where the > directory is, so specifying current is ok, but not part-* > {quote} > Patch should be trivial work -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (NUTCH-2150) Add ProtocolStatus Utility
[ https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2150 started by Michael Joyce. > Add ProtocolStatus Utility > -- > > Key: NUTCH-2150 > URL: https://issues.apache.org/jira/browse/NUTCH-2150 > Project: Nutch > Issue Type: Improvement > Components: util >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.11 > > Attachments: NUTCH-2015_joyce_9Nov2015.patch > > > It would be nice to have a utility for dumping protocol status code > information for a crawl database. This will be a utility for getting a dump > of the protocol status codes that builds off of NUTCH-2129 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2155 started by Michael Joyce. > Create a "crawl completeness" utility > - > > Key: NUTCH-2155 > URL: https://issues.apache.org/jira/browse/NUTCH-2155 > Project: Nutch > Issue Type: Improvement > Components: util >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Labels: memex > Fix For: 1.11 > > Attachments: NUTCH-2155_joyce_9Nov2015.patch > > > I've found it useful to have a tool for dumping some "completeness" > information from a crawl similar to how domainstats does but including > fetched and unfetched counts per domain/host. This is especially nice when > doing vertical crawls over a few domains or just to see how much of a > host/domain you've covered with your crawl so far. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2166) Add reverse URL format to dump tool
Michael Joyce created NUTCH-2166: Summary: Add reverse URL format to dump tool Key: NUTCH-2166 URL: https://issues.apache.org/jira/browse/NUTCH-2166 Project: Nutch Issue Type: Improvement Components: tool Affects Versions: 1.10, 2.3 Reporter: Michael Joyce Assignee: Michael Joyce Fix For: 2.4, 1.11 Update the FileDumper tool with an option for dumping files to the output directory in reverse URL format. So the file for http://bar.foo.com:8983/to/index.html?a=b Would dump to /com/foo/bar/8983/http/to/index.html?a=b -- This message was sent by Atlassian JIRA (v6.3.4#6332)
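The mapping in this ticket can be reproduced with `java.net.URI`. The sketch below is only an illustration of the format described above: `ReverseUrlPath.toReversePath` is a hypothetical helper, it does not use the TableUtil class backported in NUTCH-2167, and it makes no attempt to sanitise characters that a file system may reject.

```java
import java.net.URI;

// Hypothetical sketch of the "reverse URL" dump path described in NUTCH-2166.
final class ReverseUrlPath {

    /** Builds /<reversed host>/<port>/<scheme><path>[?query] from a URL. */
    static String toReversePath(String url) {
        URI uri = URI.create(url);
        String[] labels = uri.getHost().split("\\.");
        StringBuilder out = new StringBuilder();
        // Host labels right to left: bar.foo.com -> /com/foo/bar
        for (int i = labels.length - 1; i >= 0; i--) {
            out.append('/').append(labels[i]);
        }
        // getPort() is -1 when the URL does not name a port explicitly.
        if (uri.getPort() != -1) {
            out.append('/').append(uri.getPort());
        }
        out.append('/').append(uri.getScheme());
        out.append(uri.getRawPath());
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints /com/foo/bar/8983/http/to/index.html?a=b
        System.out.println(toReversePath("http://bar.foo.com:8983/to/index.html?a=b"));
    }
}
```

Putting the reversed host first groups all pages of one site, and all sites of one domain, under a common directory prefix.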
[jira] [Work started] (NUTCH-2166) Add reverse URL format to dump tool
[ https://issues.apache.org/jira/browse/NUTCH-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2166 started by Michael Joyce. > Add reverse URL format to dump tool > --- > > Key: NUTCH-2166 > URL: https://issues.apache.org/jira/browse/NUTCH-2166 > Project: Nutch > Issue Type: Improvement > Components: tool >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 2.4, 1.11 > > > Update the FileDumper tool with an option for dumping files to the output > directory in reverse URL format. > So the file for > http://bar.foo.com:8983/to/index.html?a=b > Would dump to > /com/foo/bar/8983/http/to/index.html?a=b -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2165) FileDumper Util hard codes part-# folder name
Michael Joyce created NUTCH-2165: Summary: FileDumper Util hard codes part-# folder name Key: NUTCH-2165 URL: https://issues.apache.org/jira/browse/NUTCH-2165 Project: Nutch Issue Type: Bug Components: tool Affects Versions: 1.10, 2.3 Reporter: Michael Joyce Fix For: 2.4, 1.11 Hi folks, [~lewismc] and I were just discussing this off list. It seems that the part-# folders seem to be hard coded to part-0 in the [FileDumper utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167] which could prove problematic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2167) Backport TableUtil from 2.x for URL reversing
Michael Joyce created NUTCH-2167: Summary: Backport TableUtil from 2.x for URL reversing Key: NUTCH-2167 URL: https://issues.apache.org/jira/browse/NUTCH-2167 Project: Nutch Issue Type: Sub-task Components: tool Affects Versions: 1.10 Reporter: Michael Joyce Assignee: Michael Joyce Fix For: 1.11 The [TableUtil|https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/TableUtil.java] file provides a number of helpful utility functions for URL reversing that would be useful to have in 1.x -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (NUTCH-2165) FileDumper Util hard codes part-# folder name
[ https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2165 started by Michael Joyce. > FileDumper Util hard codes part-# folder name > - > > Key: NUTCH-2165 > URL: https://issues.apache.org/jira/browse/NUTCH-2165 > Project: Nutch > Issue Type: Bug > Components: tool >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 2.4, 1.11 > > > Hi folks, [~lewismc] and I were just discussing this off list. It seems that > the part-# folders seem to be hard coded to part-0 in the [FileDumper > utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167] > which could prove problematic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (NUTCH-2165) FileDumper Util hard codes part-# folder name
[ https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce reassigned NUTCH-2165: Assignee: Michael Joyce > FileDumper Util hard codes part-# folder name > - > > Key: NUTCH-2165 > URL: https://issues.apache.org/jira/browse/NUTCH-2165 > Project: Nutch > Issue Type: Bug > Components: tool >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 2.4, 1.11 > > > Hi folks, [~lewismc] and I were just discussing this off list. It seems that > the part-# folders seem to be hard coded to part-0 in the [FileDumper > utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167] > which could prove problematic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2165) FileDumper Util hard codes part-# folder name
[ https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000910#comment-15000910 ] Michael Joyce commented on NUTCH-2165: -- Oh aye > FileDumper Util hard codes part-# folder name > - > > Key: NUTCH-2165 > URL: https://issues.apache.org/jira/browse/NUTCH-2165 > Project: Nutch > Issue Type: Bug > Components: tool >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 2.4, 1.11 > > > Hi folks, [~lewismc] and I were just discussing this off list. It seems that > the part-# folders seem to be hard coded to part-0 in the [FileDumper > utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167] > which could prove problematic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2165) FileDumper Util hard codes part-# folder name
[ https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2165: - Attachment: NUTCH-2165_joyce_11Nov2015.patch Patch attached > FileDumper Util hard codes part-# folder name > - > > Key: NUTCH-2165 > URL: https://issues.apache.org/jira/browse/NUTCH-2165 > Project: Nutch > Issue Type: Bug > Components: tool >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 2.4, 1.11 > > Attachments: NUTCH-2165_joyce_11Nov2015.patch > > > Hi folks, [~lewismc] and I were just discussing this off list. It seems that > the part-# folders seem to be hard coded to part-0 in the [FileDumper > utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167] > which could prove problematic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2165) FileDumper Util hard codes part-# folder name
[ https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000923#comment-15000923 ] Michael Joyce commented on NUTCH-2165: -- Note, the diff looks massive here. This is really just adding an extra loop over the parts directories in each segment directory. The tool could probably use a bit of cleanup love, but we can address that in a later patch. > FileDumper Util hard codes part-# folder name > - > > Key: NUTCH-2165 > URL: https://issues.apache.org/jira/browse/NUTCH-2165 > Project: Nutch > Issue Type: Bug > Components: tool >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 2.4, 1.11 > > Attachments: NUTCH-2165_joyce_11Nov2015.patch > > > Hi folks, [~lewismc] and I were just discussing this off list. It seems that > the part-# folders seem to be hard coded to part-0 in the [FileDumper > utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167] > which could prove problematic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
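The "extra loop over the parts directories" mentioned in the comment above can be sketched roughly as follows. This is only an illustration, not the attached patch: it assumes a segment layout of `<segment>/content/part-*`, uses `java.nio.file` instead of the Hadoop file APIs so that it stays self-contained, and the names `PartDirLister` and `listPartDirs` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: not part of Nutch.
class PartDirLister {

  /**
   * Returns every part-* directory under <segment>/content, sorted by name,
   * instead of assuming that a single hard coded part folder exists.
   * The "content" sub-directory name is an assumption about the layout.
   */
  static List<Path> listPartDirs(Path segmentDir) throws IOException {
    Path contentDir = segmentDir.resolve("content");
    List<Path> parts = new ArrayList<>();
    if (!Files.isDirectory(contentDir)) {
      return parts; // nothing to dump for this segment
    }
    // The glob matches names only, so plain files are filtered out below.
    try (DirectoryStream<Path> stream =
        Files.newDirectoryStream(contentDir, "part-*")) {
      for (Path candidate : stream) {
        if (Files.isDirectory(candidate)) {
          parts.add(candidate);
        }
      }
    }
    Collections.sort(parts);
    return parts;
  }
}
```

A dumper would then iterate over the returned list for each segment and read the data file in every part directory, rather than opening one fixed path.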
[jira] [Assigned] (NUTCH-2150) Add ProtocolStatus Utility
[ https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce reassigned NUTCH-2150: Assignee: Michael Joyce (was: Chris A. Mattmann) > Add ProtocolStatus Utility > -- > > Key: NUTCH-2150 > URL: https://issues.apache.org/jira/browse/NUTCH-2150 > Project: Nutch > Issue Type: Improvement > Components: util >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.11 > > > It would be nice to have a utility for dumping protocol status code > information for a crawl database. This will be a utility for getting a dump > of the protocol status codes that builds off of NUTCH-2129 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (NUTCH-1911) Improve DomainStatistics tool command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce reassigned NUTCH-1911: Assignee: Michael Joyce (was: Chris A. Mattmann) > Improve DomainStatistics tool command line parsing > -- > > Key: NUTCH-1911 > URL: https://issues.apache.org/jira/browse/NUTCH-1911 > Project: Nutch > Issue Type: Bug > Components: util >Affects Versions: 1.9, 2.2.1 >Reporter: Lewis John McGibbney >Assignee: Michael Joyce >Priority: Trivial > Fix For: 1.10 > > > The DomainStatistics tool could be improved based on the comments raised > in [this mail > thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html] > For convenience, I've also pasted them below > {quote} > You cannot just tell it where the crawldb is, you need to tell it where the > directory is, so specifying current is ok, but not part-* > {quote} > Patch should be trivial work -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1911) Improve DomainStatistics tool command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-1911: - Summary: Improve DomainStatistics tool command line parsing (was: Imeprove DomainStatistics tool command line parsing) > Improve DomainStatistics tool command line parsing > -- > > Key: NUTCH-1911 > URL: https://issues.apache.org/jira/browse/NUTCH-1911 > Project: Nutch > Issue Type: Bug > Components: util >Affects Versions: 1.9, 2.2.1 >Reporter: Lewis John McGibbney >Assignee: Chris A. Mattmann >Priority: Trivial > Fix For: 1.10 > > > The DomainStatistics tool could be improved based on the comments raised > in [this mail > thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html] > For convenience, I've also pasted them below > {quote} > You cannot just tell it where the crawldb is, you need to tell it where the > directory is, so specifying current is ok, but not part-* > {quote} > Patch should be trivial work -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce reassigned NUTCH-2155: Assignee: Michael Joyce (was: Chris A. Mattmann) > Create a "crawl completeness" utility > - > > Key: NUTCH-2155 > URL: https://issues.apache.org/jira/browse/NUTCH-2155 > Project: Nutch > Issue Type: Improvement > Components: util >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Labels: memex > Fix For: 1.11 > > > I've found it useful to have a tool for dumping some "completeness" > information from a crawl similar to how domainstats does but including > fetched and unfetched counts per domain/host. This is especially nice when > doing vertical crawls over a few domains or just to see how much of a > host/domain you've covered with your crawl so far. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1911) Improve DomainStatistics tool command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-1911: - Fix Version/s: 1.10 > Improve DomainStatistics tool command line parsing > -- > > Key: NUTCH-1911 > URL: https://issues.apache.org/jira/browse/NUTCH-1911 > Project: Nutch > Issue Type: Bug > Components: util >Affects Versions: 1.9, 2.2.1 >Reporter: Lewis John McGibbney >Assignee: Michael Joyce >Priority: Trivial > Fix For: 1.10, 1.11 > > > The DomainStatistics tool could be improved based on the comments raised > in [this mail > thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html] > For convenience, I've also pasted them below > {quote} > You cannot just tell it where the crawldb is, you need to tell it where the > directory is, so specifying current is ok, but not part-* > {quote} > Patch should be trivial work -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2150) Add ProtocolStatus Utility
[ https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2150: - Attachment: NUTCH-2015_joyce_9Nov2015.patch Patch attached to clean up help formatting and drop need for "current" in crawldb path > Add ProtocolStatus Utility > -- > > Key: NUTCH-2150 > URL: https://issues.apache.org/jira/browse/NUTCH-2150 > Project: Nutch > Issue Type: Improvement > Components: util >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Fix For: 1.11 > > Attachments: NUTCH-2015_joyce_9Nov2015.patch > > > It would be nice to have a utility for dumping protocol status code > information for a crawl database. This will be a utility for getting a dump > of the protocol status codes that builds off of NUTCH-2129 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2155: - Attachment: NUTCH-2155_joyce_9Nov2015.patch Patch attached to address "current" requirements in crawldb path and add more helpful "help" info. > Create a "crawl completeness" utility > - > > Key: NUTCH-2155 > URL: https://issues.apache.org/jira/browse/NUTCH-2155 > Project: Nutch > Issue Type: Improvement > Components: util >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce > Labels: memex > Fix For: 1.11 > > Attachments: NUTCH-2155_joyce_9Nov2015.patch > > > I've found it useful to have a tool for dumping some "completeness" > information from a crawl similar to how domainstats does but including > fetched and unfetched counts per domain/host. This is especially nice when > doing vertical crawls over a few domains or just to see how much of a > host/domain you've covered with your crawl so far. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1911) Improve DomainStatistics tool command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-1911: - Attachment: NUTCH-1911_joyce_9Nov2015.patch Attaching a more recent patch that also removes the requirement for the "current" folder > Improve DomainStatistics tool command line parsing > -- > > Key: NUTCH-1911 > URL: https://issues.apache.org/jira/browse/NUTCH-1911 > Project: Nutch > Issue Type: Bug > Components: util >Affects Versions: 1.9, 2.2.1 >Reporter: Lewis John McGibbney >Assignee: Michael Joyce >Priority: Trivial > Fix For: 1.10, 1.11 > > Attachments: NUTCH-1911_joyce_9Nov2015.patch, > NUTCH-1911_joyce_9Nov2015.patch > > > The DomainStatistics tool could be improved based on the comments raised > in [this mail > thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html] > For convenience, I've also pasted them below > {quote} > You cannot just tell it where the crawldb is, you need to tell it where the > directory is, so specifying current is ok, but not part-* > {quote} > Patch should be trivial work -- This message was sent by Atlassian JIRA (v6.3.4#6332)
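One way to drop the requirement for the "current" folder is to normalize whatever path the user supplies before reading the crawldb. The following is a minimal sketch under that assumption and is not taken from the attached patch; `CrawlDbPathSketch` and `resolveCurrent` are hypothetical names, and it uses `java.nio.file` rather than Hadoop's path classes to stay self-contained.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: not part of Nutch.
class CrawlDbPathSketch {

  /**
   * Accepts either ".../crawldb" or ".../crawldb/current" and always
   * returns the path that ends in "current".
   */
  static Path resolveCurrent(String crawlDb) {
    Path path = Paths.get(crawlDb).normalize();
    Path last = path.getFileName();
    if (last != null && last.toString().equals("current")) {
      return path; // the user already pointed at the "current" folder
    }
    return path.resolve("current");
  }
}
```

With this in place, both `crawl/crawldb` and `crawl/crawldb/current/` resolve to the same directory, so the tool no longer depends on which form the user typed.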
[jira] [Updated] (NUTCH-1911) Improve DomainStatistics tool command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-1911: - Fix Version/s: (was: 1.10) 1.11 > Improve DomainStatistics tool command line parsing > -- > > Key: NUTCH-1911 > URL: https://issues.apache.org/jira/browse/NUTCH-1911 > Project: Nutch > Issue Type: Bug > Components: util >Affects Versions: 1.9, 2.2.1 >Reporter: Lewis John McGibbney >Assignee: Michael Joyce >Priority: Trivial > Fix For: 1.11 > > > The DomainStatistics tool could be improved based on the comments raised > in [this mail > thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html] > For convenience, I've also pasted them below > {quote} > You cannot just tell it where the crawldb is, you need to tell it where the > directory is, so specifying current is ok, but not part-* > {quote} > Patch should be trivial work -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1911) Improve DomainStatistics tool command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-1911: - Attachment: NUTCH-1911_joyce_9Nov2015.patch Going to resubmit the attached patch to get these changes back in the code base. > Improve DomainStatistics tool command line parsing > -- > > Key: NUTCH-1911 > URL: https://issues.apache.org/jira/browse/NUTCH-1911 > Project: Nutch > Issue Type: Bug > Components: util >Affects Versions: 1.9, 2.2.1 >Reporter: Lewis John McGibbney >Assignee: Michael Joyce >Priority: Trivial > Fix For: 1.10, 1.11 > > Attachments: NUTCH-1911_joyce_9Nov2015.patch > > > The DomainStatistics tool could be improved based on the comments raised > in [this mail > thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html] > For convenience, I've also pasted them below > {quote} > You cannot just tell it where the crawldb is, you need to tell it where the > directory is, so specifying current is ok, but not part-* > {quote} > Patch should be trivial work -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985431#comment-14985431 ] Michael Joyce commented on NUTCH-2155: -- +1 sounds good to me [~sebastien0], I will update it in a patch shortly > Create a "crawl completeness" utility > - > > Key: NUTCH-2155 > URL: https://issues.apache.org/jira/browse/NUTCH-2155 > Project: Nutch > Issue Type: Improvement > Components: util >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Chris A. Mattmann > Labels: memex > Fix For: 1.11 > > > I've found it useful to have a tool for dumping some "completeness" > information from a crawl similar to how domainstats does but including > fetched and unfetched counts per domain/host. This is especially nice when > doing vertical crawls over a few domains or just to see how much of a > host/domain you've covered with your crawl so far. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2150) Add ProtocolStatus Utility
[ https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985427#comment-14985427 ] Michael Joyce commented on NUTCH-2150: -- Yes, will address in a patch shortly. > Add ProtocolStatus Utility > -- > > Key: NUTCH-2150 > URL: https://issues.apache.org/jira/browse/NUTCH-2150 > Project: Nutch > Issue Type: Improvement > Components: util >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Chris A. Mattmann > Fix For: 1.11 > > > It would be nice to have a utility for dumping protocol status code > information for a crawl database. This will be a utility for getting a dump > of the protocol status codes that builds off of NUTCH-2129 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1911) Imeprove DomainStatistics tool command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985436#comment-14985436 ] Michael Joyce commented on NUTCH-1911: -- Hrm odd, I want to throw some commons-cli at a few of the utilities anyway so I'll just address this there. > Imeprove DomainStatistics tool command line parsing > --- > > Key: NUTCH-1911 > URL: https://issues.apache.org/jira/browse/NUTCH-1911 > Project: Nutch > Issue Type: Bug > Components: util >Affects Versions: 1.9, 2.2.1 >Reporter: Lewis John McGibbney >Assignee: Chris A. Mattmann >Priority: Trivial > Fix For: 1.10 > > > The DomainStatistics tool could be improved based on the comments raised > in [this mail > thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html] > For convenience, I've also pasted them below > {quote} > You cannot just tell it where the crawldb is, you need to tell it where the > directory is, so specifying current is ok, but not part-* > {quote} > Patch should be trivial work -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility
Michael Joyce created NUTCH-2155: Summary: Create a "crawl completeness" utility Key: NUTCH-2155 URL: https://issues.apache.org/jira/browse/NUTCH-2155 Project: Nutch Issue Type: Improvement Components: util Affects Versions: 1.10 Reporter: Michael Joyce Fix For: 1.12 I've found it useful to have a tool for dumping some "completeness" information from a crawl similar to how domainstats does but including fetched and unfetched counts per domain/host. This is especially nice when doing vertical crawls over a few domains or just to see how much of a host/domain you've covered with your crawl so far. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
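The kind of "completeness" summary described above, fetched and unfetched counts per host, can be sketched in plain Java as follows. This is only an illustration of the counting, not the utility itself: a real tool would read fetch status from the crawl database, whereas this sketch assumes a simple in-memory map from URL to a fetched flag, and `CompletenessSketch` and `countByHost` are hypothetical names.

```java
import java.net.URI;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: not part of Nutch.
class CompletenessSketch {

  /**
   * Groups URLs by host and counts them.
   * In each returned array, index 0 is the fetched count and index 1 the
   * unfetched count. URLs without a host component are skipped;
   * URI.create throws IllegalArgumentException on malformed input.
   */
  static Map<String, int[]> countByHost(Map<String, Boolean> urlToFetched) {
    Map<String, int[]> counts = new TreeMap<>();
    for (Map.Entry<String, Boolean> entry : urlToFetched.entrySet()) {
      String host = URI.create(entry.getKey()).getHost();
      if (host == null) {
        continue;
      }
      int[] hostCounts = counts.computeIfAbsent(host, key -> new int[2]);
      if (entry.getValue()) {
        hostCounts[0]++;
      } else {
        hostCounts[1]++;
      }
    }
    return counts;
  }
}
```

Grouping by domain instead of host would only change how the key is derived from each URL; the counting stays the same.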
[jira] [Commented] (NUTCH-2155) Create a "crawl completeness" utility
[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14979196#comment-14979196 ] Michael Joyce commented on NUTCH-2155: -- Should have a first patch up shortly for review folks > Create a "crawl completeness" utility > - > > Key: NUTCH-2155 > URL: https://issues.apache.org/jira/browse/NUTCH-2155 > Project: Nutch > Issue Type: Improvement > Components: util >Affects Versions: 1.10 >Reporter: Michael Joyce > Fix For: 1.12 > > > I've found it useful to have a tool for dumping some "completeness" > information from a crawl similar to how domainstats does but including > fetched and unfetched counts per domain/host. This is especially nice when > doing vertical crawls over a few domains or just to see how much of a > host/domain you've covered with your crawl so far. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2150) Add ProtocolStatus Utility
Michael Joyce created NUTCH-2150: Summary: Add ProtocolStatus Utility Key: NUTCH-2150 URL: https://issues.apache.org/jira/browse/NUTCH-2150 Project: Nutch Issue Type: Improvement Components: util Affects Versions: 1.10 Reporter: Michael Joyce Fix For: 1.12 It would be nice to have a utility for dumping protocol status code information for a crawl database. This will be a utility for getting a dump of the protocol status codes that builds off of NUTCH-2129 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2150) Add ProtocolStatus Utility
[ https://issues.apache.org/jira/browse/NUTCH-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977036#comment-14977036 ] Michael Joyce commented on NUTCH-2150: -- Hi folks, PR is up for this. You can run the util with something similar to the following
{code}
./bin/nutch protocolstats crawl/crawldb/current/ ./output
{code}
And that will get you something along the lines of
{code}
38 200
19 301
2 302
665 UNFETCHED
{code}
Let me know if you have any questions! > Add ProtocolStatus Utility > -- > > Key: NUTCH-2150 > URL: https://issues.apache.org/jira/browse/NUTCH-2150 > Project: Nutch > Issue Type: Improvement > Components: util >Affects Versions: 1.10 >Reporter: Michael Joyce > Fix For: 1.12 > > > It would be nice to have a utility for dumping protocol status code > information for a crawl database. This will be a utility for getting a dump > of the protocol status codes that builds off of NUTCH-2129 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content
[ https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959659#comment-14959659 ] Michael Joyce commented on NUTCH-2141: -- Cool makes sense. Do you have any examples? I'd like to poke as well. You're going to need to handle the screenshot functionality differently as well. getHTMLContent does more than just return the body content. We probably don't really need the DefalultMultiInteractionHandler example either if this basically replaces that. [~asitang] might have some ideas as well. > Change the InteractiveSelenium plugin handler Interface to return page content > -- > > Key: NUTCH-2141 > URL: https://issues.apache.org/jira/browse/NUTCH-2141 > Project: Nutch > Issue Type: Improvement > Components: plugin >Reporter: Balaji Gurumurthy > Labels: selenium > > The handler interface in the protocol-interactiveselenium plugin currently > provides methods to manipulate the page content and the HTTPResponse class > reads the page content from the driver. This limits the amount of HTML > content that could be returned to Nutch. > The processDriver method could return a String object instead. This is > particularly helpful in cases such as handling pagination when multiple > pages' content can be appended and returned from the handler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content
[ https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959345#comment-14959345 ] Michael Joyce commented on NUTCH-2141: -- This was actually brought up in NUTCH-2108. There's also an [example handler | https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefalultMultiInteractionHandler.java] that was added to illustrate that as well. The handler won't actually be run multiple times, so if you need to return concatenated content you need to do it in the handler and make sure it's returned appropriately. > Change the InteractiveSelenium plugin handler Interface to return page content > -- > > Key: NUTCH-2141 > URL: https://issues.apache.org/jira/browse/NUTCH-2141 > Project: Nutch > Issue Type: Improvement > Components: plugin >Reporter: Balaji Gurumurthy > Labels: selenium > > The handler interface in the protocol-interactiveselenium plugin currently > provides methods to manipulate the page content and the HTTPResponse class > reads the page content from the driver. This limits the amount of HTML > content that could be returned to Nutch. > The processDriver method could return a String object instead. This is > particularly helpful in cases such as handling pagination when multiple > pages' content can be appended and returned from the handler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947002#comment-14947002 ] Michael Joyce commented on NUTCH-2129: -- Fixed the unnecessary init that [~jnioche] caught. Thanks much for the reviews. > Track Protocol Status in Crawl Datum > > > Key: NUTCH-2129 > URL: https://issues.apache.org/jira/browse/NUTCH-2129 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce > Fix For: 2.4, 1.11 > > > It's become necessary on a few crawls that I run to get protocol status code > stats. After speaking with [~lewismc] it seemed that there might not be a > super convenient way of doing this as is, but it would be great to be able to > add the functionality necessary to pull this information out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2133) Transfer Selenium Documentation to WIki
Michael Joyce created NUTCH-2133: Summary: Transfer Selenium Documentation to WIki Key: NUTCH-2133 URL: https://issues.apache.org/jira/browse/NUTCH-2133 Project: Nutch Issue Type: Improvement Components: documentation Affects Versions: 1.10, 2.3 Reporter: Michael Joyce Fix For: 2.4, 1.11 There's a decent chunk of Selenium related documentation stuck in READMEs for various plugins. It would be nice to get this stuff pushed to the wiki. E.g.: https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-selenium/README.md -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945766#comment-14945766 ] Michael Joyce commented on NUTCH-2129: -- Hey folks, updated PR with the metadata approach for HTTP and FTP. Let me know if you have any concerns. > Track Protocol Status in Crawl Datum > > > Key: NUTCH-2129 > URL: https://issues.apache.org/jira/browse/NUTCH-2129 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce > Fix For: 2.4, 1.11 > > > It's become necessary on a few crawls that I run to get protocol status code > stats. After speaking with [~lewismc] it seemed that there might not be a > super convenient way of doing this as is, but it would be great to be able to > add the functionality necessary to pull this information out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939939#comment-14939939 ] Michael Joyce commented on NUTCH-2129: -- Thanks Julien. I figured there would probably be a few thoughts on this, so I appreciate the feedback. I'll checkout the stuff you mentioned. Thanks for the ideas. > Track Protocol Status in Crawl Datum > > > Key: NUTCH-2129 > URL: https://issues.apache.org/jira/browse/NUTCH-2129 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce > Fix For: 2.4, 1.11 > > > It's become necessary on a few crawls that I run to get protocol status code > stats. After speaking with [~lewismc] it seemed that there might not be a > super convenient way of doing this as is, but it would be great to be able to > add the functionality necessary to pull this information out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2108) Add a function to the selenium interactive plugin interface to do multiple manipulation of driver and then return the data
[ https://issues.apache.org/jira/browse/NUTCH-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940036#comment-14940036 ] Michael Joyce commented on NUTCH-2108: -- Good stuff [~asitang], glad to see the workaround proved fruitful and great example handlers! > Add a function to the selenium interactive plugin interface to do multiple > manipulation of driver and then return the data > -- > > Key: NUTCH-2108 > URL: https://issues.apache.org/jira/browse/NUTCH-2108 > Project: Nutch > Issue Type: Sub-task > Components: fetcher >Affects Versions: 1.10 >Reporter: Asitang Mishra > Labels: memex > > In the interactive selenium plugin we have to create handler classes for each > manipulation of a page. Sometimes we need to manipulate a page in many ways > and keep track of those manipulations. Like clicking on say each link in a > table and then refreshing to get the original page back as even one click can > make all other links go away. This can be done in a single loop. Which will > be a little too much work and way complicated using multiple handlers. So, I > am proposing a new function "String multiProcessDriver(WebDriver driver)" > that takes the driver and returns a concatenated String along with the > already present "void processDriver(WebDriver driver)". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2129) Track Protocol Status in Crawl Datum
Michael Joyce created NUTCH-2129: Summary: Track Protocol Status in Crawl Datum Key: NUTCH-2129 URL: https://issues.apache.org/jira/browse/NUTCH-2129 Project: Nutch Issue Type: Improvement Affects Versions: 1.10, 2.3 Reporter: Michael Joyce Fix For: 2.4, 1.11 It's become necessary on a few crawls that I run to get protocol status code stats. After speaking with [~lewismc] it seemed that there might not be a super convenient way of doing this as is, but it would be great to be able to add the functionality necessary to pull this information out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939124#comment-14939124 ] Michael Joyce commented on NUTCH-2129: -- Hi folks, Initial pull request up to address this. Note that at the moment this only includes updates such that HTTP status codes are saved. I figured it would be best to get a conversation started on this before I dive into it too much since it's a rather core data structure. Thoughts or ideas on this folks? > Track Protocol Status in Crawl Datum > > > Key: NUTCH-2129 > URL: https://issues.apache.org/jira/browse/NUTCH-2129 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce > Fix For: 2.4, 1.11 > > > It's become necessary on a few crawls that I run to get protocol status code > stats. After speaking with [~lewismc] it seemed that there might not be a > super convenient way of doing this as is, but it would be great to be able to > add the functionality necessary to pull this information out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2115) Add total counts to dump stats
Michael Joyce created NUTCH-2115: Summary: Add total counts to dump stats Key: NUTCH-2115 URL: https://issues.apache.org/jira/browse/NUTCH-2115 Project: Nutch Issue Type: Improvement Components: dumpers, util Affects Versions: 1.10 Reporter: Michael Joyce Priority: Minor Fix For: 1.11 It would be nice if the "dump" tool included total counts for the mimetype stats that it gives. Something along the lines of the following would be great when you have to deal with some larger crawls and don't want to bother doing the math yourself.
{code}
Dumper File Stats:
TOTAL Stats:
[
{"mimeType":"application/xhtml+xml","count":"2"}
{"mimeType":"application/octet-stream","count":"1"}
{"mimeType":"text/html","count":"23"}
]
Total count: 26
FILTERED Stats:
[
{"mimeType":"text/html","count":"23"}
]
Total filtered count: 23
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
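The totals requested above are just the sum of the per-mimetype counts. A minimal sketch of that arithmetic follows; it assumes the stats are already available as a map from mimetype to count, which may not match how the dump tool stores them internally, and `DumpStatsTotals` and `total` are hypothetical names.

```java
import java.util.Map;

// Hypothetical sketch: not part of Nutch.
class DumpStatsTotals {

  /** Sums the per-mimetype counts to get the "Total count" line. */
  static int total(Map<String, Integer> countsByMimeType) {
    int sum = 0;
    for (int count : countsByMimeType.values()) {
      sum += count;
    }
    return sum;
  }
}
```

For the example in the ticket, the three TOTAL entries (2, 1 and 23) sum to 26, and the single FILTERED entry gives 23.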
[jira] [Commented] (NUTCH-2115) Add total counts to dump stats
[ https://issues.apache.org/jira/browse/NUTCH-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905156#comment-14905156 ] Michael Joyce commented on NUTCH-2115: -- Cheers [~lewismc], thanks for the quick merge! > Add total counts to dump stats > -- > > Key: NUTCH-2115 > URL: https://issues.apache.org/jira/browse/NUTCH-2115 > Project: Nutch > Issue Type: Improvement > Components: dumpers, util >Affects Versions: 1.10 >Reporter: Michael Joyce >Assignee: Michael Joyce >Priority: Minor > Fix For: 1.11 > > > It would be nice if the "dump" tool included total counts for the mimetype > stats that it gives. Something along the lines of the following would be > great when you have to deal with some larger crawls and don't want to bother > doing the math yourself. > {code} > Dumper File Stats: > TOTAL Stats: > [ > {"mimeType":"application/xhtml+xml","count":"2"} > {"mimeType":"application/octet-stream","count":"1"} > {"mimeType":"text/html","count":"23"} > ] > Total count: 26 > FILTERED Stats: > [ > {"mimeType":"text/html","count":"23"} > ] > Total filtered count: 23 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2077) Upgrade to Tika 1.10
[ https://issues.apache.org/jira/browse/NUTCH-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720279#comment-14720279 ] Michael Joyce commented on NUTCH-2077: -- Hey folks, updated tika to 1.10. If there was other stuff this ticket was hoping to address let me know and I'll update the patch. Upgrade to Tika 1.10 Key: NUTCH-2077 URL: https://issues.apache.org/jira/browse/NUTCH-2077 Project: Nutch Issue Type: Improvement Reporter: Tyler Palsulich -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2088) Add Optional Execution to Interactive Selenium Handlers
Michael Joyce created NUTCH-2088: Summary: Add Optional Execution to Interactive Selenium Handlers Key: NUTCH-2088 URL: https://issues.apache.org/jira/browse/NUTCH-2088 Project: Nutch Issue Type: Improvement Components: plugin Affects Versions: 1.10 Reporter: Michael Joyce Fix For: 1.11 At the moment, all the Handlers run for every URL when using the interactive-selenium plugin. Often times when trying to do a deep crawl of a site you'll want to handle various subdomains and paths/files differently. You can effectively filter in the handlers at the moment, but only once you've loaded the WebDriver and incurred the associated overhead. It would be much nicer if the handler interface allowed for this check to occur prior to the request to retrieve page content. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
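The idea of checking a handler against the URL before paying for the WebDriver can be sketched as below. This is an assumption-laden illustration rather than the plugin's actual interface: `SketchHandler`, `HandlerDispatcher`, and the method names are hypothetical, and the driver is represented by a plain `Object` produced by a `Supplier` so that the example runs without Selenium on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a handler interface with an up-front URL check.
interface SketchHandler {
  boolean shouldProcessURL(String url);

  void processDriver(Object driver);
}

// Hypothetical dispatcher: not part of Nutch.
class HandlerDispatcher {

  /**
   * Runs only the handlers that accept the URL and creates the (expensive)
   * driver lazily, so URLs that no handler wants never incur that overhead.
   * Returns the number of handlers that ran.
   */
  static int dispatch(String url, List<SketchHandler> handlers,
      Supplier<Object> driverFactory) {
    List<SketchHandler> matching = new ArrayList<>();
    for (SketchHandler handler : handlers) {
      if (handler.shouldProcessURL(url)) {
        matching.add(handler);
      }
    }
    if (matching.isEmpty()) {
      return 0; // skip driver creation entirely
    }
    Object driver = driverFactory.get();
    for (SketchHandler handler : matching) {
      handler.processDriver(driver);
    }
    return matching.size();
  }
}
```

The design choice here is that the filter sees only the URL string, which keeps the check cheap and lets each handler decide which subdomains or paths it cares about.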
[jira] [Commented] (NUTCH-2082) Upgrade to Apache Tika 1.10
[ https://issues.apache.org/jira/browse/NUTCH-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14703153#comment-14703153 ] Michael Joyce commented on NUTCH-2082: -- FYI, this is a duplicate of NUTCH-2077 I think [~lewismc]. Upgrade to Apache Tika 1.10 --- Key: NUTCH-2082 URL: https://issues.apache.org/jira/browse/NUTCH-2082 Project: Nutch Issue Type: Improvement Components: build, plugin Affects Versions: 2.3, 1.10 Reporter: Lewis John McGibbney Fix For: 1.11, 2.3.1 Tika 1.10 is hot http://search.maven.org/#artifactdetails|org.apache.tika|tika|1.10|pom Lets upgrade -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701707#comment-14701707 ] Michael Joyce commented on NUTCH-2049: -- Great stuff Lewis. Builds and runs cleanly locally for me. I also scoped a test that was run on EMR with 2.4.0 and all looks good. Upgrade Trunk to Hadoop 2.4 stable Key: NUTCH-2049 URL: https://issues.apache.org/jira/browse/NUTCH-2049 Project: Nutch Issue Type: Improvement Components: build Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Labels: memex Fix For: 1.11 Attachments: NUTCH-2049.patch, NUTCH-2049v2.patch Convo here - http://www.mail-archive.com/dev%40nutch.apache.org/msg18225.html I am +1 for taking trunk (or a branch of trunk) to explicit dependency on Hadoop 2.6. We can run our tests, we can validate, we can fix. I will be doing validation on 2.X in parallel as this is what I use on my own projects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2049) Upgrade Trunk to Hadoop 2.4 stable
[ https://issues.apache.org/jira/browse/NUTCH-2049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694210#comment-14694210 ] Michael Joyce commented on NUTCH-2049: -- Hey [~lewismc], Tried your patch here. Seems I have to add the following to the ivy.xml file to get this to work at all {code} <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="2.4.0" conf="*->default"/> {code} Otherwise, I end up getting the following when I try to run a test crawl {code} Injector: starting at 2015-08-12 15:04:42 Injector: crawlDb: crawl/crawldb Injector: urlDir: ../../urls_test Injector: Converting injected urls to crawl db entries. Injector: java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120) at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:82) at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:75) at org.apache.hadoop.mapred.JobClient.init(JobClient.java:470) at org.apache.hadoop.mapred.JobClient.init(JobClient.java:449) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:832) at org.apache.nutch.crawl.Injector.inject(Injector.java:323) at org.apache.nutch.crawl.Injector.run(Injector.java:379) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.nutch.crawl.Injector.main(Injector.java:369) {code} However, after addressing that concern I end up running into the following on the test crawl {code} java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.SequenceFile$Writer$KeyClassOption cannot be cast to org.apache.hadoop.io.MapFile$Writer$Option at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529) Caused by: java.lang.ClassCastException: org.apache.hadoop.io.SequenceFile$Writer$KeyClassOption cannot be cast to org.apache.hadoop.io.MapFile$Writer$Option at org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(FetcherOutputFormat.java:70) at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.init(ReduceTask.java:484) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392) at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-08-12 14:24:39,906 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836) at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:496) at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:532) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:505) {code} Upgrade Trunk to Hadoop 2.4 stable Key: NUTCH-2049 URL: https://issues.apache.org/jira/browse/NUTCH-2049 Project: Nutch Issue Type: Improvement Components: build Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.11 Attachments: NUTCH-2049.patch Convo here - http://www.mail-archive.com/dev%40nutch.apache.org/msg18225.html I am +1 for taking trunk (or a branch of trunk) to explicit dependency on Hadoop 2.6. We can run our tests, we can validate, we can fix. I will be doing validation on 2.X in parallel as this is what I use on my own projects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646423#comment-14646423 ] Michael Joyce commented on NUTCH-2062: -- Hi folks, Is there something I need to do to get this merged? Anything missing from the updated pull request? I'm happy to update as needed! Add Plugin for interacting with Selenium WebDriver -- Key: NUTCH-2062 URL: https://issues.apache.org/jira/browse/NUTCH-2062 Project: Nutch Issue Type: Improvement Components: plugin Affects Versions: 1.10 Reporter: Michael Joyce Assignee: Michael Joyce Labels: memex Fix For: 1.11 Attachments: NUTCH-2062v2.patch The protocol-selenium plugin is great for pulling webpages that dynamically load content. However, I've run into use cases where I need to actively interact with a page in Selenium before it becomes useful. For instance, I may need to paginate through a table to get all results that I'm interested in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646488#comment-14646488 ] Michael Joyce edited comment on NUTCH-2062 at 7/29/15 5:50 PM: --- Cheers Chris, responded on the PR. Also, where are you getting a direct link for the comment from?? Edit: Nevermind, I see where you got it from. I never noticed that before. https://github.com/apache/nutch/pull/46#issuecomment-126030222 was (Author: mjoyce): Cheers Chris, responded on the PR. Also, where are you getting a direct link for the comment from?? Add Plugin for interacting with Selenium WebDriver -- Key: NUTCH-2062 URL: https://issues.apache.org/jira/browse/NUTCH-2062 Project: Nutch Issue Type: Improvement Components: plugin Affects Versions: 1.10 Reporter: Michael Joyce Assignee: Chris A. Mattmann Labels: memex Fix For: 1.11 Attachments: NUTCH-2062v2.patch The protocol-selenium plugin is great for pulling webpages that dynamically load content. However, I've run into use cases where I need to actively interact with a page in Selenium before it becomes useful. For instance, I may need to paginate through a table to get all results that I'm interested in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646488#comment-14646488 ] Michael Joyce commented on NUTCH-2062: -- Cheers Chris, responded on the PR. Also, where are you getting a direct link for the comment from?? Add Plugin for interacting with Selenium WebDriver -- Key: NUTCH-2062 URL: https://issues.apache.org/jira/browse/NUTCH-2062 Project: Nutch Issue Type: Improvement Components: plugin Affects Versions: 1.10 Reporter: Michael Joyce Assignee: Chris A. Mattmann Labels: memex Fix For: 1.11 Attachments: NUTCH-2062v2.patch The protocol-selenium plugin is great for pulling webpages that dynamically load content. However, I've run into use cases where I need to actively interact with a page in Selenium before it becomes useful. For instance, I may need to paginate through a table to get all results that I'm interested in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml
[ https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2048: - Attachment: NUTCH-2048_Joyce_20150727.patch Updated the patch to set the sync attribute on retrieve so the lib directory should stay clean now. parse-tika: fix dependencies in plugin.xml -- Key: NUTCH-2048 URL: https://issues.apache.org/jira/browse/NUTCH-2048 Project: Nutch Issue Type: Improvement Affects Versions: 1.10 Reporter: Sebastian Nagel Priority: Trivial Fix For: 1.11 Attachments: NUTCH-2048_Joyce_20150723.patch, NUTCH-2048_Joyce_20150723_2.patch, NUTCH-2048_Joyce_20150727.patch Duplicate library dependencies listed in parse-tika's plugin.xml should be cleaned up. There are duplicates where only the version differs, e.g.: {noformat} tika-parsers-1.7.jar tika-parsers-1.8.jar {noformat} Not critical because libs which are not present should be just ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1936) GSoC 2015 - Move Nutch to Hadoop 2.X
[ https://issues.apache.org/jira/browse/NUTCH-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14641210#comment-14641210 ] Michael Joyce commented on NUTCH-1936: -- Ah this is absolutely awesome Lewis. Great job on this. GSoC 2015 - Move Nutch to Hadoop 2.X Key: NUTCH-1936 URL: https://issues.apache.org/jira/browse/NUTCH-1936 Project: Nutch Issue Type: Task Components: build Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Labels: gsoc2015 Fix For: 2.4, 1.11 Attachments: NUTCH-1939.patch The Nutch PMC [discussed|http://www.mail-archive.com/dev%40nutch.apache.org/msg16250.html] ideas for a good 2015 GSoC project. It appears that porting the (trunk) codebase to [Hadoop 2.X|http://hadoop.apache.org/docs/stable/] seems to be an attractive option and one which would present an excellent learning experience for a summer student. A more comprehensive description of this issue should be included within either a mentor-defined project description or a successful student application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml
[ https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639462#comment-14639462 ] Michael Joyce commented on NUTCH-2048: -- Alright, hopefully this one is a bit more on track =D As for plugin dependencies docs, here's a huge +1 from me. I don't know that I'm necessarily versed enough in the build to do it myself but it would be great to get up on the wiki. As for Tika upgrades, there's actually a how-to in the parse-tika folder. I went through that and ended up with the current patch which seems to have addressed the duplicate dependency issues. Given the instructions I'm not really certain how we ended up with the duplicates in the first place though. Maybe the doc is a recent addition {code} 1. Upgrade Tika dependency in trunk/ivy/ivy.xml 2. Upgrade Tika dependency in src/plugin/parse-tika/ivy.xml 3. Upgrade Tika's own dependencies in src/plugin/parse-tika/plugin.xml To get the list of dependencies and their versions execute: $ ant -f ./build-ivy.xml $ ls lib | sed 's/^/ <library name="/g' | sed 's/$/"\/>/g' {code} parse-tika: fix dependencies in plugin.xml -- Key: NUTCH-2048 URL: https://issues.apache.org/jira/browse/NUTCH-2048 Project: Nutch Issue Type: Improvement Affects Versions: 1.10 Reporter: Sebastian Nagel Priority: Trivial Fix For: 1.11 Attachments: NUTCH-2048_Joyce_20150723.patch, NUTCH-2048_Joyce_20150723_2.patch Duplicate library dependencies listed in parse-tika's plugin.xml should be cleaned up. There are duplicates where only the version differs, e.g.: {noformat} tika-parsers-1.7.jar tika-parsers-1.8.jar {noformat} Not critical because libs which are not present should be just ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml
[ https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2048: - Attachment: NUTCH-2048_Joyce_20150723_2.patch Patch #2 up. Explanation to follow shortly. parse-tika: fix dependencies in plugin.xml -- Key: NUTCH-2048 URL: https://issues.apache.org/jira/browse/NUTCH-2048 Project: Nutch Issue Type: Improvement Affects Versions: 1.10 Reporter: Sebastian Nagel Priority: Trivial Fix For: 1.11 Attachments: NUTCH-2048_Joyce_20150723.patch, NUTCH-2048_Joyce_20150723_2.patch Duplicate library dependencies listed in parse-tika's plugin.xml should be cleaned up. There are duplicates where only the version differs, e.g.: {noformat} tika-parsers-1.7.jar tika-parsers-1.8.jar {noformat} Not critical because libs which are not present should be just ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml
[ https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14639396#comment-14639396 ] Michael Joyce commented on NUTCH-2048: -- Ah I clearly didn't pay enough attention to this [~wastl-nagel]. I was wondering why the heck you didn't just fix it yourself when you opened it ;) I'll see what I can do. parse-tika: fix dependencies in plugin.xml -- Key: NUTCH-2048 URL: https://issues.apache.org/jira/browse/NUTCH-2048 Project: Nutch Issue Type: Improvement Affects Versions: 1.10 Reporter: Sebastian Nagel Priority: Trivial Fix For: 1.11 Attachments: NUTCH-2048_Joyce_20150723.patch Duplicate library dependencies listed in parse-tika's plugin.xml should be cleaned up. There are duplicates where only the version differs, e.g.: {noformat} tika-parsers-1.7.jar tika-parsers-1.8.jar {noformat} Not critical because libs which are not present should be just ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2048) parse-tika: fix dependencies in plugin.xml
[ https://issues.apache.org/jira/browse/NUTCH-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2048: - Attachment: NUTCH-2048_Joyce_20150723.patch Quick patch up for this. parse-tika: fix dependencies in plugin.xml -- Key: NUTCH-2048 URL: https://issues.apache.org/jira/browse/NUTCH-2048 Project: Nutch Issue Type: Improvement Affects Versions: 1.10 Reporter: Sebastian Nagel Priority: Trivial Fix For: 1.11 Attachments: NUTCH-2048_Joyce_20150723.patch Duplicate library dependencies listed in parse-tika's plugin.xml should be cleaned up. There are duplicates where only the version differs, e.g.: {noformat} tika-parsers-1.7.jar tika-parsers-1.8.jar {noformat} Not critical because libs which are not present should be just ignored. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2063) Add -mimeStats flag to FileDumper tool
[ https://issues.apache.org/jira/browse/NUTCH-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2063: - Labels: memex (was: ) Add -mimeStats flag to FileDumper tool -- Key: NUTCH-2063 URL: https://issues.apache.org/jira/browse/NUTCH-2063 Project: Nutch Issue Type: Bug Components: dumpers Affects Versions: 1.10 Reporter: Lewis John McGibbney Assignee: Michael Joyce Labels: memex Fix For: 1.11 Attachments: nutch-2063-joyce-21July2015.patch Right now in order to get a MimeType distribution for any given number of segments, one is required to dump some data. This is a waste if one just wishes to see the mime type distribution across a number of segments. An improvement to the FileDumper tool would be the addition of a -mimeStats flag which would not attempt to dump any data but instead merely provide the total stats message providing insight into how the FileDumper should be best used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2004) ParseChecker does not handle redirects
[ https://issues.apache.org/jira/browse/NUTCH-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2004: - Labels: memex (was: ) ParseChecker does not handle redirects -- Key: NUTCH-2004 URL: https://issues.apache.org/jira/browse/NUTCH-2004 Project: Nutch Issue Type: Improvement Affects Versions: 1.9 Reporter: Michael Joyce Assignee: Michael Joyce Priority: Minor Labels: memex Fix For: 1.11 At the moment ParseChecker doesn't handle redirects. If it gets anything but a success status it errors out. It would be nice if it handled redirects a bit more gracefully based on the http.redirects config setting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
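The behaviour asked for here can be sketched as a bounded redirect loop: keep following the redirect target until a non-redirect status arrives or the configured limit is reached. This is only an illustration under stated assumptions, not ParseChecker's code: `Response`, `follow`, `fakeWeb`, and the `maxRedirects` parameter (standing in for the redirect setting the ticket mentions) are hypothetical, and a lookup table replaces real HTTP fetching.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class RedirectFollower {
    /** Hypothetical, simplified fetch result. */
    public static final class Response {
        public final int status;
        public final String location; // redirect target, or null
        public Response(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
            || status == 307 || status == 308;
    }

    /** Follows at most maxRedirects hops, then returns whatever came back last. */
    public static Response follow(String url, int maxRedirects,
                                  Function<String, Response> fetcher) {
        Response response = fetcher.apply(url);
        int hops = 0;
        while (isRedirect(response.status) && response.location != null
               && hops < maxRedirects) {
            response = fetcher.apply(response.location);
            hops++;
        }
        return response;
    }

    /** In-memory stand-in for the network: one page that redirects to another. */
    public static Function<String, Response> fakeWeb() {
        final Map<String, Response> pages = new HashMap<>();
        pages.put("http://example.com/old", new Response(301, "http://example.com/new"));
        pages.put("http://example.com/new", new Response(200, null));
        return url -> pages.getOrDefault(url, new Response(404, null));
    }

    public static void main(String[] args) {
        System.out.println(follow("http://example.com/old", 1, fakeWeb()).status);
        System.out.println(follow("http://example.com/old", 0, fakeWeb()).status);
    }
}
```

With a limit of 0 the caller still sees the redirect status itself, which matches the current "error out on anything but success" behaviour described in the ticket.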
[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636958#comment-14636958 ] Michael Joyce commented on NUTCH-2062: -- Cheers [~lewismc], let me see what I can do with regards to updating the PR with these updates. Add Plugin for interacting with Selenium WebDriver -- Key: NUTCH-2062 URL: https://issues.apache.org/jira/browse/NUTCH-2062 Project: Nutch Issue Type: Improvement Components: plugin Affects Versions: 1.10 Reporter: Michael Joyce Assignee: Michael Joyce Fix For: 1.11 Attachments: NUTCH-2062v2.patch The protocol-selenium plugin is great for pulling webpages that dynamically load content. However, I've run into use cases where I need to actively interact with a page in Selenium before it becomes useful. For instance, I may need to paginate through a table to get all results that I'm interested in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635389#comment-14635389 ] Michael Joyce commented on NUTCH-2062: -- Hi folks, Just wanted to elaborate a bit on what this does at the moment and what the point of it is. This plugin is effectively the protocol-selenium plugin but it allows for a handler to interact with the WebDriver before returning the page content. Handlers require a simple interface to be implemented. Which handler(s) are run is determined by setting the class name of the handler in a comma separated list in the config. For each URL, all the handlers are run in config-specified order. The resulting content from each driver is appended together and returned as the content. Add Plugin for interacting with Selenium WebDriver -- Key: NUTCH-2062 URL: https://issues.apache.org/jira/browse/NUTCH-2062 Project: Nutch Issue Type: Improvement Components: plugin Affects Versions: 1.10 Reporter: Michael Joyce Fix For: 1.11 The protocol-selenium plugin is great for pulling webpages that dynamically load content. However, I've run into use cases where I need to actively interact with a page in Selenium before it becomes useful. For instance, I may need to paginate through a table to get all results that I'm interested in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
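The flow described in this comment (a comma-separated list of handler class names, run in config order, outputs appended) could look roughly like the sketch below. Everything here is assumed for illustration: the `Handler` interface, the `demo.*` names, and the `REGISTRY` map (which stands in for loading classes by name) are not the plugin's real API, and a plain string replaces the WebDriver.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HandlerChain {
    /** Hypothetical handler contract; the page string stands in for a WebDriver. */
    public interface Handler {
        String processDriver(String page);
    }

    /** Stand-in for instantiating handler classes by name. */
    static final Map<String, Handler> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("demo.EchoHandler", page -> page);
        REGISTRY.put("demo.ShoutHandler", page -> page.toUpperCase());
    }

    /** Splits the comma-separated config value, trimming blanks. */
    public static List<String> parseHandlerNames(String configValue) {
        List<String> names = new ArrayList<>();
        for (String part : configValue.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) names.add(name);
        }
        return names;
    }

    /** Runs every configured handler in order and appends their output. */
    public static String run(String configValue, String page) {
        StringBuilder content = new StringBuilder();
        for (String name : parseHandlerNames(configValue)) {
            Handler handler = REGISTRY.get(name);
            if (handler == null) {
                throw new IllegalArgumentException("Unknown handler: " + name);
            }
            content.append(handler.processDriver(page));
        }
        return content.toString();
    }

    public static void main(String[] args) {
        // Order in the config string decides order in the output.
        System.out.println(run("demo.ShoutHandler, demo.EchoHandler", "ab"));
    }
}
```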
[jira] [Commented] (NUTCH-2063) Add -mimeStats flag to FileDumper tool
[ https://issues.apache.org/jira/browse/NUTCH-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635706#comment-14635706 ] Michael Joyce commented on NUTCH-2063: -- Hey [~lewismc], threw a patch up for this. Let me know if you want to change something. Add -mimeStats flag to FileDumper tool -- Key: NUTCH-2063 URL: https://issues.apache.org/jira/browse/NUTCH-2063 Project: Nutch Issue Type: Bug Components: dumpers Affects Versions: 1.10 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.11 Attachments: nutch-2063-joyce-21July2015.patch Right now in order to get a MimeType distribution for any given number of segments, one is required to dump some data. This is a waste if one just wishes to see the mime type distribution across a number of segments. An improvement to the FileDumper tool would be the addition of a -mimeStats flag which would not attempt to dump any data but instead merely provide the total stats message providing insight into how the FileDumper should be best used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2063) Add -mimeStats flag to FileDumper tool
[ https://issues.apache.org/jira/browse/NUTCH-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Joyce updated NUTCH-2063: - Attachment: nutch-2063-joyce-21July2015.patch Add -mimeStats flag to FileDumper tool -- Key: NUTCH-2063 URL: https://issues.apache.org/jira/browse/NUTCH-2063 Project: Nutch Issue Type: Bug Components: dumpers Affects Versions: 1.10 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.11 Attachments: nutch-2063-joyce-21July2015.patch Right now in order to get a MimeType distribution for any given number of segments, one is required to dump some data. This is a waste if one just wishes to see the mime type distribution across a number of segments. An improvement to the FileDumper tool would be the addition of a -mimeStats flag which would not attempt to dump any data but instead merely provide the total stats message providing insight into how the FileDumper should be best used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
Michael Joyce created NUTCH-2062: Summary: Add Plugin for interacting with Selenium WebDriver Key: NUTCH-2062 URL: https://issues.apache.org/jira/browse/NUTCH-2062 Project: Nutch Issue Type: Improvement Components: plugin Affects Versions: 1.10 Reporter: Michael Joyce Fix For: 1.11 The protocol-selenium plugin is great for pulling webpages that dynamically load content. However, I've run into use cases where I need to actively interact with a page in Selenium before it becomes useful. For instance, I may need to paginate through a table to get all results that I'm interested in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633731#comment-14633731 ] Michael Joyce commented on NUTCH-2062: -- Hi folks, I have a work in progress locally for this. I'm working on making some more changes and should hopefully have something useful up soon for feedback. Add Plugin for interacting with Selenium WebDriver -- Key: NUTCH-2062 URL: https://issues.apache.org/jira/browse/NUTCH-2062 Project: Nutch Issue Type: Improvement Components: plugin Affects Versions: 1.10 Reporter: Michael Joyce Fix For: 1.11 The protocol-selenium plugin is great for pulling webpages that dynamically load content. However, I've run into use cases where I need to actively interact with a page in Selenium before it becomes useful. For instance, I may need to paginate through a table to get all results that I'm interested in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1504) Pluggable url partitioner
[ https://issues.apache.org/jira/browse/NUTCH-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599958#comment-14599958 ] Michael Joyce commented on NUTCH-1504: -- This is great stuff [~lewismc], we definitely need to get this in there. Would help us out a great deal. Pluggable url partitioner - Key: NUTCH-1504 URL: https://issues.apache.org/jira/browse/NUTCH-1504 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.6 Reporter: Sourajit Basak Assignee: Lewis John McGibbney Fix For: 1.11 Attachments: custom.partitioner.patch At present, the url partition logic is hard wired inside nutch core. It should be pluggable like FetchSchedule customized via nutch-site.xml. There might be use cases where a single domain needs to be partitioned on some custom logic. The existing UrlPartitioner cannot handle such cases. Hence the requirement. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
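To illustrate what "pluggable" could mean here, the sketch below puts the partitioning rule behind a small interface and selects an implementation by a configured name. It is an assumption for illustration only and does not reflect the attached patch: the `UrlPartitioner` interface shape, the strategy names `byHost` and `byFirstPathSegment`, and `forName` are all hypothetical. The second strategy shows the single-domain use case from the ticket, spreading one host across partitions by its first path segment.

```java
import java.net.URI;

public class PartitionerSketch {
    /** Hypothetical plug-in point: maps a URL to one of numPartitions buckets. */
    public interface UrlPartitioner {
        int getPartition(String url, int numPartitions);
    }

    /** Keeps all URLs of a host together. */
    public static final UrlPartitioner BY_HOST =
        (url, n) -> bucket(URI.create(url).getHost(), n);

    /** Custom rule: splits a single host by the first path segment. */
    public static final UrlPartitioner BY_FIRST_PATH_SEGMENT = (url, n) -> {
        URI uri = URI.create(url);
        String path = uri.getPath() == null ? "" : uri.getPath();
        String[] segments = path.split("/");
        String first = segments.length > 1 ? segments[1] : "";
        return bucket(uri.getHost() + "/" + first, n);
    };

    /** Non-negative bucket index for a key. */
    static int bucket(String key, int numPartitions) {
        return Math.floorMod(String.valueOf(key).hashCode(), numPartitions);
    }

    /** Picks a strategy by a (hypothetical) configured name. */
    public static UrlPartitioner forName(String name) {
        switch (name) {
            case "byHost": return BY_HOST;
            case "byFirstPathSegment": return BY_FIRST_PATH_SEGMENT;
            default: throw new IllegalArgumentException("Unknown partitioner: " + name);
        }
    }

    public static void main(String[] args) {
        UrlPartitioner p = forName("byFirstPathSegment");
        System.out.println(p.getPartition("http://example.com/a/1", 8)
            == p.getPartition("http://example.com/a/2", 8));
    }
}
```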
[jira] [Commented] (NUTCH-2045) index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time
[ https://issues.apache.org/jira/browse/NUTCH-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596832#comment-14596832 ] Michael Joyce commented on NUTCH-2045: -- +1 this is great index-basic incorrect assignment of next fetch time (page.getFetchTime()) as page fetch time Key: NUTCH-2045 URL: https://issues.apache.org/jira/browse/NUTCH-2045 Project: Nutch Issue Type: Bug Components: plugin Affects Versions: 2.3, 1.10 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.11, 2.3.1 Attachments: NUTCH-2045.patch The issue here was flagged up when using the indexer-elastic plugin, where the page fetch time is incorrectly assigned as the NEXT fetch time as opposed to the time at which the page was actually fetched (prevFetchTime). The ML thread for this issue can be found below http://www.mail-archive.com/user%40nutch.apache.org/msg13661.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2004) ParseChecker does not handle redirects
Michael Joyce created NUTCH-2004: Summary: ParseChecker does not handle redirects Key: NUTCH-2004 URL: https://issues.apache.org/jira/browse/NUTCH-2004 Project: Nutch Issue Type: Improvement Affects Versions: 1.9 Reporter: Michael Joyce Priority: Minor At the moment ParseChecker doesn't handle redirects. If it gets anything but a success status it errors out. It would be nice if it handled redirects a bit more gracefully based on the http.redirects config setting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2004) ParseChecker does not handle redirects
[ https://issues.apache.org/jira/browse/NUTCH-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520028#comment-14520028 ] Michael Joyce commented on NUTCH-2004: -- Hi folks, will try to get a patch thrown up shortly for this. ParseChecker does not handle redirects -- Key: NUTCH-2004 URL: https://issues.apache.org/jira/browse/NUTCH-2004 Project: Nutch Issue Type: Improvement Affects Versions: 1.9 Reporter: Michael Joyce Priority: Minor At the moment ParseChecker doesn't handle redirects. If it gets anything but a success status it errors out. It would be nice if it handled redirects a bit more gracefully based on the http.redirects config setting. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk
[ https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503746#comment-14503746 ] Michael Joyce commented on NUTCH-1934: -- Hey [~lewismc], Patch applied clean to trunk for me and simple crawl over one site worked just fine. Couldn't run the tests unfortunately since I seem to have some config problem locally, but hopefully that's a start at least. Refactor Fetcher in trunk - Key: NUTCH-1934 URL: https://issues.apache.org/jira/browse/NUTCH-1934 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.10 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Labels: memex Fix For: 1.11 Attachments: NUTCH-1934-trunkv2.patch, NUTCH-1934.patch Put simply [Fetcher|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java] is too big. This is kinda strange as the size of this file is unique (I think) from every other class within Nutch. The others are reasonably well modularized and split into constituent classes which make sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk
[ https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503727#comment-14503727 ] Michael Joyce commented on NUTCH-1934: -- One sec Lewis and I'll take a quick scope. Refactor Fetcher in trunk - Key: NUTCH-1934 URL: https://issues.apache.org/jira/browse/NUTCH-1934 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 1.10 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Labels: memex Fix For: 1.11 Attachments: NUTCH-1934-trunkv2.patch, NUTCH-1934.patch Put simply [Fetcher|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java] is too big. This is kinda strange as the size of this file is unique (I think) from every other class within Nutch. The others are reasonably well modularized and split into constituent classes which make sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503446#comment-14503446 ] Michael Joyce commented on NUTCH-1987: -- Hi folks, PR has been updated with the requested changes. If you have any questions or think anything else needs changing let me know. Make bin/crawl indexer agnostic --- Key: NUTCH-1987 URL: https://issues.apache.org/jira/browse/NUTCH-1987 Project: Nutch Issue Type: Improvement Affects Versions: 1.9 Reporter: Michael Joyce Labels: memex Fix For: 1.10 The crawl script makes it a bit challenging to use an indexer that isn't Solr. For instance, when I want to use the indexer-elastic plugin I still need to call the crawler script with a fake Solr URL, otherwise it will skip the indexing step altogether. {code} bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1 {code} It would be nice to keep configuration for the Solr indexer in the conf files (to mirror the elastic search indexer conf and others) and to make the indexing parameter simply toggle whether indexing does or doesn't occur instead of also trying to configure the indexer at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501674#comment-14501674 ]

Michael Joyce commented on NUTCH-1987:
--------------------------------------

Hey Chris,

Will do. I'll try to take a poke at updating this tomorrow/Monday when I have a bit of free time.

> Make bin/crawl indexer agnostic
> -------------------------------
>
>                 Key: NUTCH-1987
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1987
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>              Labels: memex
>             Fix For: 1.10
>
> The crawl script makes it a bit challenging to use an indexer that isn't Solr. For instance, when I want to use the indexer-elastic plugin I still need to call the crawler script with a fake Solr URL; otherwise it will skip the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ http://fakeurl.com:9200 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files (to mirror the Elasticsearch indexer conf and others) and to make the indexing parameter simply toggle whether indexing occurs, instead of also trying to configure the indexer at the same time.
[jira] [Updated] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings
[ https://issues.apache.org/jira/browse/NUTCH-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Joyce updated NUTCH-1986:
---------------------------------
    Labels: memex  (was: )

> Clarify Elastic Search Indexer Plugin Settings
> ----------------------------------------------
>
>                 Key: NUTCH-1986
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1986
>             Project: Nutch
>          Issue Type: Improvement
>          Components: documentation, indexer, plugin
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>              Labels: memex
>             Fix For: 1.10
>
> I was working on getting indexing into Elasticsearch working and realized that most of my difficulties came from misunderstanding what the config needed. A patch is incoming to clarify what is needed by default, explain what each option does, and add any helpful defaults.
[jira] [Commented] (NUTCH-1911) Improve DomainStatistics tool command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498689#comment-14498689 ]

Michael Joyce commented on NUTCH-1911:
--------------------------------------

Hey folks,

Here's what the output from this looks like:
{code}
Usage: DomainStatistics inputDirs outDir mode [numOfReducer]
    inputDirs        Comma separated list of crawldb input directories
                     E.g.: crawl/crawldb/current/
    outDir           Output directory where results should be dumped
    mode             Set statistics gathering mode
                         host    Gather statistics by host
                         domain  Gather statistics by domain
                         suffix  Gather statistics by suffix
                         tld     Gather statistics by top level directory
    [numOfReducers]  Optional number of reduce jobs to use. Defaults to 1.
{code}

> Improve DomainStatistics tool command line parsing
> --------------------------------------------------
>
>                 Key: NUTCH-1911
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1911
>             Project: Nutch
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 1.9, 2.2.1
>            Reporter: Lewis John McGibbney
>            Priority: Trivial
>             Fix For: 1.11
>
> The DomainStatistics tool could be improved based on the comments raised in [this mail thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html]. For convenience, I've also pasted them below:
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work.
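
The constraint quoted from the mail thread (point at the directory such as `current`, not at an individual `part-*` entry) can be checked before a job is launched. The following is only a sketch under stated assumptions: `check_input_dirs` is a hypothetical helper that is not part of Nutch, and it inspects the path strings only, without touching the filesystem or calling any Nutch tool.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for a comma separated inputDirs value.
# Not part of Nutch; it only inspects the path strings, not the filesystem.

check_input_dirs() {
  local rest="$1" entry name status=0
  while [ -n "$rest" ]; do
    # Split off the next comma separated entry.
    entry=${rest%%,*}
    if [ "$entry" = "$rest" ]; then
      rest=""
    else
      rest=${rest#*,}
    fi
    [ -z "$entry" ] && continue
    name=${entry%/}     # drop one trailing slash, if any
    name=${name##*/}    # keep only the last path component
    case "$name" in
      part-*) echo "reject: $entry"; status=1 ;;
      *)      echo "ok: $entry" ;;
    esac
  done
  return "$status"
}

check_input_dirs "crawl/crawldb/current/,crawl/crawldb/current/part-00000" \
  || echo "fix the rejected paths before running the job"
```

The function returns a non-zero status as soon as any entry ends in a `part-*` component, so a wrapper script could stop early instead of submitting a job that is bound to fail.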
[jira] [Updated] (NUTCH-1988) Make nested output directory dump optional
[ https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Joyce updated NUTCH-1988:
---------------------------------
    Labels: memex  (was: )

> Make nested output directory dump optional
> ------------------------------------------
>
>                 Key: NUTCH-1988
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1988
>             Project: Nutch
>          Issue Type: Improvement
>          Components: dumpers
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>            Priority: Minor
>              Labels: memex
>             Fix For: 1.10
>
> NUTCH-1957 added nested directories to the bin/nutch dump output to help avoid naming conflicts in output files. It would be nice to be able to specify that you want the older flat directory output as an optional parameter.
[jira] [Commented] (NUTCH-1906) Typo in CrawlDbReader command line help
[ https://issues.apache.org/jira/browse/NUTCH-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498573#comment-14498573 ]

Michael Joyce commented on NUTCH-1906:
--------------------------------------

Hi folks,

I'll throw a patch up shortly for this.

> Typo in CrawlDbReader command line help
> ---------------------------------------
>
>                 Key: NUTCH-1906
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1906
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Trivial
>             Fix For: 1.11
>
> Currently the CrawlDbReader tool, when invoked without any command line arguments, prints the following help:
> {code}
> [mdeploy@crawl local]$ ./bin/nutch readdb
> Usage: CrawlDbReader crawldb (-stats | -dump out_dir | -topN out_dir [min] | -url url)
>     crawldb    directory name where crawldb is located
>     -stats [-sort]    print overall statistics to System.out
>         [-sort]    list status sorted by host
>     -dump out_dir [-format normal|csv|crawldb]    dump the whole db to a text file in out_dir
>         [-format csv]    dump in Csv format
>         [-format normal]    dump in standard format (default option)
>         [-format crawldb]    dump as CrawlDB
>         [-regex expr]    filter records with expression
>         [-retry num]    minimum retry count
>         [-status status]    filter records by CrawlDatum status
>     -url url    print information on url to System.out
>     -topN out_dir [min]    dump top urls sorted by score to out_dir
>         [min]    skip records with scores below this value.
>             This can significantly improve performance.
> {code}
> The part that bothers me is
> {code}
>     -stats [-sort]    print overall statistics to System.out
>         [-sort]    list status sorted by host
> {code}
> The duplicated -sort is unnecessary. Having looked through the code, there is no other optional flag that we could substitute for the second one (I had thought it might be a placeholder for something else), so we can just remove it.
[jira] [Updated] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Joyce updated NUTCH-1987:
---------------------------------
    Labels: memex  (was: )

> Make bin/crawl indexer agnostic
> -------------------------------
>
>                 Key: NUTCH-1987
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1987
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>              Labels: memex
>             Fix For: 1.10
>
> The crawl script makes it a bit challenging to use an indexer that isn't Solr. For instance, when I want to use the indexer-elastic plugin I still need to call the crawler script with a fake Solr URL; otherwise it will skip the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ http://fakeurl.com:9200 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files (to mirror the Elasticsearch indexer conf and others) and to make the indexing parameter simply toggle whether indexing occurs, instead of also trying to configure the indexer at the same time.