> On Sept. 9, 2014, 11:40 p.m., Lewis McGibbney wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 101
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line101>
> >
> > When I change the Text() class to use the UTF8() class, I get the
> > following
> >
> > lmcgibbn@LMC-032857 /usr/local/trunk/runtime/local(master) $ ./bin/nutch org.apache.nutch.tools.FileDumper . /usr/local/trunk/src/testresources/testcrawl/segments/
> > 2014-09-09 16:02:21.339 java[3942:1903] Unable to load realm info from SCDynamicStore
> > Sep 09, 2014 4:02:21 PM org.apache.nutch.tools.FileDumper main
> > INFO: Processing segment: [/usr/local/trunk/src/testresources/testcrawl/segments/20060919213635]
> > Exception in thread "main" java.io.EOFException
> >     at java.io.DataInputStream.readFully(DataInputStream.java:197)
> >     at java.io.DataInputStream.readFully(DataInputStream.java:169)
> >     at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:99)
> >     at org.apache.nutch.protocol.Content.readFields(Content.java:154)
> >     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
> >     at org.apache.nutch.tools.FileDumper.main(FileDumper.java:101)
> >
> > UTF8 is of course deprecated now so we need to stick with Text and
> > implement the correct code.

Hey @Lewis, not sure if this is really an error or not. I grepped around all
the Nutch code, and also did a find -name for anything that references
testcrawl. No Nutch code in src/test or src/java references it. So I'm not
sure that we should be using old UTF8 (instead of Text) crawl dirs here. I
will go ahead and add some exception handling anyway and try to make it more
robust.

[chipotle:~/src/nutch/src] mattmann% grep -R "testcrawl" test
[chipotle:~/src/nutch/src] mattmann% grep -R "testcrawl" *
[chipotle:~/src/nutch/src] mattmann% find . -name "testcrawl" -print

- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52796
-----------------------------------------------------------


On Sept. 10, 2014, 3:15 a.m., Chris Mattmann wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9119/
> -----------------------------------------------------------
>
> (Updated Sept. 10, 2014, 3:15 a.m.)
>
>
> Review request for nutch.
>
>
> Bugs: NUTCH-1526
>     https://issues.apache.org/jira/browse/NUTCH-1526
>
>
> Repository: nutch
>
>
> Description
> -------
>
> Will contain the patch for the SegmentContentDumperTool described in
> NUTCH-1526:
>
> ./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
>    -segmentRootDir   full file path to the root segment directory, e.g.,
>                      crawl/segments
>    -regexUrlPattern  a regex URL pattern to select URL keys to dump from the
>                      content DB in each segment
>    -outputDir        The output directory to write file names to.
>    -metadata         --key=value where key is a Content Metadata key and
>                      value is a value to check.
>
>
> Diffs
> -----
>
>   ./trunk/src/java/org/apache/nutch/tools/FileDumper.java PRE-CREATION
>
> Diff: https://reviews.apache.org/r/9119/diff/
>
>
> Testing
> -------
>
> Testing it on DARPA XDATA XNET.
>
>
> Thanks,
>
> Chris Mattmann
>
>
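
For reference, here is a rough sketch of the kind of exception handling mentioned in the reply above. It is not the FileDumper patch and does not use the Nutch or Hadoop classes; RobustRecordReader and readRecords are hypothetical names, and the length-prefixed record layout is an assumption chosen only to reproduce the same failure seen in the stack trace (DataInputStream.readFully throwing EOFException on a short read). The idea is simply to catch the EOFException, report it, and keep whatever was read completely instead of letting the tool die.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: reads [int length][length bytes]
// records and tolerates a truncated record instead of propagating EOFException.
public class RobustRecordReader {

  // Returns every record that could be read completely from the stream.
  public static List<byte[]> readRecords(DataInputStream in) throws IOException {
    List<byte[]> records = new ArrayList<>();
    while (true) {
      int length;
      try {
        length = in.readInt();
      } catch (EOFException e) {
        break; // end of stream, or fewer than four bytes left for the prefix
      }
      if (length < 0) {
        System.err.println("Stopping: negative record length " + length);
        break;
      }
      byte[] buf = new byte[length];
      try {
        // readFully throws EOFException if fewer than 'length' bytes remain,
        // which is the call that failed in the stack trace above.
        in.readFully(buf);
      } catch (EOFException e) {
        System.err.println("Stopping: truncated record, expected " + length + " bytes");
        break;
      }
      records.add(buf);
    }
    return records;
  }

  public static void main(String[] args) throws IOException {
    // Build a small in-memory stream: one complete record, one truncated record.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(3);
    out.write(new byte[] {1, 2, 3});
    out.writeInt(10);
    out.write(new byte[] {4, 5}); // only 2 of the promised 10 bytes
    out.flush();

    List<byte[]> records = readRecords(
        new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    System.out.println("Complete records read: " + records.size());
  }
}
```

Whether the real tool should stop at the first bad record, as this sketch does, or skip to the next segment is a separate design call; the sketch only shows the catch-and-report shape.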