Re: Review Request 9119: Create SegmentContentDumperTool for easily extracting out file contents from SegmentDirs

Chris Mattmann Sat, 20 Sep 2014 11:00:33 -0700


> On Sept. 9, 2014, 11:40 p.m., Lewis McGibbney wrote:
> > ./trunk/src/java/org/apache/nutch/tools/FileDumper.java, line 101
> > <https://reviews.apache.org/r/9119/diff/1/?file=681989#file681989line101>
> >
> > >     When I change the Text() class to use the UTF8() class, I get the following:
> >     
> > >     lmcgibbn@LMC-032857 /usr/local/trunk/runtime/local(master) $ ./bin/nutch org.apache.nutch.tools.FileDumper . /usr/local/trunk/src/testresources/testcrawl/segments/
> > >     2014-09-09 16:02:21.339 java[3942:1903] Unable to load realm info from SCDynamicStore
> > >     Sep 09, 2014 4:02:21 PM org.apache.nutch.tools.FileDumper main
> > >     INFO: Processing segment: [/usr/local/trunk/src/testresources/testcrawl/segments/20060919213635]
> > >     Exception in thread "main" java.io.EOFException
> > >             at java.io.DataInputStream.readFully(DataInputStream.java:197)
> > >             at java.io.DataInputStream.readFully(DataInputStream.java:169)
> > >             at org.apache.nutch.protocol.Content.readFieldsCompressed(Content.java:99)
> > >             at org.apache.nutch.protocol.Content.readFields(Content.java:154)
> > >             at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
> > >             at org.apache.nutch.tools.FileDumper.main(FileDumper.java:101)
> >         
> > >     UTF8 is of course deprecated now, so we need to stick with Text and
> > > implement the correct code.


Hey @Lewis, I'm not sure this is really an error. I grepped through all the
Nutch code and also ran find -name for anything named testcrawl. No Nutch code
in src/test or src/java references it, so I'm not sure that we should be using
old UTF8 (instead of Text) crawl dirs here. I will go ahead and add some
exception handling anyway to make the tool more robust.

[chipotle:~/src/nutch/src] mattmann% grep -R "testcrawl" test
[chipotle:~/src/nutch/src] mattmann% grep -R "testcrawl" *
[chipotle:~/src/nutch/src] mattmann% find . -name "testcrawl" -print
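
A minimal sketch of the kind of exception handling mentioned above, not the
actual FileDumper patch. It shows a read loop that treats an EOFException from
DataInputStream.readFully (the call at the top of the quoted trace) as a signal
to stop and keep what was read, rather than letting it kill the run. The class
TolerantRecordReader, its methods, and the length-prefixed record layout are
hypothetical stand-ins for illustration; they are not Nutch or Hadoop API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a segment reader loop; not Nutch or Hadoop API.
class TolerantRecordReader {

  // Reads length-prefixed records until the stream ends. A record that is
  // cut short makes readFully throw EOFException; here that is reported and
  // reading stops, so the records read so far are still returned.
  static List<byte[]> readRecords(DataInputStream in) throws IOException {
    List<byte[]> records = new ArrayList<>();
    while (true) {
      try {
        int length = in.readInt();
        if (length < 0) {
          throw new IOException("Negative record length: " + length);
        }
        byte[] payload = new byte[length];
        in.readFully(payload);
        records.add(payload);
      } catch (EOFException e) {
        // Clean end of stream or a truncated record: keep what was read.
        System.err.println("Stopped after " + records.size() + " record(s)");
        return records;
      }
    }
  }

  // Builds two complete records followed by one truncated record.
  static byte[] sampleStream() throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    for (String s : new String[] {"first", "second"}) {
      byte[] payload = s.getBytes(StandardCharsets.UTF_8);
      out.writeInt(payload.length);
      out.write(payload);
    }
    out.writeInt(100);             // header claims 100 bytes
    out.write(new byte[] {1, 2});  // but only 2 bytes follow
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(sampleStream()));
    List<byte[]> records = readRecords(in);
    System.out.println("Recovered records: " + records.size());
  }
}
```

In the real tool the same try/catch would presumably wrap the SequenceFile
read at FileDumper.java line 101, but that placement is an assumption.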


- Chris


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9119/#review52796
-----------------------------------------------------------


On Sept. 10, 2014, 3:15 a.m., Chris Mattmann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9119/
> -----------------------------------------------------------
> 
> (Updated Sept. 10, 2014, 3:15 a.m.)
> 
> 
> Review request for nutch.
> 
> 
> Bugs: NUTCH-1526
>     https://issues.apache.org/jira/browse/NUTCH-1526
> 
> 
> Repository: nutch
> 
> 
> Description
> -------
> 
> > Will contain the patch for the SegmentContentDumperTool described in NUTCH-1526:
> 
> ./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
>    -segmentRootDir full file path to the root segment directory, e.g., crawl/segments
>    -regexUrlPattern a regex URL pattern to select URL keys to dump from the content DB in each segment
>    -outputDir The output directory to write file names to.
>    -metadata --key=value where key is a Content Metadata key and value is a value to check.
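
A rough sketch of how a -regexUrlPattern option like the one listed above
might select URL keys, using only java.util.regex. The class UrlKeyFilter and
its select method are hypothetical, and the choice of find() (match anywhere in
the URL) over matches() (match the whole URL) is an assumption; the request
does not say which semantics the tool uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the Nutch code base.
class UrlKeyFilter {

  // Returns the URL keys in which the pattern occurs anywhere (find()).
  static List<String> select(List<String> urlKeys, String regexUrlPattern) {
    Pattern pattern = Pattern.compile(regexUrlPattern);
    List<String> selected = new ArrayList<>();
    for (String url : urlKeys) {
      if (pattern.matcher(url).find()) {
        selected.add(url);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    // Made-up example keys; real keys would come from the segment content DB.
    List<String> keys = Arrays.asList(
        "http://example.com/report.pdf",
        "http://example.com/index.html");
    for (String url : select(keys, "\\.pdf$")) {
      System.out.println(url);
    }
  }
}
```

Compiling the pattern once and reusing it for every key avoids recompiling
the regex for each record in a segment.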
> 
> 
> Diffs
> -----
> 
>   ./trunk/src/java/org/apache/nutch/tools/FileDumper.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/9119/diff/
> 
> 
> Testing
> -------
> 
> Testing it on DARPA XDATA XNET.
> 
> 
> Thanks,
> 
> Chris Mattmann
> 
>
