I'm not sure how you implemented it, so it's hard to tell. You may want to take a look at SegmentReader's get and getMapRecords methods; they may give you some ideas. You can also call SegmentReader.get directly to get the segment data. However, it is slow: it calls sleep(5000) every time you call it, so slow that you definitely won't have results by tomorrow if you run it on your 50K-URL data set. Calling SegmentReader.get on all the segments at the same time from multiple threads can speed this up, but if you have a lot of segments (like me, more than 20), you will hit an OutOfMemory error even if you set the Java heap size to 4 GB (or even more). I am stuck here. T_T
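One thing that might help with the memory problem is to keep the multi-threading but cap how many segments are read at once, using a fixed-size thread pool instead of one thread per segment. Below is a minimal sketch of that idea only. The names `BoundedSegmentLookup`, `lookupAll`, `lookup` and `maxThreads` are made up for illustration and are not part of Nutch; `lookup` is a hypothetical stand-in for whatever code calls SegmentReader.get on one segment. Whether this is enough to avoid the OutOfMemory error depends on how much memory each lookup holds, so treat it as untested against real segments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class BoundedSegmentLookup {

    // Runs `lookup` once per segment, with at most `maxThreads` lookups
    // running at the same time. Results are returned in the same order
    // as the `segments` list.
    static <R> List<R> lookupAll(List<String> segments,
                                 Function<String, R> lookup,
                                 int maxThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String segment : segments) {
                // Hypothetical: in real code this task would wrap the
                // SegmentReader.get call for one segment.
                Callable<R> task = () -> lookup.apply(segment);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                // Blocks until that segment's lookup has finished and
                // rethrows its failure as an ExecutionException.
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder segment names and a dummy lookup, for illustration.
        List<String> segments = Arrays.asList("segment-a", "segment-b", "segment-c");
        List<Integer> lengths = lookupAll(segments, String::length, 2);
        System.out.println(lengths);
    }
}
```

With more than 20 segments, a small `maxThreads` (say 2 to 4) trades some speed for a lower peak number of segments being read at once; the right value would have to be found by experiment.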
Zhique

On Thu, Feb 26, 2015 at 11:54 AM, Ami Akshay Parikh <[email protected]> wrote:

> I am using the MapFileReader to iterate through the file. And I read the
> key into a Text object and the MetaData into a ParseData object. I get the
> following exception:
>
> Exception in thread "main" java.io.EOFException
>     at java.io.DataInputStream.readFully(DataInputStream.java:197)
>     at org.apache.hadoop.io.Text.readString(Text.java:402)
>     at org.apache.nutch.metadata.Metadata.readFields(Metadata.java:243)
>     at org.apache.nutch.parse.ParseData.readFields(ParseData.java:144)
>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1941)
>     at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
>     at NearDuplicates.main(NearDuplicates.java:58)
>
> Thanks,
>
> Regards,
> Ami Parikh
> (213)590-0005
>
> On Thu, Feb 26, 2015 at 11:00 AM, Renxia Wang <[email protected]> wrote:
>
>> Hi Ami,
>>
>> What method of what class do you use to get the meta data? Please provide
>> more info about this, log etc.
>>
>> Zhique
>>
>> On Thu, Feb 26, 2015 at 10:53 AM, Ami Akshay Parikh <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> When I try to use the parse_data from the segment directory for getting
>>> the MetaData for finding near duplicates, My code runs into a EOFException.
>>> I found something about a bug in nutch in the archives, but I wanted to
>>> know if anyone else is facing this problem and how can I possibly resolve
>>> it.
>>>
>>> Thanks,
>>>
>>> Regards,
>>> Ami Parikh
>>> (213)590-0005

