Yeah, I know about that. I just thought that since Parse_Data already does that for us, I did not want to do the same processing again. I will try to figure something out. Thanks a lot.

Regards,
Ami Parikh
(213)590-0005

On Thu, Feb 26, 2015 at 12:39 PM, Renxia Wang <[email protected]> wrote:

> Not sure how you implemented it, so it is hard to tell. You may want to take
> a look at the SegmentReader's get and getMapRecords methods. Those may give
> you ideas. You can also use SegmentReader.get directly to get the segment
> data. However, it is slow because it calls sleep(5000) every time you call
> it, so slow that you definitely cannot get the result by tomorrow by running
> it on your 50K-URL data set. Multi-threading the calls to SegmentReader.get
> across all the segments at the same time can speed this up, but if you have
> a lot of segments (like me, > 20), you will run into OutOfMemory issues, even
> if you set the Java heap size to 4 GB (or even more). I am stuck here. T_T
>
> Zhique
>
> On Thu, Feb 26, 2015 at 11:54 AM, Ami Akshay Parikh <[email protected]>
> wrote:
>
>> I am using MapFile.Reader to iterate through the file. I read the key
>> into a Text object and the metadata into a ParseData object. I get the
>> following exception:
>>
>> Exception in thread "main" java.io.EOFException
>>   at java.io.DataInputStream.readFully(DataInputStream.java:197)
>>   at org.apache.hadoop.io.Text.readString(Text.java:402)
>>   at org.apache.nutch.metadata.Metadata.readFields(Metadata.java:243)
>>   at org.apache.nutch.parse.ParseData.readFields(ParseData.java:144)
>>   at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1941)
>>   at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
>>   at NearDuplicates.main(NearDuplicates.java:58)
>>
>> Thanks,
>>
>> Regards,
>> Ami Parikh
>> (213)590-0005
>>
>> On Thu, Feb 26, 2015 at 11:00 AM, Renxia Wang <[email protected]> wrote:
>>
>>> Hi Ami,
>>>
>>> Which method of which class do you use to get the metadata? Please
>>> provide more info about this, such as logs.
>>>
>>> Zhique
>>>
>>> On Thu, Feb 26, 2015 at 10:53 AM, Ami Akshay Parikh <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> When I try to use the parse_data from the segment directory to get
>>>> the metadata for finding near duplicates, my code runs into an
>>>> EOFException. I found something about a bug in Nutch in the archives,
>>>> but I wanted to know if anyone else is facing this problem and how I
>>>> can possibly resolve it.
>>>>
>>>> Thanks,
>>>>
>>>> Regards,
>>>> Ami Parikh
>>>> (213)590-0005
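
On the multi-threading point quoted above: one possible way to keep memory in check is to cap how many segments are read at the same time instead of starting one thread per segment. The following is only a rough sketch under assumptions, not something from the thread or from Nutch itself. The class name BoundedSegmentReads, the readAll method, and the reader function are all hypothetical; reader stands in for whatever code ends up calling SegmentReader.get on one segment, and it is assumed to return a small per-segment summary rather than the full segment contents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class BoundedSegmentReads {

    // Applies `reader` to every segment path, running at most `maxThreads`
    // reads at once. Results come back in the same order as `segments`.
    // `reader` is a hypothetical stand-in for the per-segment read; it is
    // not part of the Nutch API.
    public static <R> List<R> readAll(List<String> segments, int maxThreads,
                                      Function<String, R> reader) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (String segment : segments) {
                // An explicit Callable avoids any overload ambiguity in submit().
                Callable<R> task = () -> reader.apply(segment);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                // get() blocks until that task is done and rethrows its failure.
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder segment names; the "read" here just measures the name
        // length so the example runs without Nutch or Hadoop on the classpath.
        List<String> segments = Arrays.asList("seg-a", "seg-bb", "seg-ccc");
        List<Integer> sizes = readAll(segments, 2, String::length);
        System.out.println(sizes);
    }
}
```

With more than 20 segments, a small maxThreads (say 2 to 4) trades some speed for a lower peak heap use. Whether that is enough to avoid the OutOfMemory error depends on how much data each read holds on to, which this sketch does not measure.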

