It seems the code in the Nutch trunk is older; I have an updated version and have sent it to you directly.

Amit K Singh wrote:
Thanks Dennis,
I get it; you mean that the big ARC file was not split and there was one map
per ARC file.

In the new code a single file can be split into multiple maps.

Also, I noted that in ArcInputFormat getSplits is not overridden, so how do you
make sure that an ARC file is not split? (The number-of-maps property in the
config?)

It really shouldn't matter unless you need all the maps from a given input file in, say, the same part-xxxxx output file, and a partitioner wouldn't work. You actually want the file to be able to be broken up so the job can scale properly.
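(For completeness, since the question was how to keep a file from being split: in Hadoop, whether a file may be split is controlled by FileInputFormat's isSplitable hook, which getSplits consults. A hypothetical sketch against the old `org.apache.hadoop.mapred` API; the class name is illustrative, and this needs the Hadoop jars on the classpath:)

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Illustrative sketch: forcing one map task per file by making the
// input format non-splittable. FileInputFormat.getSplits checks this
// hook before breaking a file into multiple splits.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false; // each file becomes exactly one split, hence one map task
  }
}
```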


Also, any pointers on the other two questions?
1) getSplits for TextInputFormat splits at arbitrary byte offsets, which might
leave a truncated line shared between two mappers. How and where in the source
code is that dealt with? Any pointers would be of great help.

The new code finds gzip boundaries and splits at them. It will actually scan forward from a split offset to find the next record; anything before that point is handled by a different map task, which reads a little past the end of its own split.
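The forward scan Dennis describes amounts to searching for the next gzip member header (magic bytes 0x1F 0x8B, followed by the deflate method byte 0x08). A standalone illustrative sketch, not Nutch's actual ArcRecordReader code:

```java
// Illustrative sketch: scanning forward from a split offset to the next
// gzip member boundary. Gzip members begin with the magic bytes 0x1F 0x8B
// followed by the deflate compression-method byte 0x08 (per RFC 1952).
public class GzipBoundaryScan {

  /** Returns the index of the first gzip member header at or after 'from',
   *  or -1 if none is found before the end of the buffer. */
  public static int findGzipBoundary(byte[] buf, int from) {
    for (int i = from; i <= buf.length - 3; i++) {
      if ((buf[i] & 0xFF) == 0x1F
          && (buf[i + 1] & 0xFF) == 0x8B
          && (buf[i + 2] & 0xFF) == 0x08) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    // Five junk bytes, then a gzip member header at offset 5.
    byte[] buf = {1, 2, 3, 4, 5, (byte) 0x1F, (byte) 0x8B, 0x08, 0, 0};
    System.out.println(findGzipBoundary(buf, 0)); // prints 5
    System.out.println(findGzipBoundary(buf, 6)); // prints -1 (no later header)
  }
}
```

A real reader would start its records at the boundary found this way and rely on the previous split's reader to finish any record straddling the split point.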

2) class Record is used for what purpose.

Record or RecordReader?

Dennis




Dennis Kubes-2 wrote:
We did something similar with the ARC format, where each record (webpage) is gzipped and then appended. It is not exactly the same, but it may help. Take a look at the following classes; they are in the Nutch trunk:

org.apache.nutch.tools.arc.ArcInputFormat
org.apache.nutch.tools.arc.ArcRecordReader

The way we did it, though, was to create an InputFormat and RecordReader that extended FileInputFormat and would read and uncompress the records on the fly. Unless your files are small, I would recommend going that route.
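The record-per-gzip-member layout Dennis describes can be demonstrated with plain java.util.zip: each record is gzipped independently and the members are appended, and a reader can stream through them without holding the whole file in memory. A minimal sketch under that assumption (class and method names are illustrative, not Nutch code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of the ARC-style layout: each record gzipped on its own,
// members concatenated into one "file".
public class AppendedGzipDemo {

  /** Gzip a single record. */
  public static byte[] gzip(String record) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(record.getBytes("UTF-8"));
    }
    return bos.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    // Build a "file": two independently gzipped records, appended.
    ByteArrayOutputStream file = new ByteArrayOutputStream();
    file.write(gzip("record one\n"));
    file.write(gzip("record two\n"));

    // GZIPInputStream reads concatenated gzip members transparently,
    // so streaming through the whole "file" yields both records' bytes
    // without ever materializing the uncompressed file in full.
    GZIPInputStream in =
        new GZIPInputStream(new ByteArrayInputStream(file.toByteArray()));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) > 0) out.write(buf, 0, n);
    System.out.print(out.toString("UTF-8")); // prints both records
  }
}
```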

Dennis

Amit Simgh wrote:
Hi,

I have thousands of webpages, each represented as a serialized tree object, compressed (ZLIB) together (file sizes varying from 2.5 GB to 4.5 GB).
I have to do some heavy text processing on these pages.

What is the best way to read/access these pages?

Method 1
***************
1) Write a custom splitter that:
   1. uncompresses the file (2.5 GB to 4 GB) and then parses it (time: around 10 minutes)
   2. splits the binary data into 10-20 parts
2) Implement specific readers to read a page and present it to the mapper

OR.

Method 2
***************
Read the entire file without splitting: one map task per file.
Implement specific readers to read a page and present it to the mapper.

Slight detour:
I was browsing through the code in FileInputFormat and TextInputFormat. In the getSplits method the file is broken at arbitrary byte boundaries. So in the case of TextInputFormat, what happens if the last line of a mapper's split is truncated (an incomplete byte sequence)? Is the truncated data lost or recovered? Can someone explain and give pointers to where and how in the code this recovery happens?
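For what it's worth, Hadoop's TextInputFormat resolves this in its record reader rather than in getSplits, by convention: every reader whose split does not start at byte 0 discards the first (possibly partial) line, and every reader finishes its last line even if that means reading past its split's end. Together these rules mean each line is read exactly once. A toy simulation of that convention (not the actual LineRecordReader code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the line-boundary convention used by Hadoop's line reader.
public class SplitLineDemo {

  /** Reads the lines belonging to the split [start, end]. A reader that
   *  does not start at byte 0 skips the first partial line (the previous
   *  split's reader owns it), and the last line may run past 'end'. */
  public static List<String> readSplit(byte[] data, int start, int end) {
    List<String> lines = new ArrayList<>();
    int pos = start;
    if (start != 0) {                       // skip the (possibly partial)
      while (pos < data.length && data[pos] != '\n') pos++;
      pos++;                                // first line; move past the '\n'
    }
    // Read whole lines as long as each line *starts* inside this split;
    // the final line is finished even if it crosses the split boundary.
    while (pos < data.length && pos <= end) {
      int lineStart = pos;
      while (pos < data.length && data[pos] != '\n') pos++;
      lines.add(new String(data, lineStart, pos - lineStart));
      pos++;
    }
    return lines;
  }

  public static void main(String[] args) {
    byte[] data = "aaa\nbbb\nccc\nddd".getBytes();
    // Split the "file" at an arbitrary byte (5, mid-"bbb");
    // no line is lost and none is read twice.
    System.out.println(readSplit(data, 0, 4));   // prints [aaa, bbb]
    System.out.println(readSplit(data, 5, 14));  // prints [ccc, ddd]
  }
}
```

So the truncated data is recovered, not lost: the reader owning the earlier split completes the straddling line, and the next reader deliberately skips it.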

I also saw classes like Record. What are these used for?


Regds
Amit S

