Ted, David - thanks for your replies. I thought Hadoop would
automatically split the file, but it isn't. The vectors file generated
by build-reuters.sh (using
org.apache.mahout.utils.vectors.lucene.Driver over the Lucene index)
comes out to around 8.8 MB. Perhaps that is too small and won't be
split, since it's below the HDFS block size - I'm using the default
64 MB. Perhaps a custom InputSplit/RecordReader is needed to split the
sequence file. I'll investigate further. If anyone has further
pointers or more info, please chime in.
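For what it's worth, FileInputFormat computes the split size as
max(minSplitSize, min(maxSplitSize, blockSize)), so an 8.8 MB file under a
64 MB block lands in a single split (hence one mapper). Lowering the max
split size (mapred.max.split.size) below the file size should force more
splits for an uncompressed SequenceFile. A quick sketch of the arithmetic
(the numbers are just illustrative):

```python
import math

def compute_split_size(block_size, min_split, max_split):
    # FileInputFormat's rule: max(minimumSize, min(maximumSize, blockSize))
    return max(min_split, min(max_split, block_size))

def num_splits(file_size, split_size):
    # Roughly ceil(fileSize / splitSize) splits for a splittable file
    return math.ceil(file_size / split_size)

MB = 1024 * 1024

# Default config: 64 MB block, effectively unbounded max split size.
default_split = compute_split_size(64 * MB, 1, 2**63 - 1)
print(num_splits(int(8.8 * MB), default_split))  # 1 split -> 1 mapper

# Cap the split size at 2 MB and the same file yields several splits.
small_split = compute_split_size(64 * MB, 1, 2 * MB)
print(num_splits(int(8.8 * MB), small_split))    # 5 splits
```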

Thanks,
Chad

> It should just happen if the file is large enough and the program is
> configured for more than one mapper task and the file type is correct.

> If you are reading an uncompressed sequence file you should be set.

> On Mon, Jan 11, 2010 at 9:53 PM, David Hall <[email protected]> wrote:

>>  I can brush up on my hadoop foo to figure out how to have
>> hadoop split up a single file, if you want.
>>

>--
>Ted Dunning, CTO
>DeepDyve
