Re: SequenceFile as map input

2010-07-09 Thread Alex Kozlov
Hi Alan, You don't need to do this complex trickery if you write to the Sequence File. How do you create the Sequence File? In your case it might make sense to create a Sequence File where the first object is the file name or complete path and the second is the content. Then you just call: pr

Re: SequenceFile as map input

2010-07-09 Thread Alan Miller
Hi Alex, My original files are ascii text. I was using and everything worked fine. Because my files are small (>2MB on avg.) I get one map task per file. For my test I had 2000 files, totalling 5GB and the whole run took approx 40 minutes. I read that I could improve performance by merging
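
As a rough sanity check on the figures in this message, the sketch below works out the average file size and the wall-clock time per file from the quoted totals (2000 files, 5GB, approx 40 minutes). The class and method names are hypothetical, and it assumes 1GB = 1024MB. The per-file time is only an amortized figure, since the message does not say how many map tasks ran in parallel.

```java
// Back-of-the-envelope figures taken from the message above.
// SmallFileStats and its method names are hypothetical, for illustration only.
final class SmallFileStats {

    // Average file size in MB, assuming 1 GB = 1024 MB.
    static double averageFileSizeMb(int fileCount, double totalGb) {
        return totalGb * 1024.0 / fileCount;
    }

    // Wall-clock seconds per file, amortized over the whole run
    // (ignores however many map tasks actually ran in parallel).
    static double secondsPerFile(int fileCount, double totalMinutes) {
        return totalMinutes * 60.0 / fileCount;
    }

    public static void main(String[] args) {
        int files = 2000;          // file count quoted in the message
        double totalGb = 5.0;      // total input size quoted in the message
        double minutes = 40.0;     // approximate run time quoted in the message

        System.out.printf("average file size: %.2f MB%n",
                averageFileSizeMb(files, totalGb));
        System.out.printf("amortized time per file: %.2f s%n",
                secondsPerFile(files, minutes));
    }
}
```

With one map task per file, every file of about 2.5MB costs its own task start-up, which is consistent with the small-file concern raised in this thread.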

Re: SequenceFile as map input

2010-07-08 Thread Alex Kozlov
Hi Alan, Is the content of the original file ascii text? Then you should be using signature. By default 'hadoop fs -text ...' will just call toString() on the object. You get the object itself in the map() method and can do whatever you want with it. If Text or BytesWritable does not work for

Re: SequenceFile as map input

2010-07-08 Thread Alan Miller
Hi Alex, I'm not sure what you mean. I already set my mapper's signature to: public class MyMapper extends Mapper { ... public void map(Text key, BytesWritable value, Context context) { ... } } In my map() loop the contents of value are the text from the original file and the value.

Re: SequenceFile as map input

2010-07-08 Thread Alex Loddengaard
Hi Alan, SequenceFiles keep track of the key and value type, so you should be able to use the Writables in the signature. Though it looks like you're using the new API, and I admit that I'm not an expert with the new API. Have you tried using the Writables in the signature? Alex On Thu, Jul 8,

SequenceFile as map input

2010-07-08 Thread Some Body
To get around the small-files problem (I have thousands of 2MB log files) I wrote a class to convert all my log files into a single SequenceFile in (Text key, BytesWritable value) format. That works fine. I can run this: hadoop fs -text /my.seq | grep peemt114.log | head -1 10/07/08 15:02: