Re: Java RMI and Hadoop RecordIO

2009-01-20 Thread Steve Loughran
David Alves wrote: Hi, I've been testing some different serialization techniques to go along with a research project. I know the motivation behind Hadoop's serialization mechanism (e.g. Writable), and the enhancement of this feature through Record I/O is not only performance, but also
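For context, a custom Writable is the hand-written serialization boilerplate that Record I/O was meant to generate. A minimal sketch against the Writable interface (the PointWritable type and its fields are hypothetical, not from this thread):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical example type; the fields are illustrative only.
public class PointWritable implements Writable {
  private int x;
  private int y;

  public void write(DataOutput out) throws IOException {
    // Fields go out in a fixed order with no per-field metadata,
    // which is where the compactness (and the fragility) comes from.
    out.writeInt(x);
    out.writeInt(y);
  }

  public void readFields(DataInput in) throws IOException {
    // Must read fields back in exactly the order write() used.
    x = in.readInt();
    y = in.readInt();
  }
}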

NLineInputFormat and very high number of maptasks

2009-01-20 Thread Saptarshi Guha
Hello, When I use NLineInputFormat and I output: System.out.println("mapred.map.tasks: " + jobConf.get("mapred.map.tasks")); I see 51, but on the jobtracker site the number is 18114. Yet with TextInputFormat it shows 51. I'm using Hadoop 0.19. Any ideas why? Regards Saptarshi

Re: NLineInputFormat and very high number of maptasks

2009-01-20 Thread Saptarshi Guha
Sorry, I see - every line is now a map task - one split, one task (in this case N=1 line per split). Is that correct? Saptarshi On Jan 20, 2009, at 11:39 AM, Saptarshi Guha wrote: Hello, When I use NLineInputFormat and I output: System

Class Not found error

2009-01-20 Thread Shyam Sarkar
Hi, I am following the instructions for the WordCount version 2 example, running on Hadoop installed under Cygwin. The Quick Start example worked fine, but WordCount version 2 gives the following error: java.lang.ClassNotFoundException: org.myorg.WordCount  at
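A frequent cause of this particular ClassNotFoundException is that the framework does not know which jar holds the driver class. A sketch of the JobConf idiom that registers the jar (setup details are elided and assumed to match the tutorial):

package org.myorg;

import org.apache.hadoop.mapred.JobConf;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // Passing the driver class tells Hadoop which jar to ship to the
    // task JVMs. If the class is missing from the jar named on the
    // 'hadoop jar' command line, the same exception appears.
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    // ... mapper/reducer/input/output setup as in the tutorial ...
  }
}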

Come join me on Open Source University Meetup

2009-01-20 Thread Vinayak Katkar
Open Source University Meetup: Hi all, Please join the Sun Microsystems Open Source University Meetup. It's a place to share your thoughts, express your feelings, create your blog, and start discussions on any open source technology. Thanks and Regards Vinayak Katkar Sun

Null Pointer with Pattern file

2009-01-20 Thread Shyam Sarkar
Hi, I was trying to run the Hadoop WordCount version 2 example under Cygwin. I tried it without the pattern.txt file and it works fine. When I try it with the pattern.txt file to skip some patterns, I get a NullPointerException as follows: 09/01/20 12:56:16 INFO jvm.JvmMetrics: Initializing JVM Metrics with
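In the tutorial's pattern, the skip file travels via the DistributedCache, and DistributedCache.getLocalCacheFiles() returns null (not an empty array) when nothing was cached, so a wrong path or a missed -skip option turns into an NPE in configure(). A sketch of both sides, with a hypothetical HDFS path:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class PatternsSetup {
  static void addPatterns(JobConf conf) throws Exception {
    // Ship the pattern file to every task node. The path below is
    // hypothetical; if it does not exist, the cache stays empty.
    DistributedCache.addCacheFile(new URI("/user/shyam/pattern.txt"), conf);
  }

  static Path[] readPatterns(JobConf conf) throws Exception {
    // Guard against the null return that produces the reported NPE.
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    return cached == null ? new Path[0] : cached;
  }
}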

streaming split sizes

2009-01-20 Thread Dmitry Pushkarev
Hi. I'm running streaming on a relatively big (2 TB) dataset, which Hadoop splits into 64 MB pieces. One of the problems I have with that is that my map tasks take a very long time to initialize (they need to load a 3 GB database into RAM) and then finish their 64 MB in 10 seconds. So
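One lever here is a larger minimum split size, so each expensive initialization is amortized over much more input. A sketch using the 0.19-era property name (streaming jobs can set the same property with -jobconf on the command line):

import org.apache.hadoop.mapred.JobConf;

public class SplitSizing {
  static void useBiggerSplits(JobConf conf) {
    // Ask for roughly 1 GB splits instead of one 64 MB HDFS block per
    // task, at the cost of coarser failure/retry granularity.
    conf.setLong("mapred.min.split.size", 1024L * 1024L * 1024L);
  }
}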

Re: streaming split sizes

2009-01-20 Thread Delip Rao
Hi Dmitry, Not a direct answer to your question, but I think the right approach would be not to load your database into memory during config() but instead to look it up from map() via HBase or something similar. That way you don't have to worry about the split sizes. In fact, using fewer
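The suggested shape, as a rough sketch; LookupClient is a hypothetical stand-in for an HBase table or similar remote store, and lookup.server is an assumed config key:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
    implements Mapper<Object, Text, Text, Text> {
  private LookupClient client; // hypothetical client type

  public void configure(JobConf job) {
    // Cheap connection setup instead of a 3 GB in-memory load.
    client = new LookupClient(job.get("lookup.server"));
  }

  public void map(Object key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // Per-record remote lookup; trades startup cost for query latency.
    out.collect(value, new Text(client.get(value.toString())));
  }
}

Whether this wins depends on the per-record lookup latency versus the amortized load time Dmitry describes.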

RE: streaming split sizes

2009-01-20 Thread Dmitry Pushkarev
Well, the database is specifically designed to fit into memory, and if it doesn't, it will slow things down hundreds of times. One simple hack I came up with is to replace the map tasks with /bin/cat and then run 150 reducers that keep the database constantly in memory. Parallelism is also not a problem, since
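The same hack expressed through the Java API, as a minimal sketch (the reducer itself is elided; the point is the identity map stage plus a fixed reducer count):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class CatThenReduce {
  static void configure(JobConf conf) {
    // Equivalent of -mapper /bin/cat in streaming: records pass
    // through unchanged, so all real work lands in the reducers.
    conf.setMapperClass(IdentityMapper.class);
    // Only 150 long-lived reduce tasks pay the database load cost.
    conf.setNumReduceTasks(150);
  }
}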

Re: NLineInputFormat and very high number of maptasks

2009-01-20 Thread Amareshwari Sriramadasu
Saptarshi Guha wrote: Sorry, I see - every line is now a map task - one split, one task (in this case N=1 line per split). Is that correct? Saptarshi You are right. NLineInputFormat splits N lines of input as one split, and each split is given to a map task. By default, N is 1. N can be configured
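A sketch of the usual configuration with the 0.19-era API (property name as read by NLineInputFormat at that time):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineSetup {
  static void useNLines(JobConf conf, int n) {
    conf.setInputFormat(NLineInputFormat.class);
    // Each split holds n lines, and each split becomes one map task,
    // which is why an 18114-line input yields 18114 map tasks at n=1.
    conf.setInt("mapred.line.input.format.linespermap", n);
  }
}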