Re: NLineInputFormat and very high number of maptasks

2009-01-20 Thread Amareshwari Sriramadasu

Saptarshi Guha wrote:
Sorry, I see - every line is now a map task: one split, one task (in
this case, N=1 line per split).

Is that correct?
Saptarshi

You are right. NLineInputFormat splits N lines of input into one split, and
each split is given to a map task.
By default, N is 1. N can be configured through the property
"mapred.line.input.format.linespermap".

On Jan 20, 2009, at 11:39 AM, Saptarshi Guha wrote:


Hello,
When I use NLineInputFormat and I print:
System.out.println("mapred.map.tasks:"+jobConf.get("mapred.map.tasks"));

Where are you printing this statement? It looks like the JobConf you are
looking at has not yet been set with the correct number of map tasks.
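
For what it's worth, a sketch of how to see the number the JobTracker will
actually use - in the 0.19 API, "mapred.map.tasks" is only a hint, and
getSplits() decides the real count:

  import java.io.IOException;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;

  public class SplitCount {
      // Prints how many splits the configured InputFormat will really
      // produce; the second argument to getSplits() is just a hint.
      public static void printSplits(JobConf conf) throws IOException {
          InputSplit[] splits =
              conf.getInputFormat().getSplits(conf, conf.getNumMapTasks());
          System.out.println("actual splits: " + splits.length);
      }
  }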

I see 51, but on the jobtracker site, the number is 18114. Yet with
TextInputFormat it shows 51.
I'm using Hadoop 0.19.

Any ideas why?
Regards
Saptarshi

--
Saptarshi Guha - saptarshi.g...@gmail.com


Saptarshi Guha | saptarshi.g...@gmail.com | 
http://www.stat.purdue.edu/~sguha

If the church put in half the time on covetousness that it does on lust,
this would be a better world.
-- Garrison Keillor, "Lake Wobegon Days"



-Amareshwari


RE: streaming split sizes

2009-01-20 Thread Dmitry Pushkarev
Well, the database is specifically designed to fit into memory, and if it
does not, it slows things down hundreds of times. One simple hack I came up
with is to replace the map tasks with /bin/cat and then run 150 reducers
that keep the database constantly in memory. Parallelism is also not a
problem, since we're running a very small cluster (15 nodes, 120 cores)
built specifically for this task.
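
A sketch of what that invocation might look like (the jar path, input/output
paths and lookup script are illustrative; the flags are standard streaming
options):

  # /bin/cat makes the map phase a pass-through; all the real work (and
  # the in-memory database) lives in the 150 reducers.
  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \
      -input /data/reads -output /data/hits \
      -mapper /bin/cat \
      -reducer ./lookup.pl -file ./lookup.pl \
      -numReduceTasks 150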

---
Dmitry Pushkarev
+1-650-644-8988

-Original Message-
From: Delip Rao [mailto:delip...@gmail.com] 
Sent: Tuesday, January 20, 2009 6:19 PM
To: core-user@hadoop.apache.org
Subject: Re: streaming split sizes

Hi Dmitry,

Not a direct answer to your question but I think the right approach
would be to not load your database into memory during config() but
instead lookup the database from map() via Hbase or something similar.
That way you don't have to worry about the split sizes. In fact using
fewer splits would limit the parallelism you can achieve, given that
your maps are so fast.

- delip



Re: streaming split sizes

2009-01-20 Thread Delip Rao
Hi Dmitry,

Not a direct answer to your question, but I think the right approach
would be to not load your database into memory during config() but
instead look up the database from map() via HBase or something similar.
That way you don't have to worry about the split sizes. In fact, using
fewer splits would limit the parallelism you can achieve, given that
your maps are so fast.

- delip
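
A sketch of the shape being suggested here, against the 0.19 mapred API:
configure() opens only a lightweight connection, and each map() call does a
remote lookup. KVStore and its methods are hypothetical stand-ins for an
HBase (or similar) client, not a real API:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class LookupMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
      private KVStore store;  // hypothetical remote-store client

      public void configure(JobConf job) {
          // A cheap connection handle - no 3 GB load per task.
          store = KVStore.connect(job.get("store.address"));
      }

      public void map(LongWritable key, Text line,
                      OutputCollector<Text, Text> out, Reporter reporter)
              throws IOException {
          String hit = store.get(line.toString());  // one remote lookup per record
          if (hit != null) {
              out.collect(line, new Text(hit));
          }
      }
  }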



streaming split sizes

2009-01-20 Thread Dmitry Pushkarev
Hi.

 

I'm running streaming on a relatively big (2 TB) dataset, which is being
split by Hadoop into 64 MB pieces. One of the problems I have with that is
that my map tasks take a very long time to initialize (they need to load a
3 GB database into RAM) and then finish their 64 MB in 10 seconds.

 

So I'm wondering if there is any way to make Hadoop give larger datasets to
map tasks. (The trivial way, of course, would be to split the dataset into N
files and feed one file at a time, but is there any standard solution for
that?)

 

Thanks.

 

---

Dmitry Pushkarev

+1-650-644-8988
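
For reference, one standard knob here is mapred.min.split.size:
FileInputFormat will not create splits smaller than it, so raising it well
above the block size yields fewer, larger map tasks. A sketch of a streaming
invocation (jar path and scripts illustrative; the property exists in 0.19):

  # ask for ~1 GB splits instead of one split per 64 MB block
  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \
      -input /data/in -output /data/out \
      -mapper ./map.pl -file ./map.pl \
      -reducer NONE \
      -jobconf mapred.min.split.size=1073741824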

 



Null Pointer with Pattern file

2009-01-20 Thread Shyam Sarkar
Hi,

I was trying to run the Hadoop WordCount version 2 example under Cygwin.
Without the pattern.txt file it works fine. When I use the pattern.txt file
to skip some patterns, I get a NullPointerException as follows:

09/01/20 12:56:16 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
09/01/20 12:56:17 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
09/01/20 12:56:17 INFO mapred.FileInputFormat: Total input paths to process : 4
09/01/20 12:56:17 INFO mapred.JobClient: Running job: job_local_0001
09/01/20 12:56:17 INFO mapred.FileInputFormat: Total input paths to process : 4
09/01/20 12:56:17 INFO mapred.MapTask: numReduceTasks: 1
09/01/20 12:56:17 INFO mapred.MapTask: io.sort.mb = 100
09/01/20 12:56:17 INFO mapred.MapTask: data buffer = 79691776/99614720
09/01/20 12:56:17 INFO mapred.MapTask: record buffer = 262144/327680
09/01/20 12:56:17 WARN mapred.LocalJobRunner: job_local_0001
java.lang.NullPointerException
 at org.myorg.WordCount$Map.configure(WordCount.java:39)
 at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
 at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
 at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
 at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
 at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
 at org.myorg.WordCount.run(WordCount.java:114)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.myorg.WordCount.main(WordCount.java:119)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
 at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)


Please tell me what I should do.

Thanks,
shyam.s.sar...@gmail.com
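
For reference: the log shows the job running under the LocalJobRunner
(job_local_0001), and in local mode DistributedCache.getLocalCacheFiles()
can come back null, which would throw exactly this NPE in configure(). A
defensive sketch of the tutorial Map class's configure() (parseSkipFile is
the tutorial's own helper; the null guard is the addition):

  import java.io.IOException;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;

  public void configure(JobConf job) {
      try {
          Path[] patternsFiles = DistributedCache.getLocalCacheFiles(job);
          if (patternsFiles == null) {
              // Nothing was localized (common under the local runner).
              System.err.println("No pattern files in the DistributedCache");
              return;
          }
          for (Path patternsFile : patternsFiles) {
              parseSkipFile(patternsFile);  // tutorial helper
          }
      } catch (IOException ioe) {
          System.err.println("Caught exception getting cached files: " + ioe);
      }
  }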


Come join me on Open Source University Meetup

2009-01-20 Thread Vinayak Katkar


Hi all,
Please join the Sun Microsystems Open Source University Meetup.
It's a place to share your thoughts, express your feelings,
create your blog, and start discussions on any open source technology.

Thanks and Regards
Vinayak Katkar
Sun Campus Ambassador
College of Engineering, Pune

Click the link below to join:
http://osum.sun.com/?xgi=7ogoTKe





Class Not found error

2009-01-20 Thread Shyam Sarkar
Hi,

I am following the instructions for running the WordCount version 2 example
on Hadoop installed under Cygwin. The quick-start example worked fine, but
WordCount version 2 is giving the following error:

java.lang.ClassNotFoundException: org.myorg.WordCount
 at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
 at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:247)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:158)
 at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

I have checked all the files in HDFS many times; they are all there.

Please help.

shyam_sar...@yahoo.com
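
For reference: this error usually means the job classes never made it into
the jar handed to bin/hadoop (the files in HDFS don't matter for class
loading). The tutorial's packaging steps look roughly like this (version
number and paths illustrative):

  mkdir wordcount_classes
  javac -classpath ${HADOOP_HOME}/hadoop-0.19.0-core.jar \
        -d wordcount_classes WordCount.java
  jar -cvf wordcount.jar -C wordcount_classes/ .
  # verify the class is really in the jar, then run it fully qualified
  jar -tf wordcount.jar | grep WordCount
  bin/hadoop jar wordcount.jar org.myorg.WordCount input output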






Re: NLineInputFormat and very high number of maptasks

2009-01-20 Thread Saptarshi Guha
Sorry, I see - every line is now a map task: one split, one task (in
this case, N=1 line per split).

Is that correct?
Saptarshi

On Jan 20, 2009, at 11:39 AM, Saptarshi Guha wrote:


Hello,
When I use NLineInputFormat and I print:

System.out.println("mapred.map.tasks:"+jobConf.get("mapred.map.tasks"));

I see 51, but on the jobtracker site, the number is 18114. Yet with
TextInputFormat it shows 51.
I'm using Hadoop 0.19.

Any ideas why?
Regards
Saptarshi

--
Saptarshi Guha - saptarshi.g...@gmail.com


Saptarshi Guha | saptarshi.g...@gmail.com | http://www.stat.purdue.edu/~sguha
If the church put in half the time on covetousness that it does on lust,
this would be a better world.
-- Garrison Keillor, "Lake Wobegon Days"



NLineInputFormat and very high number of maptasks

2009-01-20 Thread Saptarshi Guha
Hello,
When I use NLineInputFormat and I print:
System.out.println("mapred.map.tasks:"+jobConf.get("mapred.map.tasks"));
I see 51, but on the jobtracker site, the number is 18114. Yet with
TextInputFormat it shows 51.
I'm using Hadoop 0.19.

Any ideas why?
Regards
Saptarshi

-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Re: Java RMI and Hadoop RecordIO

2009-01-20 Thread Steve Loughran

David Alves wrote:

Hi
I've been testing some different serialization techniques as part of a
research project.
I know the motivation behind the Hadoop serialization mechanism (e.g.
Writable), and that the point of Record I/O is not only performance but
also control over the input/output.
Still, I've been running some simple tests, and I've found that plain
RMI beats Hadoop Record I/O almost every time (14-16% faster).
In my test I have a simple Java class with 14 int fields and 1 long
field, and I'm serializing around 35,000 instances.
Am I doing anything wrong? Are there ways to improve performance in
Record I/O? Have I got the use case wrong?

Regards

David Alves



- Any speedups are welcome; people are looking at ProtocolBuffers and
Thrift.
- Are you also measuring packet size and deserialization costs?
- Add a string or two.
- Add references to other instances.
- Then try pushing a few million round the network using the same
serialization stream instance (a sketch of such a harness follows below).
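
A sketch of the sort of harness those points suggest - one long-lived
stream, measuring both time and bytes. The record mirrors the class
described above (14 ints + 1 long); the raw DataOutput encoding is just a
stand-in for whichever serializer (Record I/O, RMI, Thrift) gets plugged in:

  import java.io.DataOutputStream;
  import java.io.IOException;
  import java.io.OutputStream;

  public class SerBench {
      // Mirrors the test record: 14 int fields and 1 long field.
      static class Rec {
          int[] f = new int[14];
          long g;
          void write(DataOutputStream out) throws IOException {
              for (int v : f) out.writeInt(v);
              out.writeLong(g);
          }
      }

      public static void main(String[] args) throws IOException {
          // Discard the bytes; DataOutputStream.size() still counts them.
          DataOutputStream out = new DataOutputStream(new OutputStream() {
              public void write(int b) {}
              public void write(byte[] b, int off, int len) {}
          });
          Rec r = new Rec();
          long t0 = System.nanoTime();
          for (int i = 0; i < 1000000; i++) {
              r.g = i;
              r.write(out);
          }
          long t1 = System.nanoTime();
          System.out.println("records=1000000 bytes=" + out.size()
                  + " ms=" + (t1 - t0) / 1000000);
      }
  }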



I do use RMI a lot at work. Once you come up with a plan to deal with
its brittleness against change (we keep the code in the cluster up to
date and make no guarantees about compatibility across versions), it is
easy to use. But it has so many, many problems, and if you hit one, the
code is deep in the JVM and very hard to deal with. One example: RMI
sends a graph over and likes to make sure it hasn't pushed a copy over
earlier, so the longer you keep a serialization stream up, the slower
it gets.


-steve