Re: optimization help needed

2010-03-17 Thread Reik Schatz
Thanks Gang, I will do some testing tomorrow - skip sending the whole XML,
maybe add some Reducers - and see where I end up.

Gang Luo wrote:
> Hi Reik,
> the number of reducers is not a hint (the number of mappers is a hint). The default 
> hash partitioner hashes each key and assigns the record to a reducer based on 
> hash(key) mod (number of reducers). If the values list is too large to fit into heap 
> memory, you will get an exception and the job will fail after several 
> attempts. You may need to increase the heap size for each task with 
> JobConf.set("mapred.child.java.opts", "-Xmx***m").
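A minimal sketch of both settings on the old 0.20 JobConf API (the reducer count and the -Xmx value below are just illustrative placeholders):

    JobConf conf = new JobConf(EmailCountingJob.class);
    // ask for more reduce tasks so the email keys can spread across the cluster
    conf.setNumReduceTasks(8);
    // raise the per-task child JVM heap, e.g. to 1024 MB
    conf.set("mapred.child.java.opts", "-Xmx1024m");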
>
> -Gang
>
>
>
>
> ----- Original Message -----
> From: Reik Schatz 
> To: "common-user@hadoop.apache.org" 
> Sent: 2010/3/17 (Wed) 10:13:45 AM
> Subject: Re: optimization help needed
>
> Very good input not to send the "original xml" over to the reducers. Regarding
> JobConf.setNumReduceTasks(n), isn't that just a hint, with the real
> number determined by the Partitioner I use, which will be
> the default HashPartitioner? One other thought I had: what will happen if
> the values list sent to a single Reducer is too big to fit into memory?
>
> /Reik
>
>
>
> Gang Luo wrote:
>   
>> Hi,
>> you can control the number of reducers with JobConf.setNumReduceTasks(n). The 
>> number of mappers is determined by (file size) / (split size). By default the 
>> split size is 64MB. Since your dataset is not very large, there should be no 
>> big difference if you change these. 
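(With the 500 MB file from the original mail and the 64 MB default split size, that works out to roughly 500 / 64 ≈ 8 splits, which matches the 8 mappers the framework chose.)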
>>
>> If you are only interested in the number of blocks per email address, you 
>> don't need to send the "original xml" as the value in the intermediate 
>> result. This reduces the amount of data sent from mappers to reducers. 
>> Using a combiner to pre-aggregate the data may also help.
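A minimal sketch of what that lighter intermediate record could look like with the old 0.20 API (EmailExtractor and EmailCountReducer are hypothetical names, not from this thread):

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class EmailMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      public void map(Text xmlBlock, Text value,
                      OutputCollector<Text, IntWritable> out, Reporter reporter)
          throws IOException {
        // emit only the email address and a 1, not the whole XML block
        String email = EmailExtractor.extract(xmlBlock.toString()); // hypothetical helper
        out.collect(new Text(email), ONE);
      }
    }

    // the counting reducer can then also be registered as the combiner:
    conf.setCombinerClass(EmailCountReducer.class);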
>>
>> -Gang
>>
>>
>>
>>
>> ----- Original Message -----
>> From: Reik Schatz 
>> To: "common-user@hadoop.apache.org" 
>> Sent: 2010/3/17 (Wed) 5:04:33 AM
>> Subject: optimization help needed
>>
>> Preparing a Hadoop presentation here. For demonstration I start up a 5 
>> machine m1.large cluster in EC2 via the cloudera scripts ($hadoop-ec2 
>> launch-cluster my-hadoop-cluster 5). Then I send a 500 MB XML file into 
>> HDFS. The Mapper receives an XML block as the key, selects an email address 
>> from the XML and uses this as the key for the reducer, with the original XML as 
>> the value. The Reducer just aggregates the number of XML blocks per email 
>> address.
>>
>> Running this on the cluster takes about 2:30 min. The framework uses 8 
>> Mappers (Spills) and 2 Reducers. About 600,000 XML elements are contained in 
>> the file. How can I speed up processing time? One thing I can think of is 
>> to have more than just 2 email addresses in the sample document, to be able 
>> to use more than 2 reducers in parallel. Why did the framework choose to use 
>> 8 mappers and not more? Maybe my sample data is too small to benefit from 
>> parallel processing. 
>> Thanks in advance
>>
>>
>>
>>  
>>  
>> 
>
>   

-- 

*Reik Schatz*
Technical Lead, Platform
P: +46 8 562 470 00
M: +46 76 25 29 872
F: +46 8 562 470 01
E: reik.sch...@bwin.org <mailto:reik.sch...@bwin.org>
*/bwin/* Games AB
Klarabergsviadukten 82,
111 64 Stockholm, Sweden




Re: optimization help needed

2010-03-17 Thread Reik Schatz
Very good input not to send the "original xml" over to the reducers. Regarding
JobConf.setNumReduceTasks(n), isn't that just a hint, with the real
number determined by the Partitioner I use, which will be
the default HashPartitioner? One other thought I had: what will happen if
the values list sent to a single Reducer is too big to fit into memory?

/Reik



Gang Luo wrote:
> Hi,
> you can control the number of reducers with JobConf.setNumReduceTasks(n). The 
> number of mappers is determined by (file size) / (split size). By default the 
> split size is 64MB. Since your dataset is not very large, there should be no 
> big difference if you change these. 
>
> If you are only interested in the number of blocks per email address, you 
> don't need to send the "original xml" as the value in the intermediate 
> result. This reduces the amount of data sent from mappers to reducers. Using 
> a combiner to pre-aggregate the data may also help.
>
> -Gang
>
>
>
>
> ----- Original Message -----
> From: Reik Schatz 
> To: "common-user@hadoop.apache.org" 
> Sent: 2010/3/17 (Wed) 5:04:33 AM
> Subject: optimization help needed
>
> Preparing a Hadoop presentation here. For demonstration I start up a 5 
> machine m1.large cluster in EC2 via the cloudera scripts ($hadoop-ec2 
> launch-cluster my-hadoop-cluster 5). Then I send a 500 MB XML file into 
> HDFS. The Mapper receives an XML block as the key, selects an email address 
> from the XML and uses this as the key for the reducer, with the original XML as 
> the value. The Reducer just aggregates the number of XML blocks per email 
> address.
>
> Running this on the cluster takes about 2:30 min. The framework uses 8 
> Mappers (Spills) and 2 Reducers. About 600,000 XML elements are contained in 
> the file. How can I speed up processing time? One thing I can think of is to 
> have more than just 2 email addresses in the sample document, to be able to 
> use more than 2 reducers in parallel. Why did the framework choose to use 8 
> mappers and not more? Maybe my sample data is too small to benefit from 
> parallel processing. 
> Thanks in advance
>
>
>
>   
>   




Re: Sqoop Installation on Apache Hadop 0.20.2

2010-03-17 Thread Reik Schatz
At least for MRUnit, I was not able to find it outside of the Cloudera 
distribution (CDH). What I did: install CDH locally using apt 
(Ubuntu), search for and copy the mrunit library into my local Maven 
repository, and remove CDH afterwards. I guess the same is possible 
for Sqoop.


/Reik

Utku Can Topçu wrote:

Dear All,

I'm trying to run tests using MySQL as a datasource, so I
thought Cloudera's Sqoop would be a nice project to have in production.
However, I'm not using Cloudera's Hadoop distribution right now, and
I'm not thinking of switching from the main project to a fork.

I read the documentation on Sqoop at
http://www.cloudera.com/developers/downloads/sqoop/ but there are actually
no links for downloading Sqoop itself.

Does anyone here know of, or has anyone tried to use, Sqoop with the latest Apache Hadoop?
If so, can you give me some tips and tricks on it?

Best Regards,
Utku
  






optimization help needed

2010-03-17 Thread Reik Schatz
Preparing a Hadoop presentation here. For demonstration I start up a 5 
machine m1.large cluster in EC2 via the cloudera scripts ($hadoop-ec2 
launch-cluster my-hadoop-cluster 5). Then I send a 500 MB XML file 
into HDFS. The Mapper receives an XML block as the key, selects an 
email address from the XML and uses this as the key for the reducer, 
with the original XML as the value. The Reducer just aggregates the 
number of XML blocks per email address.


Running this on the cluster takes about 2:30 min. The framework uses 8 
Mappers (Spills) and 2 Reducers. About 600,000 XML elements are 
contained in the file. How can I speed up processing time? One thing I 
can think of is to have more than just 2 email addresses in the sample 
document, to be able to use more than 2 reducers in parallel. Why did the 
framework choose to use 8 mappers and not more? Maybe my sample data is 
too small to benefit from parallel processing. 


Thanks in advance


Re: I want to group "similar" keys in the reducer.

2010-03-15 Thread Reik Schatz
I think what you do in that case is to write your own Partitioner class. 
The default partitioning is based on the hash value. See 
http://wiki.apache.org/hadoop/HadoopMapReduce
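A minimal sketch of such a Partitioner for the old 0.20 API, assuming (purely for illustration) that the keys which belong together share a common prefix like "KEY":

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class PrefixPartitioner implements Partitioner<Text, Writable> {
      public void configure(JobConf job) { }
      public int getPartition(Text key, Writable value, int numPartitions) {
        // partition on a shared prefix instead of the full key, so "KEY1"
        // and "KEY2" end up on the same reducer
        String s = key.toString();
        String prefix = s.length() >= 3 ? s.substring(0, 3) : s;
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

    // registered on the job with:
    //   conf.setPartitionerClass(PrefixPartitioner.class);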


Raymond Jennings III wrote:

Is it possible to override a method in the reducer so that similar keys will be grouped together?  
For example I want all keys of value "KEY1" and "KEY2" to be merged together.  (My 
reducer has a KEY of type Text.)  Thanks.


  
  






Re: using StreamInputFormat, StreamXmlRecordReader with your custom Jobs

2010-03-11 Thread Reik Schatz
Uh, do I have to copy the jar file manually into HDFS before I invoke 
the hadoop jar command starting my own job?




Utkarsh Agarwal wrote:


I think you can use DistributedCache to specify the location of the jar
after you have it in HDFS.
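A minimal sketch of that approach on 0.20 (the HDFS path below is just an example, and conf is the job's JobConf):

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    // after copying the jar into HDFS, e.g.
    //   hadoop fs -put hadoop-0.20.2-streaming.jar /libs/
    // add it to the task classpath:
    DistributedCache.addFileToClassPath(
        new Path("/libs/hadoop-0.20.2-streaming.jar"), conf);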

On Wed, Mar 10, 2010 at 6:11 AM, Reik Schatz  wrote:

  

Hi, I am playing around with version 0.20.2 of Hadoop. I have written and
packaged a Job using a custom Mapper and Reducer. The input format in my Job
is set to StreamInputFormat, and I am also setting the property stream.recordreader.class
to org.apache.hadoop.streaming.StreamXmlRecordReader.

This is how I want to start my job:
hadoop jar custom-1.0-SNAPSHOT.jar EmailCountingJob /input /output

The problem is that in this case all classes from
hadoop-0.20.2-streaming.jar are missing (ClassNotFoundException). I tried
using -libjars without luck.
hadoop jar -libjars PATH/hadoop-0.20.2-streaming.jar
custom-1.0-SNAPSHOT.jar EmailCountingJob /input /output

Is there any chance to use the streaming classes with your own Jobs without copying
these classes into your project and packaging them into your own jar?


/Reik




using StreamInputFormat, StreamXmlRecordReader with your custom Jobs

2010-03-10 Thread Reik Schatz
Hi, I am playing around with version 0.20.2 of Hadoop. I have written 
and packaged a Job using a custom Mapper and Reducer. The input format 
in my Job is set to StreamInputFormat, and I am also setting the property 
stream.recordreader.class to 
org.apache.hadoop.streaming.StreamXmlRecordReader.


This is how I want to start my job:
hadoop jar custom-1.0-SNAPSHOT.jar EmailCountingJob /input /output

The problem is that in this case all classes from 
hadoop-0.20.2-streaming.jar are missing (ClassNotFoundException). I 
tried using -libjars without luck.
hadoop jar -libjars PATH/hadoop-0.20.2-streaming.jar 
custom-1.0-SNAPSHOT.jar EmailCountingJob /input /output


Is there any chance to use the streaming classes with your own Jobs without copying 
these classes into your project and packaging them into your own jar?



/Reik
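
For what it's worth, a minimal driver sketch of one way to keep the streaming classes out of your own jar. The caveat with -libjars is that it is parsed by GenericOptionsParser, so it only takes effect when the main class runs through ToolRunner and the option is placed after the class name; the driver body below is just an illustration, not the code from this thread:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.streaming.StreamInputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class EmailCountingJob extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), EmailCountingJob.class);
        conf.setInputFormat(StreamInputFormat.class);
        conf.set("stream.recordreader.class",
                 "org.apache.hadoop.streaming.StreamXmlRecordReader");
        // StreamXmlRecordReader also needs the record delimiters, e.g.
        // (the tag name is just an example):
        // conf.set("stream.recordreader.begin", "<record>");
        // conf.set("stream.recordreader.end", "</record>");
        // Mapper/Reducer classes and output types would be set here as well.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        // ToolRunner lets GenericOptionsParser pick up -libjars from the arguments
        System.exit(ToolRunner.run(new EmailCountingJob(), args));
      }
    }

    // invoked with the generic options after the class name, e.g.:
    //   hadoop jar custom-1.0-SNAPSHOT.jar EmailCountingJob \
    //       -libjars PATH/hadoop-0.20.2-streaming.jar /input /output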