Re: Hadoop Python
One area I'm curious about is the requirement that any combiners in Streaming jobs be Java classes. Are there any plans to change this in the future? Prototyping streaming jobs in Python is great, and the ability to use a Python combiner would help performance a lot without needing to move to Java.

On Tue, May 19, 2009 at 4:30 PM, Amr Awadallah a...@cloudera.com wrote:

S d, it is totally fine to use Python streaming if it does the job you are after. There will be a slight performance hit, but that is noise assuming your cluster is a small one. If you are operating a large cluster continuously, then once your logic is stabilized using Python, it might make sense to convert/operationalize some jobs to Java (or C pipes) to improve performance, for the purpose of finishing quicker or reducing the number of servers needed. You should also take a look at Pig and Hive; they are both higher-level languages and very easy to learn:

http://www.cloudera.com/hadoop-training-pig-introduction
http://www.cloudera.com/hadoop-training-hive-introduction

-- amr

s d wrote:

Thanks. So in the overall scheme of things, what is the general feeling about using Python for this? I like the ease of deploying and reading Python compared with Java, but I want to make sure using Python over Hadoop is scalable and standard practice, and not something done only for prototyping and small-scale tests.

On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard a...@cloudera.com wrote:

Streaming is slightly slower than native Java jobs. Otherwise Python works great in streaming. Alex

On Tue, May 19, 2009 at 8:36 AM, s d s.d.sau...@gmail.com wrote:

Hi, how robust is using Hadoop with Python over the streaming protocol? Any disadvantages (performance? flexibility?)? It just strikes me that Python is so much more convenient when it comes to deploying and crunching text files. Thanks,

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
Re: Hadoop Python
Whoops, should have Googled it first. Looks like this is now fixed in trunk (HADOOP-4842). For people stuck on 18.3, a workaround appears to be appending something like

    | sort | sh combiner.sh

to the call of the mapper script (via Klaas Bosteels). Would be great to get this patched into distributions like EMR and Cloudera's.

On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch peter.skomor...@gmail.com wrote:
[...]

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
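To make the workaround concrete: since streaming execs the -mapper command directly rather than through a shell, the pipeline needs to live in a small wrapper script. A minimal sketch of what this might look like on 18.3 (all script names hypothetical, not a tested recipe):

    # mapper_with_combiner.sh -- runs the combiner inside the map task
    # by piping the mapper's sorted output through it
    python mapper.py | sort | python combiner.py

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.3-streaming.jar \
        -input /user/me/input \
        -output /user/me/output \
        -mapper "sh mapper_with_combiner.sh" \
        -reducer "python reducer.py" \
        -file mapper_with_combiner.sh -file mapper.py \
        -file combiner.py -file reducer.py

Note this only combines within a single map task's output, which is all a combiner is supposed to do anyway.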
Re: Hadoop Python
Direct link to HADOOP-4842: https://issues.apache.org/jira/browse/HADOOP-4842

On Tue, May 19, 2009 at 5:04 PM, Peter Skomoroch peter.skomor...@gmail.com wrote:
[...]

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
Re: sort example
1) It is doing an alphabetical sort by default. You can force Hadoop streaming to sort numerically with:

    -D mapred.text.key.comparator.options=-k2,2nr

See the section "A Useful Comparator Class" in the streaming docs:
http://hadoop.apache.org/core/docs/current/streaming.html
and https://issues.apache.org/jira/browse/HADOOP-2302

2) For the second issue, I think you will need to use 1 reducer to guarantee global sort order, or use another MR pass.

On Sun, May 17, 2009 at 12:14 AM, David Rio driodei...@gmail.com wrote:

BTW, basically this is the Unix equivalent of what I am trying to do:

    $ cat input_file.txt | sort -n

-drd

On Sat, May 16, 2009 at 11:10 PM, David Rio driodei...@gmail.com wrote:

Hi, I am trying to sort some data with Hadoop (streaming mode). The input looks like:

    $ cat small_numbers.txt
    9971681
    9686036
    2592322
    4518219
    1467363

To send my job to the cluster I use:

    hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
    -D mapred.reduce.tasks=2 \
    -D stream.num.map.output.key.fields=1 \
    -D mapred.text.key.comparator.options=-k1,1n \
    -input /input \
    -output /output \
    -mapper sort_mapper.rb \
    -file `pwd`/scripts_sort/sort_mapper.rb \
    -reducer sort_reducer.rb \
    -file `pwd`/scripts_sort/sort_reducer.rb

The mapper code basically writes key, value = input_line, input_line. The reducer just prints the keys from the standard input. In case you care:

    $ cat scripts_sort/sort_*
    #!/usr/bin/ruby
    STDIN.each_line { |l| puts "#{l.chomp}\t#{l.chomp}" }
    -
    #!/usr/bin/ruby
    STDIN.each_line { |line| puts line.split[0] }

I run the job and it completes without problems; the output looks like:

    d...@milhouse:~/tmp $ cat output/part-1
    1380664
    1467363
    32485
    3857847
    422538
    4354952
    4518219
    5719091
    7838358
    9686036

    d...@milhouse:~/tmp $ cat output/part-0
    1453024
    2592322
    3875994
    4689583
    5340522
    607354
    6447778
    6535495
    8647464
    9971681

These are my questions:

1. It seems the sorting (per reducer) is working, but I don't know why, for example, 607354 is not the first number in the output.
2. How can I tell Hadoop to send data to the reducers in such a way that keys(reduce1) < keys(reduce2) < ... < keys(reduceN)? That way I would ensure the data is fully sorted once the job is done.

I've also tried using the identity classes for the mapper and reducer, but the job dies generating exceptions about the input format. Can anyone show me, or point me to, some code showing how to properly perform sorting?

Thanks in advance,
-drd

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
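For this particular job, a sketch of a corrected invocation: per the "A Useful Comparator Class" section of the streaming docs, the comparator options only take effect when the job is also told to use KeyFieldBasedComparator, and a single reducer gives the global order asked about in question 2. Flags follow the 0.18/0.20-era docs; input/output paths are the ones from the original post.

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.0-streaming.jar \
      -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
      -D mapred.text.key.comparator.options=-k1,1n \
      -D mapred.reduce.tasks=1 \
      -input /input \
      -output /output \
      -mapper cat \
      -reducer cat

With one line per record and no tab, the whole line is the key, so identity mapper/reducer (cat) plus the numeric comparator yields one fully numerically sorted output file.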
Re: sort example
I just copied and pasted that comparator option from the docs; the -n part is what you want in this case.

On Sun, May 17, 2009 at 12:40 AM, Peter Skomoroch peter.skomor...@gmail.com wrote:
[...]

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
Re: How to get jobconf variables in streaming's mapper/reducer?
It took me a while to track this down; Todd is half right (at least for 18.3)... mapred.task.partition actually turns into $mapred_task_partition (note it is lowercase). For example, to get the filename in the mapper of a Python streaming job:

    import sys, os
    filename = os.environ["map_input_file"]
    taskpartition = os.environ["mapred_task_partition"]

filename will have the form:
hdfs://domU-12-31-38-01-6C-F1.compute-1.internal:9000/user/root/myinputs/gzpagecounts/pagecounts-20090501-030001.gz

See:
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200904.mbox/%3c49e13557.7090...@domaintools.com%3e
and
http://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/streaming/src/java/org/apache/hadoop/streaming/PipeMapRed.java

-Pete

On Fri, May 15, 2009 at 8:01 PM, Todd Lipcon t...@cloudera.com wrote:

Hi Steve, the variables are transformed before going to the mappers. mapred.task.partition turns into $MAPRED_TASK_PARTITION to be more unix-y. -Todd

On Fri, May 15, 2009 at 4:52 PM, Steve Gao steve@yahoo.com wrote:

I am using streaming with Perl, and I want to get jobconf variable values. Many tutorials say they are in the environment, but I cannot get them. For example, in the reducer:

    while (<STDIN>) {
        my $part = $ENV{"mapred.task.partition"};
        print "$part\n";
    }

It turns out that $ENV{"mapred.task.partition"} is not defined. HOWEVER, I can get the value of a variable I defined myself. For example:

    $HADOOP_HOME/bin/hadoop \
    jar $HADOOP_HOME/hadoop-streaming.jar \
    -input file1 \
    -output myOutputDir \
    -mapper mapper \
    -reducer reducer \
    -jobconf arg=test

In the reducer:

    while (<STDIN>) {
        my $part2 = $ENV{"arg"};
        print "$part2\n";
    }

It works. Anybody know why that is? How do I get jobconf variables in streaming? Thanks a lot!

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
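The general rule (visible in the PipeMapRed.java source linked above) is that streaming exports each jobconf entry as an environment variable with non-alphanumeric characters replaced by underscores. A small hypothetical Python helper built on that assumption:

    import os

    def jobconf(name, default=None):
        # Streaming maps '.' (and other non-alphanumerics) in jobconf keys
        # to '_', e.g. mapred.task.partition -> mapred_task_partition.
        return os.environ.get(name.replace(".", "_"), default)

    input_file = jobconf("map.input.file")
    partition = jobconf("mapred.task.partition")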
Fast upload of input data to S3?
Does anyone have upload performance numbers to share, or suggested utilities for uploading Hadoop input data to S3 for an EC2 cluster? I'm finding EBS volume transfer to HDFS via "put" to be extremely slow...

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
Re: Hadoop / MySQL
Thanks for sharing, sounds like a nice system. I always advise people to avoid direct SQL inserts for batch jobs / large amounts of data and use MySQL's optimized LOAD utility like you did. Same goes for Oracle... Nothing brings a DB server to its knees like a ton of individual inserts on indexed tables.

On Tue, Apr 28, 2009 at 6:46 AM, Ankur Goel ankur.g...@corp.aol.com wrote:

Hello Hadoop users, recently I had a chance to lead a team building a log-processing system that uses Hadoop and MySQL. The system's goal was to process the incoming information as quickly as possible (real time or near real time) and make it available for querying in MySQL. I thought it would be good to share the experience and the challenges with the community. Couldn't think of a better place than these mailing lists, as I am not much of a blogger :-)

The information flow in the system looks something like:

    [Apache-Servers] -> [Hadoop] -> [MySQL-shards] -> [Query-Tools]

Transferring from the Apache servers to Hadoop was quite easy, as we just had to organize the data in timely buckets (directories). Once that was running smoothly, we had to make sure that map-reduce jobs were fired at regular intervals and picked up the right data. The jobs would then process/aggregate the data and dump the info into MySQL shards from the reducers [we have our own DB partitioning set up]. This is where we hit major bottlenecks [any surprises? :-)]

The table engine used was InnoDB, as there was a need for fast replication and writes but only moderate reads (it should eventually support high read rates). The data would take quite a while to load completely, far away from being near real time. And so our optimization journey began.

1. We tried to optimize/tune InnoDB parameters, like increasing the buffer pool size to 75% of available RAM. This helped, but only as long as the DBs were lightly loaded, i.e. InnoDB had a sufficient buffer pool to host the data and indexes.

2. We also realized that InnoDB has considerable locking overhead, because of which write concurrency is really bad when you have a large number of concurrent threads doing writes. The default thread concurrency for us was set to no_of_cpu * 2 = 8, which is what the official documentation advises as the optimal limit. So we limited the number of reduce tasks, and consequently the number of concurrent writes, and boy, the performance improved 4x. We were almost there :-)

3. The next thing we tried was standard DB optimization techniques, like de-normalizing the schema and dropping constraints. This gave only a minor performance improvement, nothing earth-shattering. Note that we were already caching connections in the reducers to each MySQL shard, and the partitioning logic was embedded in the reducers.

4. Still falling short of our performance objectives, we finally decided to get rid of JDBC writes from the reducers and work on an alternative that uses MySQL's LOAD utility:
- The processing would partition the data into MySQL-shard-specific files resident in HDFS.
- A script would then spawn processes via ssh on different physical machines to download this data.
- Each spawned process just downloads the data for the shard it should upload to.
- All the processes then start uploading data in parallel into their respective MySQL shards using LOAD DATA INFILE.

This proved to be the fastest approach, even in the wake of increasing data loads. The entire processing/loading would complete in less than 6 min.
The system has been holding up quite well so far, even though we've had to limit the number of days for which we keep the data, or else the MySQL shards get overwhelmed. Hope this is helpful to people.

Regards,
-Ankur

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
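For readers who want the shape of step 4 in code, here is a minimal sketch of the parallel shard-load step, assuming hypothetical host names, file paths, and table; it is not Ankur's actual script:

    import subprocess

    # Map each (hypothetical) shard host to its HDFS file of shard-specific rows.
    shards = {
        "shard1.example.com": "/data/out/shard1.tsv",
        "shard2.example.com": "/data/out/shard2.tsv",
    }

    procs = []
    for host, hdfs_path in shards.items():
        # Each host pulls only its own shard's file, then bulk-loads it locally.
        # (LOAD DATA LOCAL INFILE may require local_infile to be enabled.)
        remote_cmd = (
            "hadoop fs -get %s /tmp/shard.tsv && "
            "mysql mydb -e \"LOAD DATA LOCAL INFILE '/tmp/shard.tsv' "
            "INTO TABLE events\" && rm /tmp/shard.tsv" % hdfs_path
        )
        procs.append(subprocess.Popen(["ssh", host, remote_cmd]))

    # All shards load in parallel; wait for every loader to finish.
    for p in procs:
        p.wait()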
Re: Hadoop and Matlab
If you can compile the Matlab code to an executable with the Matlab compiler and send it to the nodes with the distributed cache, that should work... You probably want to avoid licensing fees for running copies of Matlab itself on the cluster.

Sent from my iPhone

On Apr 21, 2009, at 1:55 PM, Sameer Tilak sameer.u...@gmail.com wrote:

Hi there, we're working on an image analysis project. The image processing code is written in Matlab. If I invoke that code from a shell script and then use that shell script within Hadoop streaming, will that work? Has anyone done something along these lines?

Many thanks,
--ST.
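To make the distributed-cache suggestion concrete, a streaming invocation along these lines should ship a compiled binary to every node (archive and script names are hypothetical):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -cacheArchive hdfs:///user/me/matlab_app.zip#matlab_app \
        -input /user/me/images \
        -output /user/me/results \
        -mapper "matlab_app/run_analysis.sh" \
        -reducer NONE

The -cacheArchive option unpacks the archive on each task node and symlinks it as matlab_app in the task's working directory, so the mapper script can call the compiled executable by relative path.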
Re: Hadoop streaming performance: elements vs. vectors
Amareshwari, thanks for the suggestion. Can you show a streaming jobconf that uses mapred.job.classpath.archives to add a custom combiner to the classpath? I've tried several variations, but the jar doesn't seem to get added to the classpath properly...

-Pete

On Mon, Apr 6, 2009 at 12:17 AM, Amareshwari Sriramadasu amar...@yahoo-inc.com wrote:

You can add your jar to the distributed cache and add it to the classpath by passing it in the configuration property mapred.job.classpath.archives.
-Amareshwari

Peter Skomoroch wrote:
[...]

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
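For reference, one of the variations attempted looks roughly like the sketch below (jar path and combiner class are hypothetical). Per the above, this did not reliably put the jar on the task classpath under 18.3, so treat it as an illustration of the attempt rather than a working recipe:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.3-streaming.jar \
        -jobconf mapred.cache.archives=hdfs:///user/me/mycombiner.jar \
        -jobconf mapred.job.classpath.archives=/user/me/mycombiner.jar \
        -combiner com.example.MyCombiner \
        -mapper mapper.py -reducer reducer.py \
        -input /user/me/input -output /user/me/output \
        -file mapper.py -file reducer.py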
Re: Amazon Elastic MapReduce
Intermediate results can be stored in HDFS on the EC2 machines, or in S3 using s3n... performance is better if you store on HDFS:

    "-input", "s3n://elasticmapreduce/samples/similarity/lastfm/input/",
    "-output", "hdfs:///home/hadoop/output2/",

On Mon, Apr 6, 2009 at 11:27 AM, Patrick A. patrickange...@gmail.com wrote:

Are intermediate results stored in S3 as well? Also, any plans to support HTable?

Chris K Wensel-2 wrote:

FYI, Amazon's new Hadoop offering: http://aws.amazon.com/elasticmapreduce/
And Cascading 1.0 supports it: http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html
cheers, ckw
--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
Re: Hadoop streaming performance: elements vs. vectors
If I need to use a custom streaming combiner jar in Hadoop 18.3, is there a way to add it to the classpath without the following patch?

https://issues.apache.org/jira/browse/HADOOP-3570
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200809.mbox/%3c48cf78e3.10...@yahoo-inc.com%3e

On Sat, Mar 28, 2009 at 2:28 PM, Peter Skomoroch peter.skomor...@gmail.com wrote:
[...]

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
Re: Amazon Elastic MapReduce
Kevin, the API accepts any arguments you can pass in the standard jobconf for Hadoop 18.3; it is pretty easy to convert an existing jobflow to a JSON job description that will run on the service.

-Pete

On Thu, Apr 2, 2009 at 2:44 PM, Kevin Peterson kpeter...@biz360.com wrote:

So if I understand correctly, this is an automated system to bring up a Hadoop cluster on EC2, import some data from S3, run a job flow, write the data back to S3, and bring down the cluster? This seems like a pretty good deal. At the pricing they are offering, unless I'm able to keep a cluster at more than about 80% capacity 24/7, it'll be cheaper to use this new service. Does this use an existing Hadoop job control API, or do I need to write my flows to conform to Amazon's API?

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
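As a rough illustration of the conversion, a single streaming step in an EMR job description looks something like the following (bucket names hypothetical; field names per the early Elastic MapReduce job-flow format, so double-check against Amazon's current docs):

    {
      "Name": "My streaming step",
      "ActionOnFailure": "TERMINATE_JOB_FLOW",
      "HadoopJarStep": {
        "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
        "Args": [
          "-input", "s3n://mybucket/input/",
          "-output", "hdfs:///home/hadoop/output/",
          "-mapper", "mapper.py",
          "-reducer", "reducer.py"
        ]
      }
    }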
Hadoop streaming performance: elements vs. vectors
Hadoop streaming question: if I am forming a matrix M by summing a number of elements generated on different mappers, is it better to emit tons of lines from the mappers with small key,value pairs for each element, or should I group them into row vectors before sending to the reducers?

For example, say I'm summing frequency count matrices M for each user on a different map task, and the reducer combines the resulting sparse user count matrices for use in another calculation. Should I emit the individual elements:

    i (j, Mij) \n
    3 (1, 3.4) \n
    3 (2, 3.4) \n
    3 (3, 3.4) \n
    4 (1, 2.3) \n
    4 (2, 5.2) \n

Or posting-list style vectors?

    3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
    4 ((1, 2.3), (2, 5.2)) \n

Using vectors will at least save some message space, but are there any other benefits to this approach in terms of Hadoop streaming overhead (sorts etc.)? I think buffering issues will not be a huge concern, since the lengths of the vectors have a reasonable upper bound and will be in a sparse format...

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
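A minimal Python sketch of the two emission styles under discussion, assuming tab-separated streaming output and a hypothetical in-memory counts dict:

    import sys
    from collections import defaultdict

    def emit_elements(counts):
        # One output line per matrix element: key = row i, value = "(j, Mij)".
        for (i, j), v in sorted(counts.items()):
            sys.stdout.write("%d\t(%d, %s)\n" % (i, j, v))

    def emit_vectors(counts):
        # One output line per row: key = row i, value = posting-list vector.
        rows = defaultdict(list)
        for (i, j), v in sorted(counts.items()):
            rows[i].append("(%d, %s)" % (j, v))
        for i, postings in rows.items():
            sys.stdout.write("%d\t(%s)\n" % (i, ", ".join(postings)))

    if __name__ == "__main__":
        counts = {(3, 1): 3.4, (3, 2): 3.4, (3, 3): 3.4,
                  (4, 1): 2.3, (4, 2): 5.2}
        emit_vectors(counts)  # or emit_elements(counts)

Either way the reducer groups on the row key; the vector style just shifts the grouping work into the mapper (or a combiner) before the shuffle.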
Re: Hadoop streaming performance: elements vs. vectors
Paco, thanks, good ideas on the combiner. I'm going to tweak things a bit as you suggest and report back later... -Pete

On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN cet...@gmail.com wrote:

hi peter, thinking aloud on this - trade-offs may depend on:

* how much grouping would be possible (tracking a PDF would be interesting for metrics)
* locality of key/value pairs (distributed among mapper and reducer tasks)

to that point, will there be much time spent in the shuffle? if so, it's probably cheaper to shuffle/sort the grouped row vectors than the many small key/value pairs.

in any case, when i had a similar situation on a large data set (2-3 TB shuffle) a good pattern to follow was:

* mapper emitted small key/value pairs
* combiner grouped into row vectors

that combiner may get invoked both at the end of the map phase and at the beginning of the reduce phase (more benefit). also, using byte arrays if possible to represent values may save much shuffle time.

best, paco

On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch peter.skomor...@gmail.com wrote:
[...]

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
Re: Iterative feedback in map reduce....
Check out the EM example in NLTK:
http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk_contrib/hadoop/EM/runStreaming.py

On Fri, Mar 27, 2009 at 5:19 PM, Sid123 itis...@gmail.com wrote:

Hi, I have to design an iterative algorithm where each iteration is an M-R cycle that calculates a parameter and has to feed it back to all the maps in the next iteration. In the reduce procedure I need to just sum everything from the map procedure (many similar-size matrices) into a single matrix (of the same size as each), irrespective of the key. This single matrix is the parameter I was talking about earlier. Note that this parameter MUST BE global to all map processes. I want to know:

1) How do I collect all the values into one single parameter? Do I need to write it to the file system, or can I keep it in memory? I feel that I WILL have to write it to HDFS somewhere...

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
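A minimal driver sketch for this feed-back-a-global-parameter pattern (script names and paths hypothetical): each iteration's reducers write the summed matrix to HDFS, and the driver pulls it back out and ships it to the next round's mappers via -file, so every map task can read it from its working directory.

    import subprocess

    param_local = "param.txt"  # summed matrix from the previous iteration
    for i in range(10):
        out = "/user/me/em/iter%d" % i
        subprocess.check_call([
            "hadoop", "jar", "hadoop-streaming.jar",
            "-input", "/user/me/data",
            "-output", out,
            "-mapper", "em_map.py",
            "-reducer", "em_reduce.py",
            "-file", "em_map.py", "-file", "em_reduce.py",
            # ship the current global parameter to every map task;
            # em_map.py reads param.txt from its working directory
            "-file", param_local,
        ])
        # pull the new global parameter back out of HDFS for the next round
        subprocess.check_call(["hadoop", "fs", "-getmerge", out, param_local])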