Re: Hadoop Python

2009-05-19 Thread Peter Skomoroch
One area I'm curious about is the requirement that any combiners in
Streaming jobs be Java classes.  Are there any plans to change this in the
future?  Prototyping streaming jobs in Python is great, and the ability to
use a Python combiner would help performance a lot without needing to move
to Java.



On Tue, May 19, 2009 at 4:30 PM, Amr Awadallah a...@cloudera.com wrote:

 S d,

  It is totally fine to use Python streaming if it does the job you are
 after, there will be a slight performance hit, but that is noise assuming
 your cluster is a small one. If you are operating a large cluster
 continuously, then once your logic is stabilized using Python it might make
 sense to convert/operationalize some jobs to Java (or C pipes) to improve
 performance for purpose of finishing quicker or reducing number of servers
 needed.

  You should also take a look at PIG and Hive, they are both higher level
 languages and very easy to learn:

 http://www.cloudera.com/hadoop-training-pig-introduction

 http://www.cloudera.com/hadoop-training-hive-introduction

 -- amr


 s d wrote:

 Thanks.
 So in the overall scheme of things, what is the general feeling about using
 python for this? I like the ease of deploying and reading python compared
 with Java but want to make sure using python over hadoop is scalable and is
 standard practice and not something done only for prototyping and small
 scale tests.


 On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard a...@cloudera.com
 wrote:



 Streaming is slightly slower than native Java jobs.  Otherwise Python
 works
 great in streaming.

 Alex

 On Tue, May 19, 2009 at 8:36 AM, s d s.d.sau...@gmail.com wrote:



 Hi,
 How robust is using hadoop with python over the streaming protocol? Any
 disadvantages (performance? flexibility?)?  It just strikes me that python
 is so much more convenient when it comes to deploying and crunching text
 files.
 Thanks,









-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Re: Hadoop Python

2009-05-19 Thread Peter Skomoroch
Whoops, should have googled it first.  Looks like this is now fixed in
trunk, HADOOP-4842.  For people stuck using 18.3, a workaround appears to be
appending something like "| sort | sh combiner.sh" to the mapper command
(via Klaas Bosteels).
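
For anyone trying that, here is a minimal sketch of what the combiner script
in such a pipeline could look like in Python 2 (the word-count-style logic and
the name combiner.py are placeholders of mine, not from Klaas's example):

#!/usr/bin/env python
# combiner.py -- minimal sketch of a streaming-style combiner (Python 2).
# Assumes tab-separated "key<TAB>count" lines arrive already sorted by key,
# e.g. via a mapper command like:  mapper.py | sort | combiner.py
import sys

current_key = None
total = 0

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    count = int(value or 1)
    if key == current_key:
        total += count
    else:
        if current_key is not None:
            print "%s\t%d" % (current_key, total)
        current_key, total = key, count

if current_key is not None:
    print "%s\t%d" % (current_key, total)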

Would be great to get this patched into distributions like EMR and Cloudera

On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch
peter.skomor...@gmail.com wrote:

 One area I'm curious about is the requirement that any combiners in
 Streaming jobs be java classes.  Are there any plans to change this in the
 future?  Prototyping streaming jobs in Python is great, and the ability to
 use a Python combiner would help performance a lot without needing to move
 to Java.




 On Tue, May 19, 2009 at 4:30 PM, Amr Awadallah a...@cloudera.com wrote:

 S d,

  It is totally fine to use Python streaming if it does the job you are
 after, there will be a slight performance hit, but that is noise assuming
 your cluster is a small one. If you are operating a large cluster
 continuously, then once your logic is stabilized using Python it might make
 sense to convert/operationalize some jobs to Java (or C pipes) to improve
 performance for purpose of finishing quicker or reducing number of servers
 needed.

  You should also take a look at PIG and Hive, they are both higher level
 languages and very easy to learn:

 http://www.cloudera.com/hadoop-training-pig-introduction

 http://www.cloudera.com/hadoop-training-hive-introduction

 -- amr


 s d wrote:

 Thanks.
 So in the overall scheme of things, what is the general feeling about
 using
 python for this? I like the ease of deploying and reading python compared
 with Java but want to make sure using python over hadoop is scalable and is
 standard practice and not something done only for prototyping and small
 scale tests.


 On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard a...@cloudera.com
 wrote:



 Streaming is slightly slower than native Java jobs.  Otherwise Python
 works
 great in streaming.

 Alex

 On Tue, May 19, 2009 at 8:36 AM, s d s.d.sau...@gmail.com wrote:



 Hi,
 How robust is using hadoop with python over the streaming protocol? Any
 disadvantages (performance? flexibility?)?  It just strikes me that python
 is so much more convenient when it comes to deploying and crunching text
 files.
 Thanks,









 --
 Peter N. Skomoroch
 617.285.8348
 http://www.datawrangling.com
 http://delicious.com/pskomoroch
 http://twitter.com/peteskomoroch




-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Re: Hadoop Python

2009-05-19 Thread Peter Skomoroch
Direct link to HADOOP-4842:

https://issues.apache.org/jira/browse/HADOOP-4842

On Tue, May 19, 2009 at 5:04 PM, Peter Skomoroch
peter.skomor...@gmail.com wrote:

 Whoops, should have googled it first.  Looks like this is now fixed in
 trunk, HADOOP-4842.  For people stuck using 18.3, a workaround appears to be
 adding something like | sort | sh combiner.sh to the call of the mapper
 script (via Klaas Bosteels)

 Would be great to get this patched into distributions like EMR and Cloudera


 On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch 
 peter.skomor...@gmail.com wrote:

 One area I'm curious about is the requirement that any combiners in
 Streaming jobs be java classes.  Are there any plans to change this in the
 future?  Prototyping streaming jobs in Python is great, and the ability to
 use a Python combiner would help performance a lot without needing to move
 to Java.




 On Tue, May 19, 2009 at 4:30 PM, Amr Awadallah a...@cloudera.com wrote:

 S d,

  It is totally fine to use Python streaming if it does the job you are
 after, there will be a slight performance hit, but that is noise assuming
 your cluster is a small one. If you are operating a large cluster
 continuously, then once your logic is stabilized using Python it might make
 sense to convert/operationalize some jobs to Java (or C pipes) to improve
 performance for purpose of finishing quicker or reducing number of servers
 needed.

  You should also take a look at PIG and Hive, they are both higher level
 languages and very easy to learn:

 http://www.cloudera.com/hadoop-training-pig-introduction

 http://www.cloudera.com/hadoop-training-hive-introduction

 -- amr


 s d wrote:

 Thanks.
 So in the overall scheme of things, what is the general feeling about
 using
 python for this? I like the ease of deploying and reading python
 compared
 with Java but want to make sure using python over hadoop is scalable and is
 standard practice and not something done only for prototyping and small
 scale tests.


 On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard a...@cloudera.com
 wrote:



 Streaming is slightly slower than native Java jobs.  Otherwise Python
 works
 great in streaming.

 Alex

 On Tue, May 19, 2009 at 8:36 AM, s d s.d.sau...@gmail.com wrote:



 Hi,
 How robust is using hadoop with python over the streaming protocol? Any
 disadvantages (performance? flexibility?)?  It just strikes me that python
 is so much more convenient when it comes to deploying and crunching text
 files.
 Thanks,









 --
 Peter N. Skomoroch
 617.285.8348
 http://www.datawrangling.com
 http://delicious.com/pskomoroch
 http://twitter.com/peteskomoroch




 --
 Peter N. Skomoroch
 617.285.8348
 http://www.datawrangling.com
 http://delicious.com/pskomoroch
 http://twitter.com/peteskomoroch




-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Re: sort example

2009-05-16 Thread Peter Skomoroch
1) It is doing an alphabetical sort by default; you can force Hadoop streaming
to sort numerically with:

-D mapred.text.key.comparator.options=-k2,2nr \

see the section "A Useful Comparator Class" in the streaming docs:

http://hadoop.apache.org/core/docs/current/streaming.html
and https://issues.apache.org/jira/browse/HADOOP-2302
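
As a quick illustration of the difference (not part of the original job), here
is string vs. numeric ordering in Python 2 on a few of the values from the
quoted output below:

# Hadoop's default text sort is lexicographic, not numeric.
nums = ["607354", "5340522", "9971681", "1453024"]
print sorted(nums)            # ['1453024', '5340522', '607354', '9971681']
print sorted(nums, key=int)   # ['607354', '1453024', '5340522', '9971681']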

2) For the second issue, I think you will need to use 1 reducer to guarantee
global sort order or use another MR pass.


On Sun, May 17, 2009 at 12:14 AM, David Rio driodei...@gmail.com wrote:

 BTW,
 Basically, this is the unix equivalent to what I am trying to do:
 $ cat input_file.txt | sort -n
 -drd

 On Sat, May 16, 2009 at 11:10 PM, David Rio driodei...@gmail.com wrote:

  Hi,
  I am trying to sort some data with hadoop(streaming mode). The input
looks
  like:
   $ cat small_numbers.txt
  9971681
  9686036
  2592322
  4518219
  1467363
 
  To send my job to the cluster I use:
  hadoop jar
  /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \
  -D mapred.reduce.tasks=2 \
  -D stream.num.map.output.key.fields=1 \
  -D mapred.text.key.comparator.options=-k1,1n \
  -input /input \
  -output /output \
  -mapper sort_mapper.rb \
  -file `pwd`/scripts_sort/sort_mapper.rb \
  -reducer sort_reducer.rb \
  -file `pwd`/scripts_sort/sort_reducer.rb
 
  The mapper code basically writes key, value = input_line, input_line.
  The reducer just prints the keys from the standard input.
  In case you care:
   $ cat scripts_sort/sort_*
  #!/usr/bin/ruby
 
  STDIN.each_line {|l| puts "#{l.chomp}\t#{l.chomp}"}
  -
  #!/usr/bin/ruby
 
  STDIN.each_line { |line| puts line.split[0] }
  I run the job and it completes without problems, the output looks like:
  d...@milhouse:~/tmp $ cat output/part-1
  1380664
  1467363
  32485
  3857847
  422538
  4354952
  4518219
  5719091
  7838358
  9686036
  d...@milhouse:~/tmp $ cat output/part-0
  1453024
  2592322
  3875994
  4689583
  5340522
  607354
  6447778
  6535495
  8647464
  9971681
  These are my questions:
  1. It seems the sorting (per reducer) is working but I don't know why,
for
  example,
  607354 is not the first number in the output.
 
  2. How can I tell hadoop to send data to the reducers in such a way that
  inputReduce1keys < inputReduce2keys < ... < inputReduceNkeys. In that way
  I would ensure the data is fully sorted once the job is done.
  I've tried also using the identity classes for the mapper and reducer
but
  the job dies generating
  exceptions about the input format.
  Can anyone show me or point me to some code showing how to properly
perform
  sorting.
  Thanks in advance,
  -drd
 
 



--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Re: sort example

2009-05-16 Thread Peter Skomoroch
I just copy-and-pasted that comparator option from the docs; the -n part is
what you want in this case.

On Sun, May 17, 2009 at 12:40 AM, Peter Skomoroch peter.skomor...@gmail.com
 wrote:

 1) It is doing alphabetical sort by default, you can force Hadoop streaming
 to sort numerically with:

 -D mapred.text.key.comparator.options=-k2,2nr\

 see the section A Useful Comparator Class in the streaming docs:

 http://hadoop.apache.org/core/docs/current/streaming.html
 and https://issues.apache.org/jira/browse/HADOOP-2302

 2) For the second issue, I think you will need to use 1 reducer to
 guarantee global sort order or use another MR pass.



 On Sun, May 17, 2009 at 12:14 AM, David Rio driodei...@gmail.com wrote:
 
  BTW,
  Basically, this is the unix equivalent to what I am trying to do:
  $ cat input_file.txt | sort -n
  -drd
 
  On Sat, May 16, 2009 at 11:10 PM, David Rio driodei...@gmail.com
 wrote:
 
   Hi,
   I am trying to sort some data with hadoop(streaming mode). The input
 looks
   like:
$ cat small_numbers.txt
   9971681
   9686036
   2592322
   4518219
   1467363
  
   To send my job to the cluster I use:
   hadoop jar
   /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar
 \
   -D mapred.reduce.tasks=2 \
   -D stream.num.map.output.key.fields=1 \
   -D mapred.text.key.comparator.options=-k1,1n \
   -input /input \
   -output /output \
   -mapper sort_mapper.rb \
   -file `pwd`/scripts_sort/sort_mapper.rb \
   -reducer sort_reducer.rb \
   -file `pwd`/scripts_sort/sort_reducer.rb
  
   The mapper code basically writes key, value = input_line, input_line.
   The reducer just prints the keys from the standard input.
   In case you care:
$ cat scripts_sort/sort_*
   #!/usr/bin/ruby
  
   STDIN.each_line {|l| puts "#{l.chomp}\t#{l.chomp}"}
   -
   #!/usr/bin/ruby
  
   STDIN.each_line { |line| puts line.split[0] }
   I run the job and it completes without problems, the output looks like:
   d...@milhouse:~/tmp $ cat output/part-1
   1380664
   1467363
   32485
   3857847
   422538
   4354952
   4518219
   5719091
   7838358
   9686036
   d...@milhouse:~/tmp $ cat output/part-0
   1453024
   2592322
   3875994
   4689583
   5340522
   607354
   6447778
   6535495
   8647464
   9971681
   These are my questions:
   1. It seems the sorting (per reducer) is working but I don't know why,
 for
   example,
   607354 is not the first number in the output.
  
   2. How can I tell hadoop to send data to the reducers in such a way that
   inputReduce1keys < inputReduce2keys < ... < inputReduceNkeys. In that way
   I would ensure the data is fully sorted once the job is done.
   I've tried also using the identity classes for the mapper and reducer
 but
   the job dies generating
   exceptions about the input format.
   Can anyone show me or point me to some code showing how to properly
 perform
   sorting.
   Thanks in advance,
   -drd
  
  



 --
 Peter N. Skomoroch
 617.285.8348
 http://www.datawrangling.com
 http://delicious.com/pskomoroch
 http://twitter.com/peteskomoroch




-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Re: How to get jobconf variables in streaming's mapper/reducer?

2009-05-15 Thread Peter Skomoroch
It took me a while to track this down, Todd is half right (at least for
18.3)...

mapred.task.partition actually turns into $mapred_task_partition  (note it
is lowercase)

for example, to get the filename in the mapper of a python streaming job:

--

import os

filename = os.environ["map_input_file"]
taskpartition = os.environ["mapred_task_partition"]

filename will have the form:

hdfs://domU-12-31-38-01-6C-F1.compute-1.internal:9000/user/root/myinputs/gzpagecounts/pagecounts-20090501-030001.gz
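
If you only need the file name itself (for example, to key the mapper output
by input file; that use case is just for illustration), something like this
works on top of the snippet above:

import os

basename = os.path.basename(os.environ["map_input_file"])
# basename is now e.g. "pagecounts-20090501-030001.gz"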

See:

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200904.mbox/%3c49e13557.7090...@domaintools.com%3e

and

http://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/streaming/src/java/org/apache/hadoop/streaming/PipeMapRed.java

-Pete

On Fri, May 15, 2009 at 8:01 PM, Todd Lipcon t...@cloudera.com wrote:

 Hi Steve,

 The variables are transformed before going to the mappers.
 mapred.task.partition turns into $MAPRED_TASK_PARTITION to be more unix-y

 -Todd

 On Fri, May 15, 2009 at 4:52 PM, Steve Gao steve@yahoo.com wrote:

  I am using streaming with perl, and I want to get jobconf variable
 values.
  As many tutorials say they are in environment, but I can not get them.
 
  For example, in reducer:
  while (<STDIN>) {
    my $part = $ENV{'mapred.task.partition'};
    print "$part\n";
  }
 
  It turns out that  $ENV{mapred.task.partition} is not defined.
 
  HOWEVER, I can get myself defined variable value. For example:
 
   $HADOOP_HOME/bin/hadoop  \
   jar $HADOOP_HOME/hadoop-streaming.jar \
   -input file1 \
   -output myOutputDir \
   -mapper mapper \
   -reducer reducer \
   -jobconf arg=test
 
  In reducer:
 
  while (<STDIN>) {
    my $part2 = $ENV{'arg'};
    print "$part2\n";
  }
 
 
  It works.
 
  Does anybody know why that is? How do I get jobconf variables in streaming?
  Thanks a lot!
 
 
 
 




-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Fast upload of input data to S3?

2009-05-14 Thread Peter Skomoroch
Does anyone have upload performance numbers to share or suggested utilities
for uploading Hadoop input data to S3 for an EC2 cluster?

I'm finding EBS volume transfer to HDFS via put to be extremely slow...

-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Re: Hadoop / MySQL

2009-04-28 Thread Peter Skomoroch
Thanks for sharing - sounds like a nice system. I always advise people to
avoid direct SQL inserts for batch jobs / large amounts of data and to use
MySQL's optimized LOAD utility like you did.  The same goes for Oracle...
Nothing brings a DB server to its knees like a ton of individual inserts on
indexed tables...
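
For anyone curious what that looks like in practice, here is a rough Python
sketch of a bulk load; the table, columns, file name, and connection details
are all made up, and it assumes the MySQLdb driver with local_infile enabled
on both client and server:

# Sketch: bulk-loading a tab-separated file instead of row-by-row INSERTs.
import MySQLdb

conn = MySQLdb.connect(host="dbhost", user="loader", passwd="secret",
                       db="logs", local_infile=1)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/shard_04.tsv'
    INTO TABLE pageviews
    FIELDS TERMINATED BY '\\t'
    (url, hits, day)
""")
conn.commit()
conn.close()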

On Tue, Apr 28, 2009 at 6:46 AM, Ankur Goel ankur.g...@corp.aol.com wrote:


 hello hadoop users,
 Recently I had a chance to lead a team building a log-processing system
 that uses Hadoop and MySQL. The system's goal was to process the incoming
 information as quickly as possible (real time or near real time), and make
 it available for querying in MySQL. I thought it would be good to share the
 experience and the challenges with the community. Couldn't think of a better
 place than these mailing lists as I am not much of a blogger :-)

 The information flow in the system looks something like

 [Apache-Servers] - [Hadoop] - [MySQL-shards] - [Query-Tools]

 Transferring from Apache-Servers to Hadoop was quite easy as we just had to
 organize the data in timely buckets (directories). Once that was running
 smoothly we had to make sure that map-reduce jobs were fired at regular
 intervals and picked up the right data. The jobs would then
 process/aggregate the data and dump the info into MySQL shards from the
 reducers [we have our own DB partitioning set up]. This is where we hit major
 bottlenecks [any surprises? :-)]

 The table engine used was InnoDB as there was a need for fast replication
 and writes but only moderate reads (it should eventually support high read
 rates). The data would take quite a while to load completely, far from being
 near real time. And so our optimization journey began.

 1. We tried to optimize/tune InnoDB parameters like increasing the buffer
 pool size to 75% of available RAM. This helped, but only as long as the DBs
 were lightly loaded, i.e. InnoDB had a sufficient buffer pool to host the
 data and indexes.

 2. We also realized that InnoDB has considerable locking overhead because
 of which write concurrency is really bad when you have a large number of
 concurrent threads doing writes. The default thread concurrency for us was
 set to no_of_cpu * 2 = 8 which is what the official documentation advises as
 the optimal limit. So we limited the number of reduce tasks and consequently
 the number of concurrent writes and boy the performance improved 4x. We were
 almost there :-)

 3. The next thing we tried was standard DB optimization techniques like
 de-normalizing the schema and dropping constraints. This gave only a minor
 performance improvement, nothing earth shattering. Note that we were already
 caching connections to each MySQL shard in the reducers, and the partitioning
 logic was embedded in the reducers.

 4. Falling still short of our performance objectives, we finally we decided
 to get rid of JDBC writes from reducers and work on an alternative that uses
 MySQLs LOAD utility.
 - The processing would partition the data into MySQL shard specific files
 resident in HDFS.
 - A script would then spawn processes via ssh on different physical
 machines to download this data.
 - Each spawned process just downloads the data for the shard it should
 upload to.
 - All the processes then start uploading data in parallel into their
 respective MySQL shards using LOAD DATA infile.

 This proved to be the fastest approach, even in the wake of increasing data
 loads. The entire processing/loading would complete in less than 6 minutes.
 The system has been holding up quite well so far, even though we've had to
 limit the number of days for which we keep the data or else the MySQL servers
 get overwhelmed.

 Hope this is helpful to people.

 Regards
 -Ankur




-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Re: Hadoop and Matlab

2009-04-21 Thread Peter Skomoroch
If you can compile the Matlab code to an executable with the Matlab
compiler and send it to the nodes with the distributed cache, that
should work... You probably want to avoid licensing fees for running
copies of Matlab itself on the cluster.
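
For example, a small wrapper used as the streaming mapper could pipe each
input line through the compiled executable shipped to the nodes. This is only
a sketch; the binary name run_analysis and its command-line interface are
hypothetical:

#!/usr/bin/env python
# matlab_mapper.py -- hypothetical wrapper around a compiled MATLAB binary
# distributed to the nodes (e.g. via streaming's -file or -cacheArchive).
import sys
import subprocess

for line in sys.stdin:
    # Hand one record to the compiled executable and emit whatever it
    # prints as the map output; error handling omitted for brevity.
    proc = subprocess.Popen(["./run_analysis", line.strip()],
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    sys.stdout.write(out)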


Sent from my iPhone

On Apr 21, 2009, at 1:55 PM, Sameer Tilak sameer.u...@gmail.com wrote:


Hi there,

We're working on an image analysis project. The image processing  
code is
written in Matlab. If I invoke that code from a shell script and  
then use
that shell script within Hadoop streaming, will that work? Has
anyone done something along these lines?

Many thanks,
--ST.


Re: Hadoop streaming performance: elements vs. vectors

2009-04-07 Thread Peter Skomoroch
Amareshwari,

Thanks for the suggestion, can you show a streaming jobconf that uses
mapred.job.classpath.archives to add a custom combiner to the classpath?

I've tried several variations, but the jar doesn't seem to get added to the
classpath properly...

-Pete

On Mon, Apr 6, 2009 at 12:17 AM, Amareshwari Sriramadasu 
amar...@yahoo-inc.com wrote:

 You can add your jar to the distributed cache and add it to the classpath by
 passing it in the configuration property mapred.job.classpath.archives.

 -Amareshwari

 Peter Skomoroch wrote:

 If I need to use a custom streaming combiner jar in Hadoop 18.3, is there
 a
 way to add it to the classpath without the following patch?

 https://issues.apache.org/jira/browse/HADOOP-3570


 http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200809.mbox/%3c48cf78e3.10...@yahoo-inc.com%3e

 On Sat, Mar 28, 2009 at 2:28 PM, Peter Skomoroch
 peter.skomor...@gmail.com wrote:



 Paco,

 Thanks, good ideas on the combiner.  I'm going to tweak things a bit as
 you
 suggest and report back later...

 -Pete


 On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN cet...@gmail.com wrote:



 hi peter,
 thinking aloud on this -

 trade-offs may depend on:

  * how much grouping would be possible (tracking a PDF would be
 interesting for metrics)
  * locality of key/value pairs (distributed among mapper and reducer
 tasks)

 to that point, will there be much time spent in the shuffle?  if so,
 it's probably cheaper to shuffle/sort the grouped row vectors than the
 many small key,value pair

 in any case, when i had a similar situation on a large data set (2-3
 Tb shuffle) a good pattern to follow was:

  * mapper emitted small key,value pairs
  * combiner grouped into row vectors

 that combiner may get invoked both at the end of the map phase and at
 the beginning of the reduce phase (more benefit)

 also, using byte arrays if possible to represent values may be able to
 save much shuffle time

 best,
 paco


 On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch
 peter.skomor...@gmail.com wrote:


 Hadoop streaming question: If I am forming a matrix M by summing a number of
 elements generated on different mappers, is it better to emit tons of lines
 from the mappers with small key,value pairs for each element, or should I
 group them into row vectors before sending to the reducers?

 For example, say I'm summing frequency count matrices M for each user on a
 different map task, and the reducer combines the resulting sparse user count
 matrices for use in another calculation.

 Should I emit the individual elements:

 i (j, Mij) \n
 3 (1, 3.4) \n
 3 (2, 3.4) \n
 3 (3, 3.4) \n
 4 (1, 2.3) \n
 4 (2, 5.2) \n

 Or posting list style vectors?

 3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
 4 ((1, 2.3), (2, 5.2)) \n

 Using vectors will at least save some message space, but are there any other
 benefits to this approach in terms of Hadoop streaming overhead (sorts
 etc.)?  I think buffering issues will not be a huge concern since the length
 of the vectors have a reasonable upper bound and will be in a sparse
 format...


 --
 Peter N. Skomoroch
 617.285.8348
 http://www.datawrangling.com
 http://delicious.com/pskomoroch
 http://twitter.com/peteskomoroch




 --
 Peter N. Skomoroch
 617.285.8348
 http://www.datawrangling.com
 http://delicious.com/pskomoroch
 http://twitter.com/peteskomoroch












-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Re: Amazon Elastic MapReduce

2009-04-06 Thread Peter Skomoroch
Intermediate results can be stored in hdfs on the EC2 machines, or in S3
using s3n... performance is better if you store on hdfs:

  "-input", "s3n://elasticmapreduce/samples/similarity/lastfm/input/",
  "-output", "hdfs:///home/hadoop/output2/",



On Mon, Apr 6, 2009 at 11:27 AM, Patrick A. patrickange...@gmail.com wrote:


 Are intermediate results stored in S3 as well?

 Also, any plans to support HTable?



 Chris K Wensel-2 wrote:
 
 
  FYI
 
  Amazons new Hadoop offering:
  http://aws.amazon.com/elasticmapreduce/
 
  And Cascading 1.0 supports it:
  http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html
 
  cheers,
  ckw
 
  --
  Chris K Wensel
  ch...@wensel.net
  http://www.cascading.org/
  http://www.scaleunlimited.com/
 
 
 

 --
 View this message in context:
 http://www.nabble.com/Amazon-Elastic-MapReduce-tp22842658p22911128.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Re: Hadoop streaming performance: elements vs. vectors

2009-04-05 Thread Peter Skomoroch
If I need to use a custom streaming combiner jar in Hadoop 18.3, is there a
way to add it to the classpath without the following patch?

https://issues.apache.org/jira/browse/HADOOP-3570

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200809.mbox/%3c48cf78e3.10...@yahoo-inc.com%3e

On Sat, Mar 28, 2009 at 2:28 PM, Peter Skomoroch
peter.skomor...@gmail.com wrote:

 Paco,

 Thanks, good ideas on the combiner.  I'm going to tweak things a bit as you
 suggest and report back later...

 -Pete


 On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN cet...@gmail.com wrote:

 hi peter,
 thinking aloud on this -

 trade-offs may depend on:

   * how much grouping would be possible (tracking a PDF would be
 interesting for metrics)
   * locality of key/value pairs (distributed among mapper and reducer
 tasks)

 to that point, will there be much time spent in the shuffle?  if so,
 it's probably cheaper to shuffle/sort the grouped row vectors than the
 many small key,value pair

 in any case, when i had a similar situation on a large data set (2-3
 Tb shuffle) a good pattern to follow was:

   * mapper emitted small key,value pairs
   * combiner grouped into row vectors

 that combiner may get invoked both at the end of the map phase and at
 the beginning of the reduce phase (more benefit)

 also, using byte arrays if possible to represent values may be able to
 save much shuffle time

 best,
 paco


 On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch
 peter.skomor...@gmail.com wrote:
  Hadoop streaming question: If I am forming a matrix M by summing a
 number of
  elements generated on different mappers, is it better to emit tons of
 lines
  from the mappers with small key,value pairs for each element, or should
 I
  group them into row vectors before sending to the reducers?
 
  For example, say I'm summing frequency count matrices M for each user on
 a
  different map task, and the reducer combines the resulting sparse user
 count
  matrices for use in another calculation.
 
  Should I emit the individual elements:
 
  i (j, Mij) \n
  3 (1, 3.4) \n
  3 (2, 3.4) \n
  3 (3, 3.4) \n
  4 (1, 2.3) \n
  4 (2, 5.2) \n
 
  Or posting list style vectors?
 
  3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
  4 ((1, 2.3), (2, 5.2)) \n
 
  Using vectors will at least save some message space, but are there any
 other
  benefits to this approach in terms of Hadoop streaming overhead (sorts
  etc.)?  I think buffering issues will not be a huge concern since the
 length
  of the vectors have a reasonable upper bound and will be in a sparse
  format...
 
 
  --
  Peter N. Skomoroch
  617.285.8348
  http://www.datawrangling.com
  http://delicious.com/pskomoroch
  http://twitter.com/peteskomoroch
 




 --
 Peter N. Skomoroch
 617.285.8348
 http://www.datawrangling.com
 http://delicious.com/pskomoroch
 http://twitter.com/peteskomoroch




-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Re: Amazon Elastic MapReduce

2009-04-02 Thread Peter Skomoroch
Kevin,

The API accepts any arguments you can pass in the standard jobconf for
Hadoop 18.3, it is pretty easy to convert over an existing jobflow to a JSON
job description that will run on the service.

-Pete

On Thu, Apr 2, 2009 at 2:44 PM, Kevin Peterson kpeter...@biz360.com wrote:

 So if I understand correctly, this is an automated system to bring up a
 hadoop cluster on EC2, import some data from S3, run a job flow, write the
 data back to S3, and bring down the cluster?

 This seems like a pretty good deal. At the pricing they are offering,
 unless
 I'm able to keep a cluster at more than about 80% capacity 24/7, it'll be
 cheaper to use this new service.

 Does this use an existing Hadoop job control API, or do I need to write my
 flows to conform to Amazon's API?




-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Hadoop streaming performance: elements vs. vectors

2009-03-28 Thread Peter Skomoroch
Hadoop streaming question: If I am forming a matrix M by summing a number of
elements generated on different mappers, is it better to emit tons of lines
from the mappers with small key,value pairs for each element, or should I
group them into row vectors before sending to the reducers?

For example, say I'm summing frequency count matrices M for each user on a
different map task, and the reducer combines the resulting sparse user count
matrices for use in another calculation.

Should I emit the individual elements:

i (j, Mij) \n
3 (1, 3.4) \n
3 (2, 3.4) \n
3 (3, 3.4) \n
4 (1, 2.3) \n
4 (2, 5.2) \n

Or posting list style vectors?

3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
4 ((1, 2.3), (2, 5.2)) \n

Using vectors will at least save some message space, but are there any other
benefits to this approach in terms of Hadoop streaming overhead (sorts
etc.)?  I think buffering issues will not be a huge concern since the lengths
of the vectors have a reasonable upper bound and the vectors will be in a sparse
format...
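
For what it's worth, a combiner that groups the element-level pairs into the
posting-list style above could look roughly like this Python 2 sketch (it
assumes sorted "i<TAB>(j, Mij)" lines on stdin; the exact formats are made up):

# Sketch of a grouping combiner: collapse sorted "i<TAB>(j, Mij)" element
# lines into one posting-list style row vector per row index i.
import sys

current_row = None
entries = []

def flush():
    if current_row is not None:
        print "%s\t(%s)" % (current_row, ", ".join(entries))

for line in sys.stdin:
    i, _, element = line.rstrip("\n").partition("\t")
    if i != current_row:
        flush()
        current_row, entries = i, []
    entries.append(element)

flush()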


-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Re: Hadoop streaming performance: elements vs. vectors

2009-03-28 Thread Peter Skomoroch
Paco,

Thanks, good ideas on the combiner.  I'm going to tweak things a bit as you
suggest and report back later...

-Pete

On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN cet...@gmail.com wrote:

 hi peter,
 thinking aloud on this -

 trade-offs may depend on:

   * how much grouping would be possible (tracking a PDF would be
 interesting for metrics)
   * locality of key/value pairs (distributed among mapper and reducer
 tasks)

 to that point, will there be much time spent in the shuffle?  if so,
 it's probably cheaper to shuffle/sort the grouped row vectors than the
  many small key,value pairs

 in any case, when i had a similar situation on a large data set (2-3
 Tb shuffle) a good pattern to follow was:

   * mapper emitted small key,value pairs
   * combiner grouped into row vectors

 that combiner may get invoked both at the end of the map phase and at
 the beginning of the reduce phase (more benefit)

 also, using byte arrays if possible to represent values may be able to
 save much shuffle time

 best,
 paco


 On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch
 peter.skomor...@gmail.com wrote:
  Hadoop streaming question: If I am forming a matrix M by summing a number
 of
  elements generated on different mappers, is it better to emit tons of
 lines
  from the mappers with small key,value pairs for each element, or should I
  group them into row vectors before sending to the reducers?
 
  For example, say I'm summing frequency count matrices M for each user on
 a
  different map task, and the reducer combines the resulting sparse user
 count
  matrices for use in another calculation.
 
  Should I emit the individual elements:
 
  i (j, Mij) \n
  3 (1, 3.4) \n
  3 (2, 3.4) \n
  3 (3, 3.4) \n
  4 (1, 2.3) \n
  4 (2, 5.2) \n
 
  Or posting list style vectors?
 
  3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
  4 ((1, 2.3), (2, 5.2)) \n
 
  Using vectors will at least save some message space, but are there any
 other
  benefits to this approach in terms of Hadoop streaming overhead (sorts
  etc.)?  I think buffering issues will not be a huge concern since the
 length
  of the vectors have a reasonable upper bound and will be in a sparse
  format...
 
 
  --
  Peter N. Skomoroch
  617.285.8348
  http://www.datawrangling.com
  http://delicious.com/pskomoroch
  http://twitter.com/peteskomoroch
 




-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch


Re: Iterative feedback in map reduce....

2009-03-27 Thread Peter Skomoroch
Check out the EM example in nltk:

http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk_contrib/hadoop/EM/runStreaming.py
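
One common pattern for this kind of iterative job is a driver loop outside
Hadoop: run one streaming pass, pull the reduced parameter back out of the
output directory, and ship it to the next pass's mappers with -file. A rough
Python sketch of that loop (all paths and script names are hypothetical, and
it assumes an initial params.txt exists before the first iteration):

# Rough sketch of an iterative streaming driver (paths/names hypothetical).
import subprocess

NUM_ITERATIONS = 10
STREAMING_JAR = "/path/to/hadoop-streaming.jar"

for i in range(NUM_ITERATIONS):
    out_dir = "/em/output_%d" % i
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-input", "/em/input",
        "-output", out_dir,
        "-mapper", "em_mapper.py",
        "-reducer", "em_reducer.py",
        "-file", "em_mapper.py",
        "-file", "em_reducer.py",
        "-file", "params.txt",   # the global parameter from the last pass
    ])
    # Pull the summed parameter matrix back out of HDFS for the next pass.
    f = open("params.txt", "w")
    subprocess.check_call(["hadoop", "fs", "-cat", out_dir + "/part-*"],
                          stdout=f)
    f.close()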

On Fri, Mar 27, 2009 at 5:19 PM, Sid123 itis...@gmail.com wrote:


 Hi,
 I have to design an iterative algorithm; each iteration is an M-R cycle that
 calculates a parameter and has to feed it back to all the maps in the next
 iteration.
 In the reduce procedure I need to just sum everything from the map
 procedure (many similarly sized matrices) into a single matrix (of the same
 size as each input matrix), irrespective of the key. This single matrix is
 the parameter I was talking about earlier. These are the things I want to
 know (PS: this parameter MUST BE global to all map processes):

 1) How do I collect all the values into one single parameter? Do I need to
 write it to the file system or can I keep it in memory? I feel that I WILL
 have to write it to HDFS somewhere...
 --
 View this message in context:
 http://www.nabble.com/Iterative-feedback-in-map-reduce-tp22748317p22748317.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch