Re: Hadoop & Python

2009-05-19 Thread Peter Skomoroch
Direct link to HADOOP-4842: https://issues.apache.org/jira/browse/HADOOP-4842 On Tue, May 19, 2009 at 5:04 PM, Peter Skomoroch wrote: > Whoops, should have googled it first. Looks like this is now fixed in > trunk, HADOOP-4842. For people stuck using 18.3, a workaround appears to be

Re: Hadoop & Python

2009-05-19 Thread Peter Skomoroch
tched into distributions like EMR and Cloudera On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch wrote: > One area I'm curious about is the requirement that any combiners in > Streaming jobs be java classes. Are there any plans to change this in the > future? Prototyping streaming jobs in

Re: Hadoop & Python

2009-05-19 Thread Peter Skomoroch
One area I'm curious about is the requirement that any combiners in Streaming jobs be java classes. Are there any plans to change this in the future? Prototyping streaming jobs in Python is great, and the ability to use a Python combiner would help performance a lot without needing to move to Jav
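As a rough illustration of what a Python combiner for a Streaming job could look like, here is a minimal sketch. It assumes a Streaming build that accepts a script as the combiner (the replies in this thread point to HADOOP-4842 for that), tab-separated "key<TAB>integer" records, and input that arrives grouped by key. The function names parse and combine are hypothetical and not taken from the thread.

```python
from itertools import groupby


def parse(line):
    """Split a 'key<TAB>value' record into (key, int(value)).

    Assumption: values are integer counts, as in a word-count style job.
    """
    key, _, value = line.rstrip("\n").partition("\t")
    return key, int(value)


def combine(lines):
    """Yield one 'key<TAB>sum' record per run of consecutive equal keys.

    Partial sums are safe for a combiner because addition is associative
    and commutative; the reducer can add the partial sums again.
    """
    pairs = (parse(line) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(value for _, value in group))
```

To use it as a script, loop over combine(sys.stdin) and print each record. If the keys do not arrive grouped, the output is still correct, it just contains more than one partial sum per key.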

Re: sort example

2009-05-16 Thread Peter Skomoroch
I just copy and pasted that comparator option from the docs, the -n part is what you want in this case. On Sun, May 17, 2009 at 12:40 AM, Peter Skomoroch wrote: > 1) It is doing alphabetical sort by default, you can force Hadoop streaming > to sort numerically with:

Re: sort example

2009-05-16 Thread Peter Skomoroch
1) It is doing an alphabetical sort by default; you can force Hadoop streaming to sort numerically with: -D mapred.text.key.comparator.options=-k2,2nr\ see the section "A Useful Comparator Class" in the streaming docs: http://hadoop.apache.org/core/docs/current/streaming.html and https://issues.apa
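To make the difference between the two orderings concrete, here is a small sketch in plain Python. It only mimics the ordering that -k2,2nr asks for (second field, numeric, reversed) and says nothing about how Hadoop's comparator is implemented. The function name sort_like_k2_2nr and the sample rows are made up for illustration.

```python
def sort_like_k2_2nr(lines, sep="\t"):
    """Sort records by their second field, numerically, largest first."""
    return sorted(lines, key=lambda line: float(line.split(sep)[1]), reverse=True)


# Hypothetical sample records: name<TAB>count
rows = ["a\t9", "b\t10", "c\t100"]

# Comparing the second field as text puts "9" ahead of "100" and "10".
alphabetical = sorted(rows, key=lambda line: line.split("\t")[1], reverse=True)

# Comparing it as a number gives the order most people expect.
numeric = sort_like_k2_2nr(rows)
```

Here alphabetical comes out as a (9), c (100), b (10), while numeric comes out as c (100), b (10), a (9).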

Re: How to get jobconf variables in streaming's mapper/reducer?

2009-05-15 Thread Peter Skomoroch
It took me a while to track this down, Todd is half right (at least for 18.3)... mapred.task.partition actually turns into $mapred_task_partition (note it is lowercase) for example, to get the filename in the mapper of a python streaming job: -- import sys, os filename = os.environ["ma
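The code in the message above is cut off in this listing, so what follows is a separate sketch rather than a reconstruction of it. It relies only on the naming rule the message describes: dots in the jobconf name become underscores and the case is kept, so mapred.task.partition is read as mapred_task_partition. The helper names jobconf_env_name and get_jobconf are hypothetical, and other characters that might need escaping are not handled.

```python
import os


def jobconf_env_name(name):
    """Map a jobconf key to the environment variable name seen by the task.

    Assumption: only dots need replacing, as in the example from the
    message (mapred.task.partition -> mapred_task_partition).
    """
    return name.replace(".", "_")


def get_jobconf(name, default=None, environ=None):
    """Look up a jobconf value from the task environment, or return default."""
    if environ is None:
        environ = os.environ
    return environ.get(jobconf_env_name(name), default)
```

Inside a streaming mapper, get_jobconf("mapred.task.partition") would return the value as a string, or None when the script is run outside Hadoop. Passing a default keeps the script testable on a local machine.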

Fast upload of input data to S3?

2009-05-14 Thread Peter Skomoroch
Does anyone have upload performance numbers to share or suggested utilities for uploading Hadoop input data to S3 for an EC2 cluster? I'm finding EBS volume transfer to HDFS via put to be extremely slow... -- Peter N. Skomoroch 617.285.8348 http://www.datawrangling.com http://delicious.com/pskom

Re: Hadoop / MySQL

2009-04-28 Thread Peter Skomoroch
Thanks for sharing, sounds like a nice system - I always advise people to avoid direct SQL inserts for batch jobs / large amounts of data and use MySQL's optimized LOAD utility like you did. Same goes for Oracle... Nothing brings a DB server to its knees like a ton of individual inserts on indexed

Re: Hadoop and Matlab

2009-04-21 Thread Peter Skomoroch
If you can compile the matlab code to an executable with the matlab compiler and send it to the nodes with the distributed cache that should work... You probably want to avoid licensing fees for running copies of matlab itself on the cluster. Sent from my iPhone On Apr 21, 2009, at 1:55 PM

Re: Hadoop streaming performance: elements vs. vectors

2009-04-07 Thread Peter Skomoroch
6, 2009 at 12:17 AM, Amareshwari Sriramadasu < amar...@yahoo-inc.com> wrote: > You can add your jar to distributed cache and add it to classpath by > passing it in configuration property - "mapred.job.classpath.archives". > > -Amareshwari > > Peter Skomoroch wrote:

Re: Amazon Elastic MapReduce

2009-04-06 Thread Peter Skomoroch
Intermediate results can be stored in hdfs on the EC2 machines, or in S3 using s3n... performance is better if you store on hdfs: "-input", "s3n://elasticmapreduce/samples/similarity/lastfm/input/", "-output","hdfs:///home/hadoop/output2/", On Mon, Apr 6, 2

Re: Hadoop streaming performance: elements vs. vectors

2009-04-05 Thread Peter Skomoroch
Sat, Mar 28, 2009 at 2:28 PM, Peter Skomoroch wrote: > Paco, > > Thanks, good ideas on the combiner. I'm going to tweak things a bit as you > suggest and report back later... > > -Pete > > > On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN wrote: > >> hi peter

Re: Amazon Elastic MapReduce

2009-04-02 Thread Peter Skomoroch
Kevin, The API accepts any arguments you can pass in the standard jobconf for Hadoop 18.3, it is pretty easy to convert over an existing jobflow to a JSON job description that will run on the service. -Pete On Thu, Apr 2, 2009 at 2:44 PM, Kevin Peterson wrote: > So if I understand correctly, t

Re: Hadoop streaming performance: elements vs. vectors

2009-03-28 Thread Peter Skomoroch
et invoked both at the end of the map phase and at > the beginning of the reduce phase (more benefit) > > also, using byte arrays if possible to represent values may be able to > save much shuffle time > > best, > paco > > > On Sat, Mar 28, 2009 at 01:51, Peter

Hadoop streaming performance: elements vs. vectors

2009-03-28 Thread Peter Skomoroch
Hadoop streaming question: If I am forming a matrix M by summing a number of elements generated on different mappers, is it better to emit tons of lines from the mappers with small key,value pairs for each element, or should I group them into row vectors before sending to the reducers? For example

Re: Iterative feedback in map reduce....

2009-03-27 Thread Peter Skomoroch
Check out the EM example in nltk: http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk_contrib/hadoop/EM/runStreaming.py On Fri, Mar 27, 2009 at 5:19 PM, Sid123 wrote: > > HI, > I have to design an iterative algorithm, each iteration is a M-R cycle that > calculates a parameter and has t