Direct link to HADOOP-4842:
https://issues.apache.org/jira/browse/HADOOP-4842
On Tue, May 19, 2009 at 5:04 PM, Peter Skomoroch wrote:
> Whoops, should have googled it first. Looks like this is now fixed in
> trunk, HADOOP-4842. For people stuck using 18.3, a workaround appears to be
> ... patched into distributions like EMR and Cloudera
On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch wrote:
> One area I'm curious about is the requirement that any combiners in
> Streaming jobs be java classes. Are there any plans to change this in the
> future? Prototyping streaming jobs in
One area I'm curious about is the requirement that any combiners in
Streaming jobs be java classes. Are there any plans to change this in the
future? Prototyping streaming jobs in Python is great, and the ability to
use a Python combiner would help performance a lot without needing to move
to Java.
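Until streaming supports non-Java combiners, one common workaround is "in-mapper combining": aggregate values in a dict inside the mapper and emit the combined pairs once input is exhausted. A minimal sketch, assuming a word-count-style job (the framing is my own illustration, not from this thread):

```python
import sys
from collections import defaultdict

def mapper(lines):
    """In-mapper combining: sum counts locally before emitting,
    which cuts shuffle volume much like a Java combiner would."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            counts[word] += 1
    for word, count in sorted(counts.items()):
        yield "%s\t%d" % (word, count)

# In a real streaming job the mapper would be driven by stdin:
#   for out in mapper(sys.stdin): print(out)
```

The trade-off is mapper-side memory for the dict, so this works best when the key space per map task is bounded.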
I just copy-and-pasted that comparator option from the docs; the -n part is
what you want in this case.
On Sun, May 17, 2009 at 12:40 AM, Peter Skomoroch wrote:
> 1) It is doing alphabetical sort by default, you can force Hadoop streaming
> to sort numerically with:
1) It is doing alphabetical sort by default, you can force Hadoop streaming
to sort numerically with:
-D mapred.text.key.comparator.options=-k2,2nr
see the section "A Useful Comparator Class" in the streaming docs:
http://hadoop.apache.org/core/docs/current/streaming.html
and https://issues.apa
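The difference matters because streaming's default sort compares keys as plain strings. A small local illustration of what `-k2,2nr` (numeric, reverse, on field 2) asks for versus the default alphabetical order (sample data is made up):

```python
rows = ["a\t10", "b\t9", "c\t100"]

# Default shuffle ordering: plain string comparison,
# so "10" sorts before "100", which sorts before "9"
alpha = sorted(rows, key=lambda r: r.split("\t")[1])

# What -k2,2nr requests: compare field 2 as a number, descending
numeric = sorted(rows, key=lambda r: int(r.split("\t")[1]), reverse=True)
```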
It took me a while to track this down; Todd is half right (at least for
18.3)...
mapred.task.partition actually turns into $mapred_task_partition (note it
is lowercase, with the dots replaced by underscores)
for example, to get the filename in the mapper of a python streaming job:
--
import sys, os
# streaming exports job config values as env vars, dots -> underscores:
filename = os.environ["map_input_file"]
Does anyone have upload performance numbers to share or suggested utilities
for uploading Hadoop input data to S3 for an EC2 cluster?
I'm finding EBS volume transfer to HDFS via put to be extremely slow...
--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskom
Thanks for sharing, sounds like a nice system - I always advise people to
avoid direct SQL inserts for batch jobs / large amounts of data and use
MySQL's optimized LOAD utility like you did. Same goes for Oracle...
Nothing brings a DB server to its knees like a ton of individual inserts on
indexed tables.
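The bulk-load approach boils down to writing the batch out as a delimited file and issuing a single statement. A hedged sketch of building that statement (table and column names here are hypothetical):

```python
def build_load_statement(table, columns, infile="batch.tsv"):
    """Build a MySQL bulk-load statement; a single LOAD DATA pass
    avoids the per-row index maintenance that individual INSERTs
    trigger on indexed tables."""
    cols = ", ".join(columns)
    return (
        "LOAD DATA LOCAL INFILE '%s' INTO TABLE %s "
        "FIELDS TERMINATED BY '\\t' (%s)" % (infile, table, cols)
    )
```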
If you can compile the matlab code to an executable with the matlab
compiler and send it to the nodes with the distributed cache that
should work... You probably want to avoid licensing fees for running
copies of matlab itself on the cluster.
Sent from my iPhone
On Apr 21, 2009, at 1:55 PM
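Shipping a compiled binary with the distributed cache might look something like this from streaming; all paths and names below are placeholders, and it assumes the MATLAB Compiler Runtime is already installed on the nodes:

```shell
# Ship a MATLAB-compiled executable to the task nodes via the
# distributed cache, symlinked into the working dir as run_model
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input /data/in \
  -output /data/out \
  -mapper ./run_model \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
  -cacheFile "hdfs:///apps/run_model#run_model"
```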
On Apr 6, 2009 at 12:17 AM, Amareshwari Sriramadasu <amar...@yahoo-inc.com> wrote:
> You can add your jar to distributed cache and add it to classpath by
> passing it in configuration propery - "mapred.job.classpath.archives".
>
> -Amareshwari
>
> Peter Skomoroch wrote:
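A sketch of what the classpath-archive approach Amareshwari describes might look like from streaming (the jar name and all paths are placeholders):

```shell
# Cache a jar in HDFS and add it to the task classpath via the
# mapred.job.classpath.archives configuration property
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -D mapred.job.classpath.archives="hdfs:///libs/mylib.jar" \
  -input /data/in \
  -output /data/out \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py -file reducer.py
```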
Intermediate results can be stored in hdfs on the EC2 machines, or in S3
using s3n... performance is better if you store on hdfs:
"-input",
"s3n://elasticmapreduce/samples/similarity/lastfm/input/",
"-output","hdfs:///home/hadoop/output2/",
On Mon, Apr 6, 2
On Sat, Mar 28, 2009 at 2:28 PM, Peter Skomoroch wrote:
> Paco,
>
> Thanks, good ideas on the combiner. I'm going to tweak things a bit as you
> suggest and report back later...
>
> -Pete
>
>
> On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN wrote:
>
>> hi peter
Kevin,
The API accepts any arguments you can pass in the standard jobconf for
Hadoop 18.3; it is pretty easy to convert an existing jobflow to a JSON
job description that will run on the service.
-Pete
On Thu, Apr 2, 2009 at 2:44 PM, Kevin Peterson wrote:
> So if I understand correctly, t
> ... get invoked both at the end of the map phase and at
> the beginning of the reduce phase (more benefit)
>
> also, using byte arrays if possible to represent values may be able to
> save much shuffle time
>
> best,
> paco
>
>
> On Sat, Mar 28, 2009 at 01:51, Peter
Hadoop streaming question: If I am forming a matrix M by summing a number of
elements generated on different mappers, is it better to emit tons of lines
from the mappers with small key,value pairs for each element, or should I
group them into row vectors before sending to the reducers?
For example
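One way to frame the trade-off: grouping elements into row vectors in the mapper shrinks the number of key/value pairs shuffled, at the cost of a little mapper-side memory. A toy sketch of the grouped emission (the output format is my own, not from the thread):

```python
from collections import defaultdict

def emit_row_vectors(elements, ncols):
    """elements: iterable of (row, col, value) triples from a mapper.
    Instead of one key/value pair per matrix element, accumulate each
    row locally and emit a single dense row vector per row key."""
    rows = defaultdict(lambda: [0.0] * ncols)
    for r, c, v in elements:
        rows[r][c] += v
    for r in sorted(rows):
        yield "%d\t%s" % (r, ",".join("%g" % x for x in rows[r]))
```

The reducer then sums far fewer, larger values per key, which usually shuffles faster when the matrix rows are reasonably dense.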
Check out the EM example in nltk:
http://code.google.com/p/nltk/source/browse/trunk/nltk/nltk_contrib/hadoop/EM/runStreaming.py
On Fri, Mar 27, 2009 at 5:19 PM, Sid123 wrote:
>
> Hi,
> I have to design an iterative algorithm, each iteration is an M-R cycle that
> calculates a parameter and has t
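The usual pattern for an iterative algorithm like EM on Hadoop is a driver loop that launches one M-R job per iteration, feeding each iteration the previous one's output, until the parameter converges. A skeleton of the loop, where `run_iteration` is a stand-in for whatever launches the actual streaming job and returns the updated estimate:

```python
def iterate_until_converged(run_iteration, theta, tol=1e-6, max_iters=50):
    """Driver loop for an iterative M-R algorithm: each call to
    run_iteration launches one job and returns the updated parameter;
    stop when the change between iterations falls below tol."""
    for i in range(max_iters):
        new_theta = run_iteration(theta)
        if abs(new_theta - theta) < tol:
            return new_theta, i + 1
        theta = new_theta
    return theta, max_iters
```

In practice `run_iteration` would shell out to the streaming jar with `-input` pointed at the previous iteration's `-output` directory, as the nltk EM example linked above does.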