Re: running pig on amazon ec2

Dexin Wang Tue, 14 Jun 2011 10:56:01 -0700

Thanks for your feedback. My comments below.

On Tue, Jun 14, 2011 at 10:41 AM, Daniel Dai <jiany...@yahoo-inc.com> wrote:


> Curious, couple of questions:
> 1. Are you running in local mode or mapreduce mode?
>
Local mode (-x local) when I ran it on my laptop, and mapreduce mode when I
ran it on the EC2 cluster.

> 2. If mapreduce mode, did you look into the hadoop log to see how much
> each mapreduce job slows down?
>
I'm looking into that.


> 3. What kind of query is it?
>
The input is gzipped JSON files with one event per line. I do some hourly
aggregation on the raw events, then a bunch of grouping, joining, and
metrics computation (like median and variance) on some fields.
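
To make the shape of the query concrete, here is a rough Python sketch of
the hourly aggregation step. It is not the actual Pig script: the field
names `ts` (epoch seconds) and `value` are hypothetical, and population
variance is an arbitrary choice for illustration.

```python
import gzip
import json
import statistics
from collections import defaultdict
from datetime import datetime, timezone


def hourly_stats(lines):
    """Group JSON events by UTC hour and compute per-hour metrics.

    Assumes each line is a JSON object with the hypothetical fields
    `ts` (epoch seconds) and `value` (a number).
    """
    buckets = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        hour = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
        buckets[hour.strftime("%Y-%m-%d %H:00")].append(event["value"])
    return {
        hour: {
            "count": len(values),
            "median": statistics.median(values),
            "variance": statistics.pvariance(values),
        }
        for hour, values in buckets.items()
    }


def hourly_stats_from_gzip(path):
    """Read a gzipped file with one JSON event per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return hourly_stats(f)
```

Running something like this on a sample of the input is one cheap way to
check whether the computation itself is heavy or whether the time goes to
cluster overhead.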

> Daniel
>

Someone mentioned that EC2's I/O performance might be the cause. But I'm
sure plenty of people run big MR jobs on EC2/EMR, so it seems more likely
that I have a configuration issue. My jobs can be optimized a bit, but the
fact that they run faster on my laptop tells me this is a separate issue.

Thanks!



> On 06/13/2011 11:54 AM, Dexin Wang wrote:
>
>> Hi,
>>
>> This is probably not directly a Pig question.
>>
>> Anyone running Pig on amazon EC2 instances? Something's not making sense
>> to me. I ran a Pig script that has about 10 mapred jobs in it on a 16 node
>> cluster using m1.small. It took *13 minutes*. The job reads input from S3
>> and writes output to S3. But from the logs the reading and writing part
>> to/from S3 is pretty fast. And all the intermediate steps should happen on
>> HDFS.
>>
>> Running the same job on my mbp laptop, it only took *3 minutes*.
>>
>> Amazon is using Pig 0.6 while I'm using Pig 0.8 on my laptop. I'll try
>> Pig 0.6 on my laptop. Some hadoop config is probably also not ideal. I
>> tried m1.large instead of m1.small, but it doesn't seem to make a huge
>> difference. Anything you would suggest looking at to find the cause of
>> the slowness on EC2?
>>
>> Dexin
>>
>
>
