replicated join gets extra job

2013-11-11 Thread Dexin Wang
Hi,

I'm running a job like this:

raw_large = LOAD 'lots_of_files' AS (...);
raw_filtered = FILTER raw_large BY ...;
large_table = FOREACH raw_filtered GENERATE f1, f2, f3;

joined_1 = JOIN large_table BY (key1) LEFT, config_table_1 BY (key2) USING 'replicated';
joined_2 = JOIN joined_1 BY (key3) LEFT, config_table_2 BY (key4) USING 'replicated';
joined_3 = JOIN joined_2 BY (key5) LEFT, config_table_3 BY (key6) USING 'replicated';
joined_4 = JOIN joined_3 BY (key7) LEFT, config_table_4 BY (key8) USING 'replicated';

Basically I'm left-joining a large table with 4 relatively small tables using
replicated joins.

I see a first job with 120 map tasks and no reducers, which seems to be doing
the load and filtering. Then there is a second job with 26 map tasks that
seems to be doing the joins.

Shouldn't there be only one job, with the joins done in the map phase of that
first job?

The 4 config tables (files) have these sizes respectively:

3MB
220kB
2kB
100kB

These are running on AWS EMR with Pig 0.9.2, on xlarge instances which have
15GB of memory.

Thanks!


Re: python version with Jython/Pig

2013-07-18 Thread Dexin Wang
Thanks.

Instead, I found a Python implementation of the erf function, so that'll be
good for now.

http://stackoverflow.com/questions/457408/is-there-an-easily-available-implementation-of-erf-for-python
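
The approximation behind many of those pure-Python erf snippets is Abramowitz
& Stegun formula 7.1.26 (maximum error around 1.5e-7). For reference, a
minimal Java version of the same formula, in case someone would rather compute
it inside a Java UDF than in Jython (class name hypothetical):

    public final class Erf {
        private Erf() {}

        // Abramowitz & Stegun formula 7.1.26; max absolute error ~1.5e-7.
        public static double erf(double x) {
            double sign = (x < 0) ? -1.0 : 1.0;
            x = Math.abs(x);
            double t = 1.0 / (1.0 + 0.3275911 * x);
            double poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                    - 0.284496736) * t + 0.254829592) * t;
            return sign * (1.0 - poly * Math.exp(-x * x));
        }
    }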


On Wed, Jul 17, 2013 at 5:08 PM, Cheolsoo Park piaozhe...@gmail.com wrote:

 Hi Dexin,

 Unfortunately, Pig is on Jython 2.5, so you won't be able to use Python 2.7
 modules.

 A while back, someone posted a hack to get Jython 2.7-b1 working with Pig.
 You might give it a try:

 http://search-hadoop.com/m/BnZs3MmH5y/jython+2.7subj=informational+getting+jython+2+7+b1+to+work

 Thanks,
 Cheolsoo





 On Wed, Jul 17, 2013 at 3:33 PM, Dexin Wang wangde...@gmail.com wrote:

  When I do Python UDF with Pig, how do we know which version of Python it
 is
  using? Is it possible to use a specific version of Python?
 
  Specifically my problem is in my UDF, I need to use a function in math
  module math.erf() which is newly introduced in Python version 2.7. I have
  Python 2.7 installed on my machine and standalone Python program runs
 fine
  but when I run it in Pig as Python UDF, I got this:
 
  AttributeError: type object 'org.python.modules.math' has no attribute
  'erf'
 
  My guess is Jython is using some pre-2.7 version of Python?
 
  Thanks for your help!
 
  Dexin
 



python version with Jython/Pig

2013-07-17 Thread Dexin Wang
When I write a Python UDF for Pig, how do I know which version of Python it is
using? Is it possible to use a specific version of Python?

Specifically, my problem is that in my UDF I need to use a function in the
math module, math.erf(), which was introduced in Python 2.7. I have Python 2.7
installed on my machine and the standalone Python program runs fine, but when
I run it in Pig as a Python UDF, I get this:

AttributeError: type object 'org.python.modules.math' has no attribute 'erf'

My guess is Jython is using some pre-2.7 version of Python?

Thanks for your help!

Dexin


Re: reference tuple field by name in UDF

2013-02-01 Thread Dexin Wang
Thanks. That would be nice; unfortunately we are using EMR, which only has
0.9, so that's not an option for us.

Similar question for Python UDFs: in my Python UDF, is referencing a field by
index (instead of by alias) the only option I have?
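
For anyone on Pig 0.10 or later, where EvalFunc.getInputSchema() is available,
a minimal sketch of the helper Jonathan describes below (class name and
aliases are hypothetical):

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    public class FieldByNameUDF extends EvalFunc<String> {

        // Look a field up by its alias instead of its position.
        private Object getByAlias(Tuple input, String alias) throws IOException {
            Schema s = getInputSchema();          // populated by Pig on 0.10+
            if (s == null) {
                throw new IOException("Input schema not available");
            }
            try {
                return input.get(s.getPosition(alias));
            } catch (Exception e) {
                throw new IOException("No field named " + alias, e);
            }
        }

        @Override
        public String exec(Tuple input) throws IOException {
            String v1 = (String) getByAlias(input, "f1");
            String v2 = (String) getByAlias(input, "f2");
            return v1 + "," + v2;
        }
    }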


On Tue, Jan 15, 2013 at 2:20 PM, Jonathan Coveney jcove...@gmail.comwrote:

 Another way to do it would be to make a helper function that does the
 following:

 input.get(getInputSchema().getPosition(alias));

 Only available in 0.10 and later (I think getInputSchema is in 0.10, at
 least...may only be in 0.11)


 2013/1/15 Dexin Wang wangde...@gmail.com

  Hi,
 
  In my own UDF, is reference a field by index the only way to access a
  field?
 
  The fields are all named and typed before passing into UDF but looks
 like I
  can only do something like this:
 
 String v1 = (String)input.get(0);
 String v2 = (String)input.get(1);
 String v3 = (String)input.get(2);
 
  instead I'd like to do something like this:
 
String v1 = (String)input.get(f1);
String v2 = (String)input.get(f2);
String v3 = (String)input.get(f3);
 
  since I have lots of field and I don't want to tie myself up the
  positioning of the fields.
 
  Any alternative? Thanks.
 
  Dexin
 



reference tuple field by name in UDF

2013-01-15 Thread Dexin Wang
Hi,

In my own UDF, is referencing a field by index the only way to access a field?

The fields are all named and typed before being passed into the UDF, but it
looks like I can only do something like this:

   String v1 = (String)input.get(0);
   String v2 = (String)input.get(1);
   String v3 = (String)input.get(2);

instead I'd like to do something like this:

  String v1 = (String)input.get(f1);
  String v2 = (String)input.get(f2);
  String v3 = (String)input.get(f3);

since I have lots of fields and I don't want to tie myself to the positions of
the fields.

Any alternative? Thanks.

Dexin


Re: Passing a BAG to Pig UDF constructor?

2012-06-27 Thread Dexin Wang
Actually, how do you pass a bag to a UDF? I did this:

a = LOAD 'file_a' AS (a1, a2, a3);

*bag1* = LOAD 'somefile' AS (f1, f2, f3);

b = FOREACH a GENERATE myUDF(*bag1*, a1, a2);

But I got this error:

 Invalid scalar projection: bag1 : A column needs to be projected from
a relation for it to be used as a scalar

What is the right way of doing this? Thanks.
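
A minimal sketch of the lazy-initialization idea Jonathan suggests below,
assuming the UDF receives the lookup data as a bag of (key, value) tuples in
its first argument (class and field names hypothetical):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;

    public class LookupUDF extends EvalFunc<String> {
        private Map<String, String> lookup;          // built once per task

        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() < 2) {
                return null;
            }
            if (lookup == null) {                    // lazy initialization on first call
                lookup = new HashMap<String, String>();
                DataBag bag = (DataBag) input.get(0);    // bag of (key, value) tuples
                for (Tuple t : bag) {
                    lookup.put((String) t.get(0), (String) t.get(1));
                }
            }
            String key = (String) input.get(1);
            return lookup.get(key);                  // the real algorithm would go here
        }
    }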


On Wed, Jun 27, 2012 at 10:30 AM, Dexin Wang wangde...@gmail.com wrote:

 That's a good idea (to pass the bag to UDF and initialize it on first UDF
 invocation). Thanks.

 Why do you think it is expensive Mridul?


 On Tue, Jun 26, 2012 at 2:50 PM, Mridul Muralidharan 
 mrid...@yahoo-inc.com wrote:



  -Original Message-
  From: Jonathan Coveney [mailto:jcove...@gmail.com]
  Sent: Wednesday, June 27, 2012 3:12 AM
  To: user@pig.apache.org
  Subject: Re: Passing a BAG to Pig UDF constructor?
 
  You can also just pass the bag to the UDF, and have a lazy initializer
  in exec that loads the bag into memory.


 Can you elaborate what you mean by pass the bag to the UDF ?
 Pass it as part of the input to the udf in exec and initialize it only
 once (first time) ? (If yes, this is expensive)
 Or something else ?


 Regards,
 Mridul



 
  2012/6/26 Mridul Muralidharan mrid...@yahoo-inc.com
 
   You could dump the data in a dfs file and pass the location of the
   file as param to your udf in define - so that it initializes itself
   using that data ...
  
  
   - Mridul
  
  
-Original Message-
From: Dexin Wang [mailto:wangde...@gmail.com]
Sent: Tuesday, June 26, 2012 10:58 PM
To: user@pig.apache.org
Subject: Passing a BAG to Pig UDF constructor?
   
Is it possible to pass a bag to a Pig UDF constructor?
   
Basically in the constructor I want to initialize some hash map so
that on every exec operation, I can use the hashmap to do a lookup
and find the value I need, and apply some algorithm to it.
   
I realize I could just do a replicated join to achieve similar
things but the algorithm is more than a few lines and there are
  some
edge cases so I would rather wrap that logic inside a UDF function.
I also realize I could just pass a file path to the constructor and
read the files to initialize the hashmap but my files are on
Amazon's S3 and I don't want to deal with
S3 API to read the file.
   
Is this possible or is there some alternative ways to achieve the
same thing?
   
Thanks.
Dexin
  





Passing a BAG to Pig UDF constructor?

2012-06-26 Thread Dexin Wang
Is it possible to pass a bag to a Pig UDF constructor?

Basically in the constructor I want to initialize some hash map so that on
every exec operation, I can use the hashmap to do a lookup and find the
value I need, and apply some algorithm to it.

I realize I could just do a replicated join to achieve similar things but
the algorithm is more than a few lines and there are some edge cases so I
would rather wrap that logic inside a UDF function. I also realize I could
just pass a file path to the constructor and read the files to initialize
the hashmap but my files are on Amazon's S3 and I don't want to deal with
S3 API to read the file.

Is this possible or is there some alternative ways to achieve the same
thing?

Thanks.
Dexin


Re: help!!--Does Pig can be use in this way?!

2012-04-04 Thread Dexin Wang
Or if it's simple like that, why not just grep?

On Wed, Apr 4, 2012 at 7:07 AM, Corbin Hobus cor...@tynt.com wrote:

 If you are just finding the age of one person you are much better off
 using a regular database and SQL or hbase of you need some kind of quick
 random access.

 Hadoop/pig is for batch calculations where you need to sift through lots
 of data.

 Sent from my iPhone

 On Apr 4, 2012, at 6:35 AM, 凯氏图腾 t...@tengs.info wrote:

  Hello!
  I wonder if Pig can be used in this way: I have many records, each
 containing a name and an age, and I want to find out one person's age. I know
 it is easy to do in Pig Latin, but I wonder whether it is a good idea to
 select one record this way, or whether it would be better to do it in SQL?
  Thanks!!



Re: filter out null lines returned by UDF

2012-03-07 Thread Dexin Wang
Yeah, that works great. Thank you, Jonathan.
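
For reference, a minimal sketch of the empty-bag approach Jonathan describes
below (the "id,name" line format and class name are hypothetical):

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class MyParseUDF extends EvalFunc<DataBag> {
        private static final TupleFactory TUPLES = TupleFactory.getInstance();
        private static final BagFactory BAGS = BagFactory.getInstance();

        @Override
        public DataBag exec(Tuple input) throws IOException {
            DataBag out = BAGS.newDefaultBag();
            String line = (input == null || input.size() == 0)
                    ? null : (String) input.get(0);
            if (line == null || !line.contains(",")) {
                return out;                         // empty bag: FLATTEN drops this row
            }
            String[] parts = line.split(",", 2);    // hypothetical "id,name" format
            Tuple t = TUPLES.newTuple(2);
            t.set(0, parts[0]);
            t.set(1, parts[1]);
            out.add(t);
            return out;
        }
    }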

On Thu, Mar 1, 2012 at 5:14 PM, Jonathan Coveney jcove...@gmail.com wrote:

 FLATTEN is kind of quirky. If you FLATTEN(null), it will return null, but
 if you FLATTEN a bag that is empty (ie size=0), it will throw away the row.
 I would have your UDF return an empty bag and let the flatten wipe it out.

 2012/3/1 Dexin Wang wangde...@gmail.com

  Hi,
 
  I have a UDF that parses a line and then return a bag, and sometimes the
  line is bad so I'm returning null in the UDF. In my pig script, I'd like
 to
  filter those nulls like this:
 
  raw = LOAD 'raw_input' AS (line:chararray);
  parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line));-- get two fields
 in
  the tuple: id and name
  DUMP parsed;
 
(id1,name1)
(id2,name2)
()
(id3,name3)
 
  parsed_no_nulls = FILTER parsed BY id IS NOT NULL;
  DUMP parsed_no_nulls;
 
(id1,name1)
(id2,name2)
(id3,name3)
 
  This works, but I'm getting this warning:
 
   WARN
 
 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger
  -
 
 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
  Attempt to access field which was not found in the input
 
  When I try to use IsEmpty to filter, I get this error Cannot test a NULL
  for emptiness.
 
  What's the correct way to filter out these null bags returned from my
 UDF?
 
  Thanks.
  Dexin
 



filter out null lines returned by UDF

2012-03-01 Thread Dexin Wang
Hi,

I have a UDF that parses a line and then returns a bag, and sometimes the line
is bad, so I'm returning null in the UDF. In my pig script, I'd like to filter
those nulls like this:

raw = LOAD 'raw_input' AS (line:chararray);
parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line)); -- get two fields in
the tuple: id and name
DUMP parsed;

   (id1,name1)
   (id2,name2)
   ()
   (id3,name3)

parsed_no_nulls = FILTER parsed BY id IS NOT NULL;
DUMP parsed_no_nulls;

   (id1,name1)
   (id2,name2)
   (id3,name3)

This works, but I'm getting this warning:

 WARN
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger
-
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
Attempt to access field which was not found in the input

When I try to use IsEmpty to filter, I get this error: "Cannot test a NULL
for emptiness."

What's the correct way to filter out these null bags returned from my UDF?

Thanks.
Dexin


Re: pig 0.9 slower in local mode?

2011-12-21 Thread Dexin Wang
Cool. Looking forward to trying out 0.9.2+patch.

On Mon, Dec 19, 2011 at 2:31 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 Yes, starting with 0.7 local mode was moved to sit on top of Hadoop's
 local mode (instead of being a completely separate implementation).
 This had two effects, one desirable, one not so much:

 1) get rid of surprises / bugs caused by having to support 2
 completely separate implementations of the runtime
 2) slow the thing down a lot.

 Julien sped up local mode quite a bit in 0.9.2, and we have another
 patch coming that further removes unnecessary slowness, but I'm afraid
 hadoop-based local mode will never be quite as fast as the old
 local-mode...

 D

 On Mon, Dec 19, 2011 at 2:23 PM, Dexin Wang wangde...@gmail.com wrote:
  I recently switched to pig 0.9.1 and noticed it runs slower than previous
  version (like 0.6 which was only recent version supported on Amazon
 couple
  of months ago) in local mode. Haven't tried the timing in hadoop mode
 yet.
 
  I figure it is probably due to some extra debugging or some parameter.
  Anything I can do to make it faster?
 
  Thanks,
  Dexin



pig 0.9 slower in local mode?

2011-12-19 Thread Dexin Wang
I recently switched to Pig 0.9.1 and noticed it runs slower in local mode than
the previous version I was using (0.6, which until a couple of months ago was
the only version supported on Amazon). I haven't tried the timing in hadoop
mode yet.

I figure it is probably due to some extra debugging or some parameter.
Anything I can do to make it faster?

Thanks,
Dexin


Re: multiple folder loading or passing comma on parameter with Amazon Pig

2011-08-18 Thread Dexin Wang
I will.

There is also a bug on Pig documentation here:

http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html

where it says

   In this example the command is executed and its stdout is used as the
parameter value.

  %declare CMD 'generate_date';

it should really be `generate_date` with the back ticks, not the single
quotes.

On Wed, Aug 17, 2011 at 6:18 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 Nice job figuring out a fix!
 You should seriously file a bug with AMR for that. That's kind of
 ridiculous.

 D

 On Wed, Aug 17, 2011 at 6:03 PM, Dexin Wang wangde...@gmail.com wrote:

  I solved my own problem and just want to share with whoever might
 encounter
  the same issue.
 
  I pass colon separated list then convert it to comma separated list
 inside
  pig script using declare command.
 
  Submit pig job  like this:
 
  -p SOURCE_DIRS=2011-08:2011-07:2011-06
 
  and in Pig script
 
  % declare SOURCE_DIRS_CONVERTED  `echo $SOURCE_DIRS | tr ':' ','`;
  LOAD '/root_dir/{$SOURCE_DIRS_CONVERTED}' ...
 
 
  On Wed, Aug 17, 2011 at 4:21 PM, Dexin Wang wangde...@gmail.com wrote:
 
   Hi,
  
   I'm running pig jobs using Amazon pig support, where you submit jobs
 with
   comma concatenated parameters like this:
  
elastic-mapreduce --pig-script --args myscript.pig --args
   -p,PARAM1=value1,-p,PARAM2=value2,-p,PARAM3=value3
  
   In my script, I need to pass multiple directories for the pig script to
   load like this:
  
raw = LOAD '/root_dir/{$SOURCE_DIRS}'
  
   and SOURCE_DIRS is computed. For example, it can be
   2011-08,2011-07,2011-06, meaning my pig script needs to load data for
  the
   past 3 months. This works fine when I run my job using local or direct
   hadoop mode. But with Amazon pig, I have to do something like this:
  
elastic-mapreduce --pig-script --args myscript.pig
   -p,SOURCE_DIRS=2011-08,2011-07,2011-06
  
   but emr will just replace commas with spaces so it breaks the parameter
   passing syntax. I've tried adding backslashes before commas, but I
 simply
   end up with back slash with space in between.
  
   So question becomes:
  
   1. can I do something differently than what I'm doing to pass multiple
   folders to pig script (without commas), or
   2. anyone knows how to properly pass commas to elastic-mapreduce ?
  
   Thanks!
  
   Dexin
  
 



conditional and multiple generate inside foreach?

2011-07-22 Thread Dexin Wang
Is it possible to do a conditional and more than one generate inside a foreach?

For example, I have tuples like this (name, days_ago):

(a,0)
(b,1)
(c,9)
(d,40)

b shows up 1 day ago, so it belongs to all of the following: yesterday, last
week, last month, and last quarter. So I'd like to turn the above to:

(a,0,today)
(b,1,yesterday)
(b,1,week)
(b,1,month)
(b,1,quarter)
(c,9,month)
(c,9,quarter)
(d,40,quarter)

I imagine/dream I could do something like this

B = FOREACH A
  {
    if (days_ago <= 90) generate name,days_ago,'quarter';
    if (days_ago <= 30) generate name,days_ago,'month';
    if (days_ago <= 7)  generate name,days_ago,'week';
    if (days_ago == 1)  generate name,days_ago,'yesterday';
    if (days_ago == 0)  generate name,days_ago,'today';
  }

Of course that's not valid syntax. I could write my own UDF, but it would be
nice if there were some way to get what I want without a UDF.

Thanks!
Dexin
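
One way to get this fan-out is a UDF that returns a bag of period labels,
which is then flattened, e.g. B = FOREACH A GENERATE name, days_ago,
FLATTEN(PeriodsFor(days_ago)); rows whose bag comes back empty simply
disappear. A minimal sketch following the thresholds in the pseudo-code above,
assuming days_ago is typed as an int (class name hypothetical):

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class PeriodsFor extends EvalFunc<DataBag> {
        private static final TupleFactory TUPLES = TupleFactory.getInstance();
        private static final BagFactory BAGS = BagFactory.getInstance();

        private void add(DataBag bag, String period) throws IOException {
            Tuple t = TUPLES.newTuple(1);
            t.set(0, period);
            bag.add(t);
        }

        @Override
        public DataBag exec(Tuple input) throws IOException {
            DataBag out = BAGS.newDefaultBag();
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return out;                       // empty bag: row vanishes after FLATTEN
            }
            int daysAgo = ((Number) input.get(0)).intValue();
            // Thresholds follow the pseudo-code in the message above.
            if (daysAgo == 0)  add(out, "today");
            if (daysAgo == 1)  add(out, "yesterday");
            if (daysAgo <= 7)  add(out, "week");
            if (daysAgo <= 30) add(out, "month");
            if (daysAgo <= 90) add(out, "quarter");
            return out;
        }
    }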


Re: why the udf can not work

2011-06-18 Thread Dexin Wang
You need to have your class file in this path

/home/huyong/test/myudfs/UPPER.class

since it's in the myudfs directory.


On Jun 18, 2011, at 12:33 PM, 勇胡 yongyong...@gmail.com wrote:

 I tried your command and then it shows me as following:
 /home/huyong/test/UPPER.class
 /home/huyong/test/UPPER.java
 
 Yong
 On Jun 18, 2011 at 4:29 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:
 
 This usually happens when you aren't registering what you think you are
 registering.
 try `jar tf /home/huyong/test/myudfs.jar | grep UPPER` and see if you
 get anything.
 
 D
 
 2011/6/18 勇胡 yongyong...@gmail.com:
 Hi,
 
 package myudfs;
 import java.io.IOException;
 import org.apache.pig.EvalFunc;
 import org.apache.pig.data.Tuple;
 import org.apache.pig.impl.util.*;
 
 public class UPPER extends EvalFunc<String>
 {
     public String exec(Tuple input) throws IOException {
         if (input == null || input.size() == 0)
             return null;
         try {
             String str = (String)input.get(0);
             return str.toUpperCase();
         } catch (Exception e) {
             throw new IOException(e);
         }
     }
 }
 
 This is the same as the example from the Pig website. By the way, I also
 added the PIG_CLASS. But it still didn't work.
 
 Yong
 
 2011/6/18 Jonathan Coveney jcove...@gmail.com
 
 Can you paste the content of the UDF?
 
 2011/6/18 勇胡 yongyong...@gmail.com
 
 Hello,
 
 I just tried the example from the Pig UDF manual step by step, but I got the
 following error. Can anyone tell me how to solve it?
 
 grunt> REGISTER /home/huyong/test/myudfs.jar;
 grunt> A = LOAD '/home/huyong/test/student.txt' as (name:chararray);
 grunt> B = FOREACH A GENERATE myudfs.UPPER(name);
 2011-06-18 11:15:38,892 [main] ERROR org.apache.pig.tools.grunt.Grunt
 -
 ERROR 1070: Could not resolve myudfs.UPPER using imports: [,
 org.apache.pig.builtin., org.apache.pig.impl.builtin.]
 Details at logfile: /home/huyong/test/pig_1308388238352.log
 
 I have already registered the UDF, so why does Pig try to search the builtin
 path?
 
 Thanks for your help!
 
 Yong Hu
 
 
 
 


Re: pig script takes much longer than java MR job

2011-06-17 Thread Dexin Wang
Yeah, that sounds like a lot to dump if it takes 15 minutes to run. That alone
can take a long time.

I once forgot to comment out some debug line in my UDF. When run with
production data, not only was it slow, it blew up the cluster - it simply ran
out of log space :)

On Jun 17, 2011, at 5:06 PM, Jonathan Coveney jcove...@gmail.com wrote:

 A couple of possibilities that I'm kicking around off the top of my head...
 
 1) Does your MR job also sort afterwards? That's going to kick off another
 MR job
 2) Does your MR job compile all the results into one job?
 
 My guess is the Order+Dump are making it take longer.
 
 2011/6/17 Sujee Maniyam su...@sujee.net
 
 I have log files like this:
  #timestamp (ms), server,user,action,domain , x,y ,
 z
  126233288, 7, 50817, 2, yahoo.com, 31, blahblah, foobar
  1262332800017, 2, 373168, 0, google.com, 67, blahblah, foobar
  1262332800025, 8, 172910, 1, facebook.com, 135, blahblah, foobar
 
 I have the following pig script to count the number of domains from logs. (
 For example, we have seen facebook.com 10 times ..etc.)
 
 Here is the pig script:
 
 
 records = LOAD '/logs-in/*.log' using PigStorage(',') AS (ts:long,
 server:int, user:int, action_id:int, domain:chararray, price:int);
 
 -- DUMP records;
 grouped_by_domain = GROUP records BY domain;
 -- DUMP grouped_by_domain;
 -- DESCRIBE grouped_by_domain;
 
 freq = FOREACH grouped_by_domain GENERATE group as domain, COUNT(records)
 as
 mycount;
 -- DESCRIBE freq;
 -- DUMP freq;
 
 sorted = ORDER freq BY mycount DESC;
 DUMP sorted;
 
 
  This script takes an hour to run. I also wrote a simple Java MR job to
 count the domains, it takes about 15 mins.  So the pig script is taking 4x
 longer to complete.
 
 any suggestions on what I am doing wrong in pig?
 
 thanks
 Sujee
 http://sujee.net
 


Re: running pig on amazon ec2

2011-06-15 Thread Dexin Wang
Thanks a lot for the good advice.

I'll see if I can get LZO set up. Currently I'm using EMR, which uses Pig 0.6.
I'll look into Whirr to start the Hadoop cluster on EC2.

There is one place in my job where I can use a replicated join; I'm sure that
will cut down some time.

What I find interesting is that without doing any optimization on the
configuration or code side, I get a 2x to 4x speedup just by using the
*Cluster Compute Quadruple Extra Large Instance* (cc1.4xlarge) as opposed to
the regular Large instance (m1.large) for the $$. They do claim cc1.4xlarge's
I/O is very high. Since I suspect most of my job was spending time
reading/writing disk, this speedup makes sense.

On Wed, Jun 15, 2011 at 6:46 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 you need to add this to your pig.properties:

 pig.tmpfilecompression=true
 pig.tmpfilecompression.codec=lzo

 Make sure that you are running hadoop 20.2 or higher, pig 8.1 or
 higher, and that all the lzo stuff is set up -- it's a bit involved.

 Use replicated joins where possible.

 If you are doing a large number of small jobs, scheduling and
 provisioning is likely to dominate -- tune your job scheduler to
 schedule more tasks per heartbeat and make sure your jar is as small
 as you can get it (there's a lot of unjarring going on in Hadoop)
 D

 On Wed, Jun 15, 2011 at 11:14 AM, Dexin Wang wangde...@gmail.com wrote:
  Tomas,
 
  What worked well for me is still to be figured out. Right now, it works
 but
  it's too slow. I think one of the main problem is that my job has many
  JOIN/GROUP BY, so lots of intermediate steps ending up writing to disk
 which
  is slow.
 
   On that note, does anyone know how to tell if LZO is turned on for
   intermediate jobs? Referring to this
 
 
 http://pig.apache.org/docs/r0.8.0/cookbook.html#Compress+the+Results+of+Intermediate+Jobs
 
  and this
 
 
 http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
 
  I see I have this in my mapred-site.xml file:
 
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
 
  Is that all I need to have map compression turned on? Thanks.
 
  Dexin
 
  On Tue, Jun 14, 2011 at 3:36 PM, Tomas Svarovsky
  svarovsky.to...@gmail.comwrote:
 
  Hi Dexin,
 
   Since I am a Pig and MapReduce newbie, your post is very intriguing to me.
   I am coming from a Talend background and trying to assess whether
   map/reduce would bring any speedup and faster turnaround to my projects.
   My worry is that my data are too small, so the map/reduce overhead will be
   prohibitive in certain cases.

   When using Talend, if the transformation was reasonable it could process
   tens of thousands of rows per second. Processing 1 million rows could be
   finished well under 1 minute, so I think your dataset is fairly small.
   Nevertheless my data are growing, so soon it will be time for Pig.
 
  Could you provide some info what worked well for you to run your job on
  EC2?
 
  Thanks in advance,
 
  Tomas
 
  On Tue, Jun 14, 2011 at 9:16 PM, Daniel Dai jiany...@yahoo-inc.com
  wrote:
   If the job finishes in 3 minutes in local mode, I would think it is
  small.
  
   On 06/14/2011 11:07 AM, Dexin Wang wrote:
  
   Good to know. Trying single node hadoop cluster now. The main input
 is
   about 1+ million lines of events. After some aggregation, it joins
 with
   another input source which has also about 1+ million rows. Is this
   considered small query? Thanks.
  
   On Tue, Jun 14, 2011 at 11:01 AM, Daniel Dai jiany...@yahoo-inc.com
   mailto:jiany...@yahoo-inc.com wrote:
  
  Local mode and mapreduce mode makes a huge difference. For a small
  query, the mapreduce overhead will dominate. For a fair
  comparison, can you setup a single node hadoop cluster on your
  laptop and run Pig on it?
  
  Daniel
  
  
  On 06/14/2011 10:54 AM, Dexin Wang wrote:
  
  Thanks for your feedback. My comments below.
  
  On Tue, Jun 14, 2011 at 10:41 AM, Daniel Dai
  jiany...@yahoo-inc.com mailto:jiany...@yahoo-inc.com wrote:
  
  Curious, couple of questions:
  1. Are you running in local mode or mapreduce mode?
  
  Local mode (-x local) when I ran it on my laptop, and mapreduce
  mode when I ran it on ec2 cluster.
  
  2. If mapreduce mode, did you look into the hadoop log to see
  how much slow down each mapreduce job does?
  
  I'm looking into that.
  
  3. What kind of query is it?
  
  The input is gzipped json files which has one event per line.
  Then I do some hourly aggregation on the raw events, then do
  bunch of groupping, joining and some metrics computing (like
  median, variance) on some fields.
  
  Daniel
  
   Someone mentioned it's EC2's I/O performance. But I'm sure there
  are plenty of people using EC2/EMR running big MR jobs so more
  likely I have some configuration issues? My jobs can

Re: running pig on amazon ec2

2011-06-14 Thread Dexin Wang
Thanks for your feedback. My comments below.

On Tue, Jun 14, 2011 at 10:41 AM, Daniel Dai jiany...@yahoo-inc.com wrote:

 Curious, couple of questions:
 1. Are you running in local mode or mapreduce mode?

Local mode (-x local) when I ran it on my laptop, and mapreduce mode when I
ran it on ec2 cluster.

2. If mapreduce mode, did you look into the hadoop log to see how much slow
 down each mapreduce job does?

I'm looking into that.


 3. What kind of query is it?

 The input is gzipped JSON files with one event per line. Then I do
some hourly aggregation on the raw events, then a bunch of grouping,
joining and some metrics computation (like median, variance) on some fields.

Daniel

  Someone mentioned it's EC2's I/O performance. But I'm sure there are
plenty of people using EC2/EMR running big MR jobs so more likely I have
some configuration issues? My jobs can be optimized a bit but the fact that
running on my laptop is faster tells me this is a separate issue.

Thanks!



 On 06/13/2011 11:54 AM, Dexin Wang wrote:

 Hi,

 This is probably not directly a Pig question.

 Anyone running Pig on amazon EC2 instances? Something's not making sense
 to
 me. I ran a Pig script that has about 10 mapred jobs in it on a 16 node
 cluster using m1.small. It took *13 minutes*. The job reads input from S3
 and writes output to S3. But from the logs the reading and writing part
 to/from S3 is pretty fast. And all the intermediate steps should happen on
 HDFS.

 Running the same job on my mbp laptop, it only took *3 minutes*.

 Amazon is using pig0.6 while I'm using pig 0.8 on laptop. I'll try Pig 0.6
 on my laptop. Some hadoop config is probably also not ideal. I tried
 m1.large instead of m1.small, doesn't seem to make a huge difference.
 Anything you would suggest to look for the slowness on EC2?

 Dexin





running pig on amazon ec2

2011-06-13 Thread Dexin Wang
Hi,

This is probably not directly a Pig question.

Anyone running Pig on amazon EC2 instances? Something's not making sense to
me. I ran a Pig script that has about 10 mapred jobs in it on a 16 node
cluster using m1.small. It took *13 minutes*. The job reads input from S3
and writes output to S3. But from the logs the reading and writing part
to/from S3 is pretty fast. And all the intermediate steps should happen on
HDFS.

Running the same job on my mbp laptop, it only took *3 minutes*.

Amazon is using Pig 0.6 while I'm using Pig 0.8 on my laptop. I'll try Pig 0.6
on my laptop. Some Hadoop config is probably also not ideal. I tried m1.large
instead of m1.small; it doesn't seem to make a huge difference. Anything you
would suggest looking at for the slowness on EC2?

Dexin


Re: Setting the store file name with date

2011-05-23 Thread Dexin Wang
I don't think the version is a problem; parameter substitution has probably
been supported since the beginning of Pig.

Using

STORE result INTO 'out-$date';

as I mentioned above, when you run the pig script you just add -param
date=20110522 to your command line. What is the problem you see?

2011/5/21 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com

 Thanks Dexin! I tried that but that did not work for me. I am using
 Pig 0.7, what version of Pig do you use?
 Yeah I think I could move it aside, but the problem is that I need to
 keep track of the results, and if I move them aside then I would have
 to rename the results of each job sequentially because my jobs can
 repeat many times, but their results are different.
 Thanks again.

 Renato M.

 2011/5/20 Dexin Wang wangde...@gmail.com:
  Yeah I do that all the time.
 
  STORE result INTO 'out-$date';
 
  Or you could run the pig script then after it's done move the result
 aside.
 
 
  On May 20, 2011, at 6:51 PM, Renato Marroquín Mogrovejo
 renatoj.marroq...@gmail.com wrote:
 
  Hi, I have a sequence of jobs which are run daily and usually the logs
  and results are erased every time they have to be re-run. Now we want
  to keep those logs and results, but if the results already exist, the
  pig job fails. I thought that maybe setting the results' name + date
  would solve it for me. Can I do that from pig? Do you guys have any
  other suggestions?
  Thanks in advance.
 
 
  Renato M.
 



Re: Setting the store file name with date

2011-05-20 Thread Dexin Wang
Yeah I do that all the time. 

STORE result INTO 'out-$date';

Or you could run the pig script then after it's done move the result aside. 


On May 20, 2011, at 6:51 PM, Renato Marroquín 
Mogrovejorenatoj.marroq...@gmail.com wrote:

 Hi, I have a sequence of jobs which are run daily and usually the logs
 and results are erased every time they have to be re-run. Now we want
 to keep those logs and results, but if the results already exist, the
 pig job fails. I thought that maybe setting the results' name + date
 would solve it for me. Can I do that from pig? Do you guys have any
 other suggestions?
 Thanks in advance.
 
 
 Renato M.


elephantbird JsonLoader doesn't like gz?

2011-05-18 Thread Dexin Wang
Hi,

Anyone using Twitter's elephantbird library? I was using its JsonLoader and
got this error:

WARN  com.twitter.elephantbird.pig.load.JsonLoader - Could not json-decode
string
Unexpected character () at position 0.
at org.json.simple.parser.Yylex.yylex(Unknown Source)
at org.json.simple.parser.JSONParser.nextToken(Unknown Source)
 at org.json.simple.parser.JSONParser.parse(Unknown Source)
at org.json.simple.parser.JSONParser.parse(Unknown Source)

But if I manually gunzip the file to a clear text json file, JsonLoader
works fine.

Again this fails:

raw_json = LOAD 'cc.json.gz' USING
com.twitter.elephantbird.pig.load.JsonLoader();

this works:

$ gunzip cc.json.gz
raw_json = LOAD 'cc.json' USING
com.twitter.elephantbird.pig.load.JsonLoader();

Any suggestions for this? Or is there any other json loader library out
there? I can write my own but would rather use one if already exists.

Thanks,

Dexin


Re: elephantbird JsonLoader doesn't like gz?

2011-05-18 Thread Dexin Wang
Or is it because I'm using Pig 0.6, where the gz format is not supported? I'll
run this on AWS EMR, where only Pig 0.6 is supported. Do I have to use a later
version of Pig?

On Wed, May 18, 2011 at 11:12 AM, Dexin Wang wangde...@gmail.com wrote:

 Hi,

 Anyone using Twitter's elephantbird library? I was using its JsonLoader and
 got this error:

 WARN  com.twitter.elephantbird.pig.load.JsonLoader - Could not json-decode
 string
 Unexpected character () at position 0.
 at org.json.simple.parser.Yylex.yylex(Unknown Source)
 at org.json.simple.parser.JSONParser.nextToken(Unknown Source)
  at org.json.simple.parser.JSONParser.parse(Unknown Source)
 at org.json.simple.parser.JSONParser.parse(Unknown Source)

 But if I manually gunzip the file to a clear text json file, JsonLoader
 works fine.

 Again this fails:

 raw_json = LOAD 'cc.json.gz' USING
 com.twitter.elephantbird.pig.load.JsonLoader();

 this works:

 $ gunzip cc.json.gz
 raw_json = LOAD 'cc.json' USING
 com.twitter.elephantbird.pig.load.JsonLoader();

 Any suggestions for this? Or is there any other json loader library out
 there? I can write my own but would rather use one if already exists.

 Thanks,

 Dexin



Re: elephantbird JsonLoader doesn't like gz?

2011-05-18 Thread Dexin Wang
Turns out it's only a problem if I run it in local mode; running it on the
cluster doesn't have this problem. I'm using EB 1.2.5.

I wonder how you fixed the problem, since it seems it's not an EB problem. Or
are you gunzipping it in the EB load function?

On Wed, May 18, 2011 at 8:43 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 Which version of EB are you using? I recently fixed this for someone,
 I believe it's been in every version since 1.2.3

 D

 On Wed, May 18, 2011 at 11:26 AM, Dexin Wang wangde...@gmail.com wrote:
  Or is it because I'm using Pig 0.6 where gz format is not supported? I'll
  run this on aws EMR which only pig 0.6 is supported. I have to use later
  version of Pig?
 
  On Wed, May 18, 2011 at 11:12 AM, Dexin Wang wangde...@gmail.com
 wrote:
 
  Hi,
 
  Anyone using Twitter's elephantbird library? I was using its JsonLoader
 and
  got this error:
 
  WARN  com.twitter.elephantbird.pig.load.JsonLoader - Could not
 json-decode
  string
  Unexpected character () at position 0.
  at org.json.simple.parser.Yylex.yylex(Unknown Source)
  at org.json.simple.parser.JSONParser.nextToken(Unknown Source)
   at org.json.simple.parser.JSONParser.parse(Unknown Source)
  at org.json.simple.parser.JSONParser.parse(Unknown Source)
 
  But if I manually gunzip the file to a clear text json file, JsonLoader
  works fine.
 
  Again this fails:
 
  raw_json = LOAD 'cc.json.gz' USING
  com.twitter.elephantbird.pig.load.JsonLoader();
 
  this works:
 
  $ gunzip cc.json.gz
  raw_json = LOAD 'cc.json' USING
  com.twitter.elephantbird.pig.load.JsonLoader();
 
  Any suggestions for this? Or is there any other json loader library out
  there? I can write my own but would rather use one if already exists.
 
  Thanks,
 
  Dexin
 
 



Re: reducer throttling?

2011-03-24 Thread Dexin Wang
Thanks for your explanation Alex.

In some cases, there isn't even a reduce phase. For example, we have some raw
data that, after our custom LOAD function and some filter function, goes
directly into the DB. And since we don't have control over the number of
mappers, we end up with too many DB writers. That's why I had to add the
artificial reduce phase I mentioned earlier, so that we can throttle it down.

We could also do what someone else suggested - add a post-processing step that
writes output to HDFS and loads the DB from that. But there are other
considerations, so we'd rather not do that if we don't have to.

On Thu, Mar 17, 2011 at 2:16 PM, Alex Rovner alexrov...@gmail.com wrote:

 Dexin,

 You can control the amount of reducers by adding the following in your pig
 script:

 SET default_parallel 29;

 Pig will run with 29 reducers with the above statement.

 As far as the bulk insert goes:

 We are using MS-SQL as our database, but MySQL would be able to handle the
 bulk insert the same way.

 Essentially we are directing the output of the job into a temporary folder
 in order to know the output of this particular run. If you set the amount of
 reducers to 29, you will have 29 files in the temp folder after the job
 completes. You can then run a bulk insert SQL command on each of the
 resulting files with pointing to HDFS either through FUSE(The way we do it)
 or you can copy the resulting files to a samba share or NFS and point the
 SQL server to that location.

 In order to bulk insert you would have to either A. Do this in a post
 processing script or write your own storage func that takes care of this.
 Storage func is tricky since you will need to implement your own
 outputcommiter (See https://issues.apache.org/jira/browse/PIG-1891)

 Let me know if you have further questions.

 Alex


 On Thu, Mar 17, 2011 at 5:00 PM, Dexin Wang wangde...@gmail.com wrote:

 Can you describe a bit more about your bulk insert technique? And the way
 you control the number of reducers is also by adding artificial ORDER or
 GROUP step?

 Thanks!


 On Thu, Mar 17, 2011 at 1:33 PM, Alex Rovner alexrov...@gmail.comwrote:

 We use bulk insert technique after the job completes. You can control the
 amount of each bulk insert by controlling the amount of reducers.

 Sent from my iPhone

 On Mar 17, 2011, at 2:03 PM, Dexin Wang wangde...@gmail.com wrote:

  We do some processing in hadoop then as the last step, we write the
 result
  to database. Database is not good at handling hundreds of concurrent
  connections and fast writes. So we need to throttle down the number of
 tasks
  that writes to DB. Since we have no control on the number of mappers,
 we add
  an artificial reducer step to achieve that, either by doing GROUP or
 ORDER,
  like this:
 
  sorted_data = ORDER data BY f1 PARALLEL 10;
  -- then write sorted_data to DB
 
  or
 
  grouped_data = GROUP data BY f1 PARALLEL 10;
  data_to_write = FOREACH grouped_data GENERATE $1;
 
  I feel neither is good approach. They just add unnecessary computing
 time,
  especially the first one. And GROUP may result in too large of bags
 issue.
 
  Any better suggestions?






possibly Pig throttles the number of mappers

2011-03-23 Thread Dexin Wang
Hi,

We've seen a strange problem where some Pig jobs would just run fewer
mappers concurrently than the mapper capacity. Specifically we have a 10
node cluster and each is configured to have 12 mappers. Normally we have 120
mappers running. But for some Pig jobs it will only have 10 mappers running
(while nothing else is running), and actually appears to be 1 mapper per
node.

We have not noticed the same problem with other non-Pig Hadoop jobs. Has
anyone experienced the same thing, and do you have any explanation or remedy?

Thanks!
Dexin


reducer throttling?

2011-03-17 Thread Dexin Wang
We do some processing in Hadoop, then as the last step we write the result to
a database. The database is not good at handling hundreds of concurrent
connections and fast writes, so we need to throttle down the number of tasks
that write to the DB. Since we have no control over the number of mappers, we
add an artificial reducer step to achieve that, either by doing GROUP or
ORDER, like this:

sorted_data = ORDER data BY f1 PARALLEL 10;
-- then write sorted_data to DB

or

grouped_data = GROUP data BY f1 PARALLEL 10;
data_to_write = FOREACH grouped_data GENERATE $1;

I feel neither is a good approach. They just add unnecessary computing time,
especially the first one. And GROUP may run into the too-large-bags issue.

Any better suggestions?


Re: Pig optimization getting in the way?

2011-02-22 Thread Dexin Wang
So I can create a separate DB connection for each (jdbc_url, table) pair and
map each pair to its own connection in the record writer. Is that what you are
suggesting? Sounds like a good plan. Thanks.
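
A minimal sketch of that idea (helper name hypothetical, not part of any real
store func): keep one connection per (jdbc_url, table) key in a static map, so
two STOREs running in the same task stop sharing, and closing, each other's
connection.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.HashMap;
    import java.util.Map;

    public class ConnectionCache {
        // One connection per (jdbc_url, table) pair, shared within the task's JVM.
        private static final Map<String, Connection> CONNECTIONS =
                new HashMap<String, Connection>();

        public static synchronized Connection get(String jdbcUrl, String table)
                throws SQLException {
            String key = jdbcUrl + "#" + table;
            Connection conn = CONNECTIONS.get(key);
            if (conn == null || conn.isClosed()) {
                conn = DriverManager.getConnection(jdbcUrl);
                CONNECTIONS.put(key, conn);
            }
            return conn;
        }

        // Close only this pair's connection when its record writer finishes.
        public static synchronized void close(String jdbcUrl, String table)
                throws SQLException {
            Connection conn = CONNECTIONS.remove(jdbcUrl + "#" + table);
            if (conn != null && !conn.isClosed()) {
                conn.close();
            }
        }
    }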

On Fri, Feb 18, 2011 at 5:31 PM, Thejas M Nair te...@yahoo-inc.com wrote:

  As you are suspecting, both store functions are probably running in the
 same map or reduce task. This is a result of multi-query optimization.
 Try pig -e 'explain -script yourscript.pig' to see the query plan, and you
 will be able to verify whether the stores happen in the same map/reduce task.

 Can you make the db connection a member of the store function/record
 writer?
 You can also use -no_multiquery to prevent multi-query optimization from
 happening, but that will also result in the MR job being executed again for
 the other output.

 Thanks,
 Thejas




 On 2/18/11 4:48 PM, Dexin Wang wangde...@gmail.com wrote:

 I hope that's the case. But

  *mapred.job.reuse.jvm.num.tasks* 1
 However it does seem to be doing the write to two DB tables in the same job
 so although it's not re-using jvm, it is already in one jvm since it's the
 same task!

 And since the DB connection is static/singleton as you mentioned, and table
 name (which is the only thing that's different) is not part of connection
 URL, they share the same DB connection, and one of them will close the
 connection when it's done.

 Hmm, any suggestions how we can handle this? Thanks.

 On Fri, Feb 18, 2011 at 3:38 PM, Dmitriy Ryaboy dvrya...@gmail.com
 wrote:

  Let me guess -- you have a static JDBC connection that you open in
 myJDBC,
  and you have jvm reuse turned on.
 
  On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang wangde...@gmail.com wrote:
 
   I ran into a problem that I have spent quite some time on and start to
   think
   it's probably pig's doing something optimization that makes this thing
   hard.
  
   This is my pseudo code:
  
   raw = LOAD ...
  
   then some crazy stuff like
   filter
   join
   group
   UDF
   etc
  
   A = the result from above operation
   STORE A INTO 'dummy' USING myJDBC(write to table1);
  
   This works fine and I have 4 map-red jobs.
  
   Then I add this after that:
  
   B = FILTER A BY col1=xyz;
   STORE B INTO 'dummy2' USING myJDBC(write to table2);
  
   basically I do some filtering of A and write it to another table thru
  JDBC.
  
   Then I had the problem of jobs failing and saying PSQLException: This
   statement has been closed.
  
   My workaround now is to add EXEC; before B line and make them write
 to
  DB
   in sequence. This works but now it would run the same map-red jobs
 twice
  -
   I
   ended up with 8 jobs.
  
   I think the reason for the failure without EXEC line is because pig
 tries
   to
   do the two STORE in the same reducer (or mapper maybe) since B only
   involves
   FILTER which doesn't require a separate map-red job and then got
  confused.
  
   Is there a way for this to work without having to duplicate the jobs?
   Thanks
   a lot!
  
 





Pig optimization getting in the way?

2011-02-18 Thread Dexin Wang
I ran into a problem that I have spent quite some time on, and I'm starting to
think it's probably Pig doing some optimization that makes this hard.

This is my pseudo code:

raw = LOAD ...

then some crazy stuff like
filter
join
group
UDF
etc

A = the result from above operation
STORE A INTO 'dummy' USING myJDBC(write to table1);

This works fine and I have 4 map-red jobs.

Then I add this after that:

B = FILTER A BY col1=xyz;
STORE B INTO 'dummy2' USING myJDBC(write to table2);

basically I do some filtering of A and write it to another table thru JDBC.

Then I had the problem of jobs failing and saying PSQLException: This
statement has been closed.

My workaround now is to add EXEC; before the B line and make them write to the
DB in sequence. This works, but now it runs the same map-red jobs twice - I
ended up with 8 jobs.

I think the reason for the failure without the EXEC line is that Pig tries to
do the two STOREs in the same reducer (or mapper, maybe), since B only involves
a FILTER, which doesn't require a separate map-red job, and then gets confused.

Is there a way for this to work without having to duplicate the jobs? Thanks
a lot!


Re: Pig optimization getting in the way?

2011-02-18 Thread Dexin Wang
I hope that's the case. But

 *mapred.job.reuse.jvm.num.tasks* is set to 1.

However, it does seem to be doing the writes to the two DB tables in the same
job, so although it's not re-using the JVM, they are already in one JVM since
it's the same task!

And since the DB connection is static/singleton as you mentioned, and table
name (which is the only thing that's different) is not part of connection
URL, they share the same DB connection, and one of them will close the
connection when it's done.

Hmm, any suggestions how we can handle this? Thanks.

On Fri, Feb 18, 2011 at 3:38 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 Let me guess -- you have a static JDBC connection that you open in myJDBC,
 and you have jvm reuse turned on.

 On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang wangde...@gmail.com wrote:

  I ran into a problem that I have spent quite some time on and start to
  think
  it's probably pig's doing something optimization that makes this thing
  hard.
 
  This is my pseudo code:
 
  raw = LOAD ...
 
  then some crazy stuff like
  filter
  join
  group
  UDF
  etc
 
  A = the result from above operation
  STORE A INTO 'dummy' USING myJDBC(write to table1);
 
  This works fine and I have 4 map-red jobs.
 
  Then I add this after that:
 
  B = FILTER A BY col1=xyz;
  STORE B INTO 'dummy2' USING myJDBC(write to table2);
 
  basically I do some filtering of A and write it to another table thru
 JDBC.
 
  Then I had the problem of jobs failing and saying PSQLException: This
  statement has been closed.
 
  My workaround now is to add EXEC; before B line and make them write to
 DB
  in sequence. This works but now it would run the same map-red jobs twice
 -
  I
  ended up with 8 jobs.
 
  I think the reason for the failure without EXEC line is because pig tries
  to
  do the two STORE in the same reducer (or mapper maybe) since B only
  involves
  FILTER which doesn't require a separate map-red job and then got
 confused.
 
  Is there a way for this to work without having to duplicate the jobs?
  Thanks
  a lot!
 



Re: Use Filename in Tuple

2011-02-03 Thread Dexin Wang
Similarly, is it possible to insert some literal values into a tuple stream?

For example, when I invoke my Pig script, I already know what the data source
is (say, it's from filename_2011-02-03), so I can just pass it to Pig using
-param, and I want to insert this known file name into the tuple stream. How
can I do that?

Example, I have:

grunt> A = LOAD 'aa' AS (f1, f2);
grunt> DUMP A;
(aa,bb)
(cc,dd)

I want to do something like:

grunt> B = FOREACH A GENERATE f1, filename-2011-02-03;

Thanks.
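
On the original filename question, a rough sketch of the approach Dmitriy
describes below, against the Pig 0.8-era LoadFunc API (class name
hypothetical; as he notes, split combination must be turned off):

    import java.io.IOException;

    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    import org.apache.pig.builtin.PigStorage;
    import org.apache.pig.data.Tuple;

    public class PigStorageWithInputPath extends PigStorage {
        private String path;

        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) {
            super.prepareToRead(reader, split);
            // Reach through the wrapped split to the underlying file.
            path = ((FileSplit) split.getWrappedSplit()).getPath().toString();
        }

        @Override
        public Tuple getNext() throws IOException {
            Tuple t = super.getNext();
            if (t != null) {
                t.append(path);     // tack the source file name onto every tuple
            }
            return t;
        }
    }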

On Thu, Feb 3, 2011 at 7:49 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 In pig 6, you can hook into bindTo() and save the file name.

 In pig 8 you have to find your way to the underlying InputSplit via
 PigSplit.getWrappedSplit(), cast it as FileSplit, and call getPath()
 on it.. I think. Haven't done this.

 This will totally break if you have splitCombination turned on, of
 course, as pig can silently move to a different file under you, so
 you'd have to turn that off.

 D

 On Thu, Feb 3, 2011 at 3:52 PM, Kim Vogt k...@simplegeo.com wrote:
  Hey,
 
  I have a bunch of files where the filename is significant.  I'm loading
 the
  files by supplying the top level directory that contains the files.  Is
  there a way to capture the filename of the file and append to the tuple
 of
  data that's in that file?
 
  -Kim
 



Re: Use Filename in Tuple

2011-02-03 Thread Dexin Wang
wow, I almost got it right. Double quote, fails. Single quote, works.

Thanks.

On Thu, Feb 3, 2011 at 9:40 PM, Kim Vogt k...@simplegeo.com wrote:

 This should work:

 grunt B = FOREACH A GENERATE f1, 'filename-2011-02-03';

 or

 grunt B = FOREACH A GENERATE f1, '$paramName';

 -Kim

 On Thu, Feb 3, 2011 at 8:32 PM, Dexin Wang wangde...@gmail.com wrote:

  Similarly, is it possible to insert some literal values to a tuple
 stream?
 
  For example, when I invoke my Pig script, I already know what data source
  is
  (say, it's from filename_2011-02-03), so I can just pass it to Pig using
  -param, and I want to insert this known file name to the tuple stream.
 How
  can I do that?
 
  Example, I have:
 
  grunt A = LOAD 'aa' AS (f1, f2);
  grunt DUMP A;
  (aa,bb)
  (cc,dd)
 
  I want to do something like:
 
  grunt B = FOREACH A GENERATE f1, filename-2011-02-03;
 
  Thanks.
 
  On Thu, Feb 3, 2011 at 7:49 PM, Dmitriy Ryaboy dvrya...@gmail.com
 wrote:
 
   In pig 6, you can hook into bindTo() and save the file name.
  
   In pig 8 you have to find your way to the underlying InputSplit via
   PigSplit.getWrappedSplit(), cast it as FileSplit, and call getPath()
   on it.. I think. Haven't done this.
  
   This will totally break if you have splitCombination turned on, of
   course, as pig can silently move to a different file under you, so
   you'd have to turn that off.
  
   D
  
   On Thu, Feb 3, 2011 at 3:52 PM, Kim Vogt k...@simplegeo.com wrote:
Hey,
   
I have a bunch of files where the filename is significant.  I'm
 loading
   the
files by supplying the top level directory that contains the files.
  Is
there a way to capture the filename of the file and append to the
 tuple
   of
data that's in that file?
   
-Kim
   
  
 



Re: failed to produce result

2011-01-31 Thread Dexin Wang
Thanks. That URL doesn't tell much.
*
*
*Job Name:* Job387546913066708402.jar
*Job File:*
 hdfs://hadoop-name01/hadoop/mapred/system/job_201101260357_3230/job.xml
*Job Setup:*None
*Status:* Failed
*Started at:* Mon Jan 31 15:48:36 CST 2011
*Failed at:* Mon Jan 31 15:48:36 CST 2011
*Failed in:* 0sec
*Job Cleanup:*None
--
Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
map     100.00%     0          0        0        0         0       0 / 0
reduce  100.00%     0          0        0        0         0       0 / 0

Counter  Map  Reduce  Total
--
Map Completion Graph

On Mon, Jan 31, 2011 at 2:37 PM, Thejas M Nair te...@yahoo-inc.com wrote:

  The logs say that the map-reduce job failed. Can you check the log files
 of the failed map-reduce tasks ?
 You can follow the jobtracker url in the log message - *
 http://hadoop-name02:50030/jobdetails.jsp?jobid=job_201101260357_3230
 *
 -Thejas



 On 1/31/11 1:54 PM, Dexin Wang wangde...@gmail.com wrote:

 Hi,

I found similar problems on the web but didn't find a solution, so I'm asking
here.

I have a pig job that has been working fine for a couple of months and it
started failing. But the same job still works if run as another account. I
narrowed it down a bit and found that the problematic user account can't even
do a simple DUMP.
 a simple DUMP.


 --
 grunt> A = LOAD '/user/myuser1/aa' AS (f1, f2);
 grunt> DESCRIBE A;
 A: {f1: bytearray,f2: bytearray}
 grunt> DUMP A;
 2011-01-31 15:48:34,141 [main] INFO
  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name:

 Store(hdfs://hadoop-name01/tmp/temp811847645/tmp1546738024:org.apache.pig.builtin.BinStorage)
 - 1-10 Operator Key: 1-10)
 2011-01-31 15:48:34,142 [main] INFO

  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - MR plan size before optimization: 1
 2011-01-31 15:48:34,142 [main] INFO

  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
 - MR plan size after optimization: 1
 2011-01-31 15:48:34,153 [main] INFO

  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
 - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
 2011-01-31 15:48:35,562 [main] INFO

  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
 - Setting up single store job
 2011-01-31 15:48:35,574 [main] INFO

  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
 - 1 map-reduce job(s) waiting for submission.
 2011-01-31 15:48:35,578 [Thread-23] WARN
  org.apache.hadoop.mapred.JobClient
 - Use GenericOptionsParser for parsing the arguments. Applications should
 implement Tool for the same.
 2011-01-31 15:48:36,077 [main] INFO

  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
 - 0% complete
 2011-01-31 15:48:36,176 [Thread-23] INFO
  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
 to process : 1
 2011-01-31 15:48:36,176 [Thread-23] INFO
  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
 input
 paths to process : 1
 2011-01-31 15:48:37,067 [main] INFO

  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
 - HadoopJobId: job_201101260357_3230
 2011-01-31 15:48:37,067 [main] INFO

  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
 - More information at:
 http://hadoop-name02:50030/jobdetails.jsp?jobid=job_201101260357_3230
 2011-01-31 15:48:41,605 [main] INFO

  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
 - 100% complete
 2011-01-31 15:48:41,605 [main] ERROR

 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
 - 1 map reduce job(s) failed!
 2011-01-31 15:48:41,607 [main]* ERROR

 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
 - Failed to produce result in:
 hdfs://hadoop-namenode01/tmp/temp811847645/tmp1546738024*
 2011-01-31 15:48:41,607 [main] INFO

  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
 - Failed!
 2011-01-31 15:48:41,619 [main] ERROR org.apache.pig.tools.grunt.Grunt -
 ERROR 1066: Unable to open iterator for alias A
 --

 Running the same LOAD and DUMP works fine with another user account. We also
 confirmed there is no disk space or quota issue on the namenode. Any ideas?

 This is similar to this issue reported here:

 http://web.archiveorange.com/archive/v/3inw3wuad4S3zjAz89y5





wild card for all fields in a tuple

2011-01-12 Thread Dexin Wang
Hi,

I hope there is a simple answer to this. I have a bunch of rows, and for each
row I want to add a column which is derived from some existing columns. I have
a large number of columns in my input tuple, so I don't want to repeat the
names using AS when I generate. Is there an easy way to just append a column
to the tuples without having to touch the rest of the tuple in the output?

Here's my example:

grunt> DESCRIBE X;
X: {id: chararray,v1: int,v2: int}

grunt> DUMP X;
(a,3,42)
(b,2,4)
(c,7,32)

I can do this:
grunt> Y = FOREACH X GENERATE (v2 - v1) as diff, id, v1, v2;
grunt> DUMP Y;
(39,a,3,42)
(2,b,2,4)
(25,c,7,32)

But I would prefer not to have to list all the v's. I may have v1, v2, v3,
..., v100.

Of course this doesn't work

grunt> Y = FOREACH X GENERATE (v2 - v1) as diff, FLATTEN(X);

What can be done to simplify this? And a related question: what is the schema
after the FOREACH? I wish I could do a DESCRIBE right after the FOREACH.

Thanks !!


Re: wild card for all fields in a tuple

2011-01-12 Thread Dexin Wang
Yeah, that works great. Thanks Jonathan and Alan. I can see that the "all
fields in between" feature will be totally useful for some cases.

On Wed, Jan 12, 2011 at 3:33 PM, Alan Gates ga...@yahoo-inc.com wrote:

 Jonathan is right, you can do all fields in a tuple with *.  I was thinking
 of doing all fields in between two fields, which you can't do yet.

 Alan.


 On Jan 12, 2011, at 3:18 PM, Alan Gates wrote:

  There isn't a way to do that yet.  See
 https://issues.apache.org/jira/browse/PIG-1693
  for our plans on adding it in the next release.

 Alan.

 On Jan 12, 2011, at 2:51 PM, Dexin Wang wrote:

  Hi,

 Hope there is some simple answer to this. I have bunch of rows, for
 each
 row, I want to add a column which is derived from some existing
 columns. And
 I have large number of columns in my input tuple so I don't want to
 repeat
 the name using AS when I generate. Is there an easy way just to
 append a
 column to tuples without having to touch the tuple itself on the
 output.

 Here's my example:

 grunt DESCRIBE X;
 X: {id: chararray,v1: int,v2: int}

 grunt DUMP X;
 (a,3,42)
 (b,2,4)
 (c,7,32)

 I can do this:
 grunt Y = FOREACH X GENERATE (v2 - v1) as diff, id, v1, v2;
 grunt DUMP Y;
 (39,a,3,42)
 (2,b,2,4)
 (25,c,7,32)

 But I would prefer not to have to list all the v's. I may have v1,
 v2, v3,
 ..., v100.

 Of course this doesn't work

 grunt Y = FOREACH X GENERATE (v2 - v1) as diff, FLATTEN(X);

 What can be done to simplify this? And related question, what is the
 schema
 after the FOREACH, I wish I could do a DESCRIBE after FOREACH.

 Thanks !!






how to use builtin String functions

2011-01-12 Thread Dexin Wang
I see there are some builtin string functions, but I don't know how to use
them. I get this error when I follow the examples:

grunt> REGEX_EXTRACT_ALL('192.168.1.5:8020', '(.*)\:(.*)');
2011-01-12 19:34:23,773 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000: Error during parsing. Encountered  IDENTIFIER
REGEX_EXTRACT_ALL  at line 1, column 1.

Thanks.


Re: set reducer timeout with pig

2010-12-22 Thread Dexin Wang
It doesn't seem to work. I got

2010-12-22 21:36:59,120 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000: Error during parsing. Unrecognized set key: mapred.task.timeout

I did this in my pig script:

SET mapred.task.timeout 180;

What did I do wrong? Thanks.

% pig --version
Apache Pig version 0.7.0+9 (rexported)
compiled Jun 28 2010, 12:53:50

On Tue, Dec 21, 2010 at 2:39 PM, Daniel Dai jiany...@yahoo-inc.com wrote:

 True, however there is one bug in 0.7. We fix it in 0.8.

 https://issues.apache.org/jira/browse/PIG-1760

 Daniel



 Ashutosh Chauhan wrote:

 Ideally you need not to do that. Pig automatically takes care of
 progress reporting in its operator. Do you have a pig script which
 fails because of reporting progress timeout issues ?

 Ashutosh

 On Tue, Dec 21, 2010 at 13:23, Dexin Wang wangde...@gmail.com wrote:


 Hi,

 How do I change the default timeout for reducers with Pig? I have some
 reducers that need to take longer than 10 minutes to finish. It is pretty
 frustrating to see many of them get to 95% complete and then get killed.

 Thanks.
 Dexin







increment counters in Pig UDF

2010-12-15 Thread Dexin Wang
Is it possible to increment a counter in a Pig UDF (in either a
Load/Eval/Store Func)?

Since we have access to counters using the
org.apache.hadoop.mapred.Reporter:

http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Counters

the other way to ask this question is how do we get an instance of Reporter
in UDF? Thanks.

Dexin
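
One way that is often suggested is PigStatusReporter, assuming Pig 0.8+ where
org.apache.pig.tools.pigstats.PigStatusReporter is available. A minimal sketch
(group/counter names hypothetical; guard for null because the reporter may not
be wired to a Hadoop context yet):

    import java.io.IOException;

    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.tools.pigstats.PigStatusReporter;

    public class CountingUDF extends EvalFunc<String> {

        private void incr(String group, String name) {
            PigStatusReporter reporter = PigStatusReporter.getInstance();
            if (reporter != null) {
                Counter counter = reporter.getCounter(group, name);
                if (counter != null) {
                    counter.increment(1L);
                }
            }
        }

        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                incr("MyUDF", "BadRecords");
                return null;
            }
            incr("MyUDF", "GoodRecords");
            return input.get(0).toString().toUpperCase();
        }
    }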


Eval UDF passing parameters

2010-12-07 Thread Dexin Wang
Hi,

This might be a dumb question. Is it possible to pass anything other than
the input tuple to a UDF Eval function?

Basically in my UDF, I need to do some user info lookup. So the input will
be:

(userid,f1,f2)

with this UDF, I want to convert it to something like

(userid,age,gender,location,f1,f2)

where in the UDF I do a DB lookup on the userid and return the user's info
(age, gender, etc.). But I don't necessarily want to pass back the same user
info fields every time, e.g. sometimes I only want age.

I hope there is a way for me to tell the UDF that I only want age, and
sometimes age, location, etc.

What's the best way to achieve this without having to write a separate UDF
for every case?

Thanks.
Dexin


Re: Eval UDF passing parameters

2010-12-07 Thread Dexin Wang
ah nice. Thank you so much Zach!
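
Filling in Zach's outline below as a compilable sketch. One caveat: arguments
in a DEFINE statement reach the UDF constructor as Strings, so take String
parameters and parse them. The lookup helpers here are placeholders for the
real DB calls (class and method names hypothetical):

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class UserInfoUDF extends EvalFunc<String> {
        private final boolean includeAge;
        private final boolean includeGender;

        // DEFINE MY_UDF_ONLY_AGE com.package.UserInfoUDF('true', 'false');
        public UserInfoUDF(String includeAge, String includeGender) {
            this.includeAge = Boolean.parseBoolean(includeAge);
            this.includeGender = Boolean.parseBoolean(includeGender);
        }

        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                return null;
            }
            String userId = (String) input.get(0);
            StringBuilder out = new StringBuilder();
            if (includeAge) {
                out.append(lookupAge(userId));        // placeholder DB lookup
            }
            if (includeGender) {
                if (out.length() > 0) out.append(",");
                out.append(lookupGender(userId));     // placeholder DB lookup
            }
            return out.toString();
        }

        private String lookupAge(String userId) { return "age-of-" + userId; }
        private String lookupGender(String userId) { return "gender-of-" + userId; }
    }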

On Tue, Dec 7, 2010 at 11:47 AM, Zach Bailey zach.bai...@dataclip.comwrote:


  You can pass parameters via the UDF constructor. For example:


 public MyUDF(boolean includeAge, boolean includeGender)


 then you would initialize it like so in your pig script:


 define MY_UDF_ONLY_AGE com.package.MyUDF(true, false)


 and use it like:


 data_with_age = FOREACH data GENERATE user_id, MY_UDF_ONLY_AGE(user_id);


 HTH,
 Zach


 On Tuesday, December 7, 2010 at 2:44 PM, Dexin Wang wrote:

  Hi,
 
  This might be a dumb question. Is it possible to pass anything other than
  the input tuple to a UDF Eval function?
 
  Basically in my UDF, I need to do some user info lookup. So the input
 will
  be:
 
  (userid,f1,f2)
 
  with this UDF, I want to convert it to something like
 
  (userid,age,gender,location,f1,f2)
 
  where in the UDF I do a DB lookup on the userid and returns user's info
  (age, gender, etc). But I don't necessarily want to pass back the same
 user
  info fields, e.g. sometimes I only want age.
 
  I hope there is a way for me to tell the UDF that I only want age, and
  sometimes age, location, etc.
 
  What's the best way to achieve this without having to write a separate
 UDF
  for every case?
 
  Thanks.
  Dexin
 
 
 
 





pass configuration param to UDF

2010-11-23 Thread Dexin Wang
Hi all,

I was reading this:

http://pig.apache.org/docs/r0.7.0/udf.html#Passing+Configurations+to+UDFs

It sounds like I can pass some configuration or context to the UDF, but I
can't figure out how to do that, even after searching quite a bit on the
internet and in past discussions.

In my UDF, I can also do this:

UDFContext context = UDFContext.getUDFContext();
Properties properties =
context.getUDFProperties(this.getClass());

So if the context is set on the front end, supposedly it will be in that
properties object. But how do I set it on the front end, or in whatever other
way I'm supposed to pass it to the UDF?

Thanks!
Dexin
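
A sketch of the pattern that documentation page appears to describe, as I
understand it: a front-end-only method such as outputSchema() stores values
into the UDFContext properties, Pig serializes the UDFContext into the job
configuration, and exec() reads them back on the backend. Class, key, and
parameter names are all hypothetical.

    import java.io.IOException;
    import java.util.Properties;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataType;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.logicalLayer.schema.Schema;
    import org.apache.pig.impl.util.UDFContext;

    public class ConfiguredUDF extends EvalFunc<String> {
        private static final String KEY = "myudf.lookup.value";
        private String value;

        @Override
        public Schema outputSchema(Schema input) {
            // Runs on the front end: stash whatever the backend will need.
            Properties props =
                    UDFContext.getUDFContext().getUDFProperties(this.getClass());
            props.setProperty(KEY, System.getProperty("myudf.param", "default-value"));
            return new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY));
        }

        @Override
        public String exec(Tuple input) throws IOException {
            if (value == null) {
                // Runs on the backend: the property set above travels with the job.
                Properties props =
                        UDFContext.getUDFContext().getUDFProperties(this.getClass());
                value = props.getProperty(KEY, "default-value");
            }
            Object field = (input == null || input.size() == 0) ? "" : input.get(0);
            return value + ":" + field;
        }
    }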