replicated join gets extra job
Hi, I'm running a job like this:

  raw_large = LOAD 'lots_of_files' AS (...);
  raw_filtered = FILTER raw_large BY ...;
  large_table = FOREACH raw_filtered GENERATE f1, f2, f3;
  joined_1 = JOIN large_table BY (key1) LEFT, config_table_1 BY (key2) USING 'replicated';
  joined_2 = JOIN joined_1 BY (key3) LEFT, config_table_2 BY (key4) USING 'replicated';
  joined_3 = JOIN joined_2 BY (key5) LEFT, config_table_3 BY (key6) USING 'replicated';
  joined_4 = JOIN joined_3 BY (key7) LEFT, config_table_4 BY (key8) USING 'replicated';

Basically I left-join a large table against 4 relatively small tables using replicated joins. I see a first job with 120 map tasks and no reducers, which seems to do the load and filtering, and then a second job with 26 map tasks that seems to do the joins. Shouldn't there be only one job, with the joins done in the map phase of the first job? The 4 config tables (files) are 3 MB, 220 kB, 2 kB, and 100 kB respectively. This is running on AWS EMR with Pig 0.9.2 on xlarge instances, which have 15 GB of memory. Thanks!
Re: python version with Jython/Pig
Thanks. Instead, I found a Python implementation of the erf function, so that'll be good for now. http://stackoverflow.com/questions/457408/is-there-an-easily-available-implementation-of-erf-for-python On Wed, Jul 17, 2013 at 5:08 PM, Cheolsoo Park piaozhe...@gmail.com wrote: Hi Dexin, Unfortunately, Pig is on Jython 2.5, so you won't be able to use Python 2.7 modules. A while back, someone posted a hack to get Jython 2.7-b1 working with Pig. You might give it a try: http://search-hadoop.com/m/BnZs3MmH5y/jython+2.7subj=informational+getting+jython+2+7+b1+to+work Thanks, Cheolsoo On Wed, Jul 17, 2013 at 3:33 PM, Dexin Wang wangde...@gmail.com wrote: When I do Python UDF with Pig, how do we know which version of Python it is using? Is it possible to use a specific version of Python? Specifically my problem is in my UDF, I need to use a function in math module math.erf() which is newly introduced in Python version 2.7. I have Python 2.7 installed on my machine and standalone Python program runs fine but when I run it in Pig as Python UDF, I got this: AttributeError: type object 'org.python.modules.math' has no attribute 'erf' My guess is Jython is using some pre-2.7 version of Python? Thanks for your help! Dexin
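The pure-Python erf implementations in that Stack Overflow thread are typically the Abramowitz & Stegun 7.1.26 rational approximation (max absolute error around 1.5e-7), which only needs math.exp and so runs fine under Jython 2.5. A sketch of that approach:

```python
import math

def erf(x):
    # Abramowitz & Stegun formula 7.1.26; max absolute error ~1.5e-7
    sign = 1 if x >= 0 else -1
    x = abs(x)
    t = 1.0 / (1.0 + 0.3275911 * x)
    # Horner evaluation of the 5-term polynomial in t
    poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
             - 0.284496736) * t + 0.254829592) * t
    return sign * (1.0 - poly * math.exp(-x * x))
```

Accuracy is plenty for most statistics work, and the function is odd by construction (erf(-x) == -erf(x)).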
python version with Jython/Pig
When I do Python UDF with Pig, how do we know which version of Python it is using? Is it possible to use a specific version of Python? Specifically my problem is in my UDF, I need to use a function in math module math.erf() which is newly introduced in Python version 2.7. I have Python 2.7 installed on my machine and standalone Python program runs fine but when I run it in Pig as Python UDF, I got this: AttributeError: type object 'org.python.modules.math' has no attribute 'erf' My guess is Jython is using some pre-2.7 version of Python? Thanks for your help! Dexin
Re: reference tuple field by name in UDF
Thanks. That'll be nice. Unfortunately we are using EMR, which only has 0.9, so that's not an option for us. Similar question for Python UDFs: in my Python UDF, is referencing fields by index (instead of by alias) the only option I have? On Tue, Jan 15, 2013 at 2:20 PM, Jonathan Coveney jcove...@gmail.com wrote: Another way to do it would be to make a helper function that does the following: input.get(getInputSchema().getPosition(alias)); Only available in 0.10 and later (I think getInputSchema is in 0.10, at least... may only be in 0.11) 2013/1/15 Dexin Wang wangde...@gmail.com Hi, In my own UDF, is referencing a field by index the only way to access a field? The fields are all named and typed before being passed into the UDF, but it looks like I can only do something like this:

  String v1 = (String)input.get(0);
  String v2 = (String)input.get(1);
  String v3 = (String)input.get(2);

Instead I'd like to do something like this:

  String v1 = (String)input.get("f1");
  String v2 = (String)input.get("f2");
  String v3 = (String)input.get("f3");

since I have lots of fields and I don't want to tie myself to the positioning of the fields. Any alternative? Thanks. Dexin
reference tuple field by name in UDF
Hi, In my own UDF, is referencing a field by index the only way to access a field? The fields are all named and typed before being passed into the UDF, but it looks like I can only do something like this:

  String v1 = (String)input.get(0);
  String v2 = (String)input.get(1);
  String v3 = (String)input.get(2);

Instead I'd like to do something like this:

  String v1 = (String)input.get("f1");
  String v2 = (String)input.get("f2");
  String v3 = (String)input.get("f3");

since I have lots of fields and I don't want to tie myself to the positioning of the fields. Any alternative? Thanks. Dexin
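For the Python-UDF version of this question, one common workaround on 0.9 (which lacks getInputSchema) is to keep the alias-to-position mapping in a single constant so the magic indices live in one place. A Jython-flavored sketch; FIELDS, get, and my_udf are hypothetical helpers, not a Pig API:

```python
# One place to maintain field positions; in a Jython UDF the input
# tuple arrives as a Python list, so plain indexing works.
FIELDS = {'f1': 0, 'f2': 1, 'f3': 2}

def get(t, name):
    # look a field up by its alias instead of a hard-coded index
    return t[FIELDS[name]]

def my_udf(t):
    # example use: combine two named fields
    return '%s:%s' % (get(t, 'f1'), get(t, 'f3'))
```

If the schema changes, only the FIELDS dict needs updating, not every call site.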
Re: Passing a BAG to Pig UDF constructor?
Actually, how do you pass a bag to a UDF? I did this:

  a = LOAD 'file_a' AS (a1, a2, a3);
  bag1 = LOAD 'somefile' AS (f1, f2, f3);
  b = FOREACH a GENERATE myUDF(bag1, a1, a2);

But I got this error: Invalid scalar projection: bag1 : A column needs to be projected from a relation for it to be used as a scalar. What is the right way of doing this? Thanks. On Wed, Jun 27, 2012 at 10:30 AM, Dexin Wang wangde...@gmail.com wrote: That's a good idea (to pass the bag to the UDF and initialize it on the first UDF invocation). Thanks. Why do you think it is expensive, Mridul? On Tue, Jun 26, 2012 at 2:50 PM, Mridul Muralidharan mrid...@yahoo-inc.com wrote: -----Original Message----- From: Jonathan Coveney [mailto:jcove...@gmail.com] Sent: Wednesday, June 27, 2012 3:12 AM To: user@pig.apache.org Subject: Re: Passing a BAG to Pig UDF constructor? You can also just pass the bag to the UDF, and have a lazy initializer in exec that loads the bag into memory. Can you elaborate on what you mean by passing the bag to the UDF? Pass it as part of the input to the UDF in exec and initialize it only once (the first time)? (If yes, this is expensive.) Or something else? Regards, Mridul 2012/6/26 Mridul Muralidharan mrid...@yahoo-inc.com You could dump the data to a dfs file and pass the location of the file as a param to your UDF in DEFINE - so that it initializes itself using that data... - Mridul -----Original Message----- From: Dexin Wang [mailto:wangde...@gmail.com] Sent: Tuesday, June 26, 2012 10:58 PM To: user@pig.apache.org Subject: Passing a BAG to Pig UDF constructor? Is it possible to pass a bag to a Pig UDF constructor? Basically in the constructor I want to initialize some hash map so that on every exec call I can use the hash map to do a lookup, find the value I need, and apply some algorithm to it. I realize I could just do a replicated join to achieve something similar, but the algorithm is more than a few lines and there are some edge cases, so I would rather wrap that logic inside a UDF.
I also realize I could just pass a file path to the constructor and read the files to initialize the hashmap but my files are on Amazon's S3 and I don't want to deal with S3 API to read the file. Is this possible or is there some alternative ways to achieve the same thing? Thanks. Dexin
Passing a BAG to Pig UDF constructor?
Is it possible to pass a bag to a Pig UDF constructor? Basically in the constructor I want to initialize some hash map so that on every exec operation, I can use the hashmap to do a lookup and find the value I need, and apply some algorithm to it. I realize I could just do a replicated join to achieve similar things but the algorithm is more than a few lines and there are some edge cases so I would rather wrap that logic inside a UDF function. I also realize I could just pass a file path to the constructor and read the files to initialize the hashmap but my files are on Amazon's S3 and I don't want to deal with S3 API to read the file. Is this possible or is there some alternative ways to achieve the same thing? Thanks. Dexin
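The "pass the bag and initialize on first invocation" idea suggested in the replies looks roughly like this as a Python UDF sketch; my_udf, the three-field config layout, and the module-level cache are all made up for illustration:

```python
# Cache built lazily on the first call, then reused for every later call,
# so the bag is only scanned once per UDF instance.
_LOOKUP = None

def my_udf(config_bag, key):
    global _LOOKUP
    if _LOOKUP is None:
        # first invocation: turn the bag of (f1, f2, f3) tuples into a hash map
        _LOOKUP = dict((f1, (f2, f3)) for f1, f2, f3 in config_bag)
    return _LOOKUP.get(key)
```

This is the trade-off Mridul raises: the bag still travels with every input record, but the expensive map construction happens only once.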
Re: help!!--Does Pig can be use in this way?!
Or if it's simple like that, why not just grep? On Wed, Apr 4, 2012 at 7:07 AM, Corbin Hobus cor...@tynt.com wrote: If you are just finding the age of one person, you are much better off using a regular database and SQL, or HBase if you need some kind of quick random access. Hadoop/Pig is for batch calculations where you need to sift through lots of data. Sent from my iPhone On Apr 4, 2012, at 6:35 AM, 凯氏图腾 t...@tengs.info wrote: Hello! I wonder if Pig can be used in this way: I have many records, each containing a name and an age, and I want to find out one person's age. I know it is easy to do in Pig Latin, but I wonder whether it is a good idea to select one record this way, or whether it would be better done in SQL? Thanks!!
Re: filter out null lines returned by UDF
Yeah, that works great. Thank you, Jonathan. On Thu, Mar 1, 2012 at 5:14 PM, Jonathan Coveney jcove...@gmail.com wrote: FLATTEN is kind of quirky. If you FLATTEN(null), it will return null, but if you FLATTEN a bag that is empty (i.e. size = 0), it will throw away the row. I would have your UDF return an empty bag and let the FLATTEN wipe it out. 2012/3/1 Dexin Wang wangde...@gmail.com Hi, I have a UDF that parses a line and returns a bag, and sometimes the line is bad, so I'm returning null in the UDF. In my pig script, I'd like to filter those nulls like this: raw = LOAD 'raw_input' AS (line:chararray); parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line)); -- get two fields in the tuple: id and name DUMP parsed; (id1,name1) (id2,name2) () (id3,name3) parsed_no_nulls = FILTER parsed BY id IS NOT NULL; DUMP parsed_no_nulls; (id1,name1) (id2,name2) (id3,name3) This works, but I'm getting this warning: WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input. When I try to use IsEmpty to filter, I get this error: Cannot test a NULL for emptiness. What's the correct way to filter out these null bags returned from my UDF? Thanks. Dexin
filter out null lines returned by UDF
Hi, I have a UDF that parses a line and returns a bag, and sometimes the line is bad, so I'm returning null in the UDF. In my pig script, I'd like to filter those nulls like this:

  raw = LOAD 'raw_input' AS (line:chararray);
  parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line)); -- get two fields in the tuple: id and name
  DUMP parsed;
  (id1,name1)
  (id2,name2)
  ()
  (id3,name3)
  parsed_no_nulls = FILTER parsed BY id IS NOT NULL;
  DUMP parsed_no_nulls;
  (id1,name1)
  (id2,name2)
  (id3,name3)

This works, but I'm getting this warning: WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject: Attempt to access field which was not found in the input. When I try to use IsEmpty to filter, I get this error: Cannot test a NULL for emptiness. What's the correct way to filter out these null bags returned from my UDF? Thanks. Dexin
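The fix suggested in the reply above (return an empty bag rather than null, so FLATTEN silently drops the row) can be sketched as a Python UDF; parse_line and the two-field comma format are hypothetical stand-ins for MyUDF:

```python
def parse_line(line):
    # Returning an empty bag ([]) instead of None lets FLATTEN throw the
    # bad row away, rather than emitting a null that must be filtered later.
    parts = line.split(',')
    if len(parts) != 2:
        return []          # bad line: empty bag, row disappears after FLATTEN
    return [tuple(parts)]  # good line: bag with one (id, name) tuple
```

With this shape there is no null to FILTER out and no POProject warning, because every row either flattens to a real tuple or vanishes.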
Re: pig 0.9 slower in local mode?
Cool. Looking forward to trying out 0.9.2 + patch. On Mon, Dec 19, 2011 at 2:31 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Yes, starting with 0.7, local mode was moved to sit on top of Hadoop's local mode (instead of being a completely separate implementation). This had two effects, one desirable, one not so much: 1) get rid of surprises/bugs caused by having to support 2 completely separate implementations of the runtime; 2) slow the thing down a lot. Julien sped up local mode quite a bit in 0.9.2, and we have another patch coming that further removes unnecessary slowness, but I'm afraid Hadoop-based local mode will never be quite as fast as the old local mode... D On Mon, Dec 19, 2011 at 2:23 PM, Dexin Wang wangde...@gmail.com wrote: I recently switched to Pig 0.9.1 and noticed it runs slower in local mode than previous versions (like 0.6, which until a couple of months ago was the only recent version supported on Amazon). Haven't tried the timing in hadoop mode yet. I figure it is probably due to some extra debugging or some parameter. Anything I can do to make it faster? Thanks, Dexin
pig 0.9 slower in local mode?
I recently switched to Pig 0.9.1 and noticed it runs slower in local mode than previous versions (like 0.6, which until a couple of months ago was the only recent version supported on Amazon). Haven't tried the timing in hadoop mode yet. I figure it is probably due to some extra debugging or some parameter. Anything I can do to make it faster? Thanks, Dexin
Re: multiple folder loading or passing comma on parameter with Amazon Pig
I will. There is also a bug in the Pig documentation here: http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html where it says: In this example the command is executed and its stdout is used as the parameter value. %declare CMD 'generate_date'; It should really be `generate_date` with backticks, not single quotes. On Wed, Aug 17, 2011 at 6:18 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Nice job figuring out a fix! You should seriously file a bug with EMR for that. That's kind of ridiculous. D On Wed, Aug 17, 2011 at 6:03 PM, Dexin Wang wangde...@gmail.com wrote: I solved my own problem and just want to share with whoever might encounter the same issue. I pass a colon-separated list, then convert it to a comma-separated list inside the pig script using the declare command. Submit the pig job like this:

  -p SOURCE_DIRS=2011-08:2011-07:2011-06

and in the Pig script:

  %declare SOURCE_DIRS_CONVERTED `echo $SOURCE_DIRS | tr ':' ','`;
  LOAD '/root_dir/{$SOURCE_DIRS_CONVERTED}' ...

On Wed, Aug 17, 2011 at 4:21 PM, Dexin Wang wangde...@gmail.com wrote: Hi, I'm running pig jobs using Amazon pig support, where you submit jobs with comma-concatenated parameters like this:

  elastic-mapreduce --pig-script --args myscript.pig --args -p,PARAM1=value1,-p,PARAM2=value2,-p,PARAM3=value3

In my script, I need to pass multiple directories for the pig script to load, like this:

  raw = LOAD '/root_dir/{$SOURCE_DIRS}'

and SOURCE_DIRS is computed. For example, it can be 2011-08,2011-07,2011-06, meaning my pig script needs to load data for the past 3 months. This works fine when I run my job in local or direct hadoop mode. But with Amazon pig, I have to do something like this:

  elastic-mapreduce --pig-script --args myscript.pig --args -p,SOURCE_DIRS=2011-08,2011-07,2011-06

but EMR will just replace the commas with spaces, so it breaks the parameter-passing syntax. I've tried adding backslashes before the commas, but I simply end up with a backslash and a space in between. So the questions become: 1.
Can I do something differently from what I'm doing to pass multiple folders to the pig script (without commas)? Or 2. does anyone know how to properly pass commas to elastic-mapreduce? Thanks! Dexin
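The colon-to-comma workaround from the reply above is a one-line substitution; here it is as a small Python helper for clarity (to_load_path is a hypothetical name, just mirroring the `tr ':' ','` declare trick):

```python
def to_load_path(source_dirs, root='/root_dir'):
    # EMR mangles commas in --args, so directories are passed colon-separated
    # and converted back to the comma-glob form Pig's LOAD expects.
    return '%s/{%s}' % (root, source_dirs.replace(':', ','))
```

The resulting string is exactly what ends up in the LOAD statement after parameter substitution.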
conditional and multiple generate inside foreach?
Is it possible to do conditionals and more than one GENERATE inside a FOREACH? For example, I have tuples like this (name, days_ago):

  (a,0)
  (b,1)
  (c,9)
  (d,40)

b shows up 1 day ago, so it belongs to all of the following: yesterday, last week, last month, and last quarter. So I'd like to turn the above into:

  (a,0,today)
  (b,1,yesterday)
  (b,1,week)
  (b,1,month)
  (b,1,quarter)
  (c,9,month)
  (c,9,quarter)
  (d,40,quarter)

I imagine/dream I could do something like this:

  B = FOREACH A {
    if (days_ago <= 90) generate name, days_ago, 'quarter';
    if (days_ago <= 30) generate name, days_ago, 'month';
    if (days_ago <= 7) generate name, days_ago, 'week';
    if (days_ago == 1) generate name, days_ago, 'yesterday';
    if (days_ago == 0) generate name, days_ago, 'today';
  }

Of course that's not valid syntax. I could write my own UDF, but it would be nice if there were some way to get what I want without a UDF. Thanks! Dexin
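If a UDF turns out to be acceptable, the bucketing can be a Python UDF that returns a bag, which is then FLATTENed in the FOREACH. A sketch matching the sample output above (note the sample puts day 0 only in 'today', so the sketch special-cases it; buckets and the threshold values are taken from the example, not from any Pig builtin):

```python
def buckets(name, days_ago):
    # Emit one (name, days_ago, bucket) tuple per time window the event
    # falls into; FLATTEN on the returned bag then fans rows out.
    out = []
    if days_ago == 0:
        return [(name, days_ago, 'today')]  # day 0 is only 'today'
    if days_ago == 1:
        out.append((name, days_ago, 'yesterday'))
    if days_ago <= 7:
        out.append((name, days_ago, 'week'))
    if days_ago <= 30:
        out.append((name, days_ago, 'month'))
    if days_ago <= 90:
        out.append((name, days_ago, 'quarter'))
    return out
```

An event older than 90 days yields an empty bag and is dropped by FLATTEN.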
Re: why the udf can not work
You need to have your class file at this path, /home/huyong/test/myudfs/UPPER.class, since it's in the myudfs package. On Jun 18, 2011, at 12:33 PM, 勇胡 yongyong...@gmail.com wrote: I tried your command and it shows me the following: /home/huyong/test/UPPER.class /home/huyong/test/UPPER.java Yong On Jun 18, 2011 at 4:29 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: This usually happens when you aren't registering what you think you are registering. Try `jar tf /home/huyong/test/myudfs.jar | grep UPPER` and see if you get anything. D 2011/6/18 勇胡 yongyong...@gmail.com: Hi,

  package myudfs;

  import java.io.IOException;
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.Tuple;
  import org.apache.pig.impl.util.*;

  public class UPPER extends EvalFunc<String> {
      public String exec(Tuple input) throws IOException {
          if (input == null || input.size() == 0)
              return null;
          try {
              String str = (String)input.get(0);
              return str.toUpperCase();
          } catch (Exception e) {
              throw new IOException(e);
          }
      }
  }

This is the same as the example from the Pig website. By the way, I also added the PIG_CLASSPATH. But it still didn't work. Yong 2011/6/18 Jonathan Coveney jcove...@gmail.com Can you paste the content of the UDF? 2011/6/18 勇胡 yongyong...@gmail.com Hello, I just tried the example from the Pig UDF manual step by step, but I got this error. Can anyone tell me how to solve it?

  grunt> REGISTER /home/huyong/test/myudfs.jar;
  grunt> A = LOAD '/home/huyong/test/student.txt' as (name:chararray);
  grunt> B = FOREACH A GENERATE myudfs.UPPER(name);
  2011-06-18 11:15:38,892 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve myudfs.UPPER using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
  Details at logfile: /home/huyong/test/pig_1308388238352.log

I have already registered the UDF; why does Pig try to search the builtin path? Thanks for your help! Yong Hu
Re: pig script takes much longer than java MR job
Yeah, that sounds like a lot to dump if it takes 15 minutes to run. That alone can take a long time. I once forgot to comment out a debug line in my UDF. When run with production data, not only was it slow, it blew up the cluster - simply ran out of log space :) On Jun 17, 2011, at 5:06 PM, Jonathan Coveney jcove...@gmail.com wrote: A couple of possibilities that I'm kicking around off the top of my head... 1) Does your MR job also sort afterwards? That's going to kick off another MR job. 2) Does your MR job compile all the results into one job? My guess is the ORDER + DUMP are making it take longer. 2011/6/17 Sujee Maniyam su...@sujee.net I have log files like this:

  #timestamp (ms), server, user, action, domain, x, y, z
  126233288, 7, 50817, 2, yahoo.com, 31, blahblah, foobar
  1262332800017, 2, 373168, 0, google.com, 67, blahblah, foobar
  1262332800025, 8, 172910, 1, facebook.com, 135, blahblah, foobar

I have the following pig script to count the number of domains from the logs (for example, we have seen facebook.com 10 times, etc.). Here is the pig script:

  records = LOAD '/logs-in/*.log' USING PigStorage(',') AS (ts:long, server:int, user:int, action_id:int, domain:chararray, price:int);
  -- DUMP records;
  grouped_by_domain = GROUP records BY domain;
  -- DUMP grouped_by_domain;
  -- DESCRIBE grouped_by_domain;
  freq = FOREACH grouped_by_domain GENERATE group AS domain, COUNT(records) AS mycount;
  -- DESCRIBE freq;
  -- DUMP freq;
  sorted = ORDER freq BY mycount DESC;
  DUMP sorted;

This script takes an hour to run. I also wrote a simple Java MR job to count the domains; it takes about 15 mins. So the pig script is taking 4x longer to complete. Any suggestions on what I am doing wrong in Pig? thanks Sujee http://sujee.net
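As a sanity check of what the script computes, here is the same GROUP / COUNT / ORDER ... DESC pipeline in plain Python (domain_counts is a hypothetical name; field 4 is the domain column in the log format above):

```python
from collections import Counter

def domain_counts(lines):
    # group by domain, count, then sort descending by count -
    # the in-memory equivalent of the Pig GROUP/COUNT/ORDER pipeline
    counts = Counter(line.split(',')[4].strip() for line in lines)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
```

Running the small-scale version on a sample of the logs is also a quick way to validate the Pig output before an hour-long cluster run.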
Re: running pig on amazon ec2
Thanks a lot for the good advice. I'll see if I can get LZO set up. Currently I'm using EMR, which uses Pig 0.6. I'll look into Whirr to start the Hadoop cluster on EC2. There is one place in my job where I can use a replicated join; I'm sure that will cut down some time. What I find interesting is that without doing any optimization on the configuration or code side, I get a 2x to 4x speedup just by using the Cluster Compute Quadruple Extra Large instance (cc1.4xlarge) as opposed to the regular Large instance (m1.large), dollar for dollar. They do claim cc1.4xlarge's I/O is very high. Since I suspect most of my job was spending time reading/writing disk, this speedup makes sense. On Wed, Jun 15, 2011 at 6:46 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: You need to add this to your pig.properties:

  pig.tmpfilecompression=true
  pig.tmpfilecompression.codec=lzo

Make sure that you are running Hadoop 0.20.2 or higher, Pig 0.8.1 or higher, and that all the LZO stuff is set up -- it's a bit involved. Use replicated joins where possible. If you are doing a large number of small jobs, scheduling and provisioning are likely to dominate -- tune your job scheduler to schedule more tasks per heartbeat, and make sure your jar is as small as you can get it (there's a lot of unjarring going on in Hadoop). D On Wed, Jun 15, 2011 at 11:14 AM, Dexin Wang wangde...@gmail.com wrote: Tomas, What worked well for me is still to be figured out. Right now it works, but it's too slow. I think one of the main problems is that my job has many JOIN/GROUP BY steps, so lots of intermediate steps end up writing to disk, which is slow. On that note, does anyone know how to tell whether LZO is turned on for intermediate jobs?
Referring to this http://pig.apache.org/docs/r0.8.0/cookbook.html#Compress+the+Results+of+Intermediate+Jobs and this http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/ I see I have this in my mapred-site.xml file:

  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>

Is that all I need to have map output compression turned on? Thanks. Dexin On Tue, Jun 14, 2011 at 3:36 PM, Tomas Svarovsky svarovsky.to...@gmail.com wrote: Hi Dexin, Since I am a Pig and MapReduce newbie, your post is very intriguing to me. I am coming from a Talend background and trying to assess whether map/reduce would bring any speedup and faster turnaround to my projects. My worry is that my data are too small, so the map/reduce overhead will be prohibitive in certain cases. When using Talend, if the transformation was reasonable it could process tens of thousands of rows per second. Processing 1 million rows could be finished well under a minute, so I think that your dataset is fairly small. Nevertheless my data are growing, so soon it will be time for Pig. Could you provide some info on what worked well for you to run your job on EC2? Thanks in advance, Tomas On Tue, Jun 14, 2011 at 9:16 PM, Daniel Dai jiany...@yahoo-inc.com wrote: If the job finishes in 3 minutes in local mode, I would think it is small. On 06/14/2011 11:07 AM, Dexin Wang wrote: Good to know. Trying a single-node hadoop cluster now. The main input is about 1+ million lines of events. After some aggregation, it joins with another input source which also has about 1+ million rows. Is this considered a small query? Thanks. On Tue, Jun 14, 2011 at 11:01 AM, Daniel Dai jiany...@yahoo-inc.com wrote: Local mode and mapreduce mode make a huge difference. For a small query, the mapreduce overhead will dominate. For a fair comparison, can you set up a single-node hadoop cluster on your laptop and run Pig on it?
Daniel On 06/14/2011 10:54 AM, Dexin Wang wrote: Thanks for your feedback. My comments below. On Tue, Jun 14, 2011 at 10:41 AM, Daniel Dai jiany...@yahoo-inc.com wrote: Curious, a couple of questions: 1. Are you running in local mode or mapreduce mode? Local mode (-x local) when I ran it on my laptop, and mapreduce mode when I ran it on the EC2 cluster. 2. If mapreduce mode, did you look into the hadoop logs to see how much each mapreduce job slows down? I'm looking into that. 3. What kind of query is it? The input is gzipped JSON files with one event per line. Then I do some hourly aggregation on the raw events, then a bunch of grouping, joining and some metrics computation (like median, variance) on some fields. Daniel Someone mentioned it's EC2's I/O performance. But I'm sure there are plenty of people using EC2/EMR running big MR jobs, so more likely I have some configuration issue? My jobs can be optimized a bit, but the fact that running on my laptop is faster tells me this is a separate issue.
Re: running pig on amazon ec2
Thanks for your feedback. My comments below. On Tue, Jun 14, 2011 at 10:41 AM, Daniel Dai jiany...@yahoo-inc.com wrote: Curious, a couple of questions: 1. Are you running in local mode or mapreduce mode? Local mode (-x local) when I ran it on my laptop, and mapreduce mode when I ran it on the EC2 cluster. 2. If mapreduce mode, did you look into the hadoop logs to see how much each mapreduce job slows down? I'm looking into that. 3. What kind of query is it? The input is gzipped JSON files with one event per line. Then I do some hourly aggregation on the raw events, then a bunch of grouping, joining and some metrics computation (like median, variance) on some fields. Daniel Someone mentioned it's EC2's I/O performance. But I'm sure there are plenty of people using EC2/EMR running big MR jobs, so more likely I have some configuration issue? My jobs can be optimized a bit, but the fact that running on my laptop is faster tells me this is a separate issue. Thanks! On 06/13/2011 11:54 AM, Dexin Wang wrote: Hi, This is probably not directly a Pig question. Anyone running Pig on Amazon EC2 instances? Something's not making sense to me. I ran a Pig script that has about 10 mapred jobs in it on a 16-node cluster using m1.small. It took *13 minutes*. The job reads input from S3 and writes output to S3, but from the logs the reading and writing to/from S3 is pretty fast, and all the intermediate steps should happen on HDFS. Running the same job on my MacBook Pro laptop only took *3 minutes*. Amazon is using Pig 0.6 while I'm using Pig 0.8 on the laptop; I'll try Pig 0.6 on my laptop. Some hadoop config is probably also not ideal. I tried m1.large instead of m1.small; it doesn't seem to make a huge difference. Anything you would suggest to look at for the slowness on EC2? Dexin
running pig on amazon ec2
Hi, This is probably not directly a Pig question. Anyone running Pig on Amazon EC2 instances? Something's not making sense to me. I ran a Pig script that has about 10 mapred jobs in it on a 16-node cluster using m1.small. It took *13 minutes*. The job reads input from S3 and writes output to S3, but from the logs the reading and writing to/from S3 is pretty fast, and all the intermediate steps should happen on HDFS. Running the same job on my MacBook Pro laptop only took *3 minutes*. Amazon is using Pig 0.6 while I'm using Pig 0.8 on the laptop; I'll try Pig 0.6 on my laptop. Some hadoop config is probably also not ideal. I tried m1.large instead of m1.small; it doesn't seem to make a huge difference. Anything you would suggest to look at for the slowness on EC2? Dexin
Re: Setting the store file name with date
I don't think the version is a problem; parameter substitution has probably been supported since the earliest versions of Pig. With the STORE result INTO 'out-$date'; I mentioned, you just add -param date=20110522 to your command line when you run the pig script. What is the problem you see? 2011/5/21 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com Thanks Dexin! I tried that but it did not work for me. I am using Pig 0.7; what version of Pig do you use? Yeah, I think I could move it aside, but the problem is that I need to keep track of the results, and if I move them aside then I would have to rename the results of each job sequentially, because my jobs can repeat many times but their results are different. Thanks again. Renato M. 2011/5/20 Dexin Wang wangde...@gmail.com: Yeah, I do that all the time: STORE result INTO 'out-$date'; Or you could run the pig script and, after it's done, move the result aside. On May 20, 2011, at 6:51 PM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Hi, I have a sequence of jobs which are run daily, and usually the logs and results are erased every time they have to be re-run. Now we want to keep those logs and results, but if the results already exist, the pig job fails. I thought that maybe setting the results' name + date would solve it for me. Can I do that from pig? Do you guys have any other suggestions? Thanks in advance. Renato M.
Re: Setting the store file name with date
Yeah, I do that all the time: STORE result INTO 'out-$date'; Or you could run the pig script and, after it's done, move the result aside. On May 20, 2011, at 6:51 PM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Hi, I have a sequence of jobs which are run daily, and usually the logs and results are erased every time they have to be re-run. Now we want to keep those logs and results, but if the results already exist, the pig job fails. I thought that maybe setting the results' name + date would solve it for me. Can I do that from pig? Do you guys have any other suggestions? Thanks in advance. Renato M.
elephantbird JsonLoader doesn't like gz?
Hi, Anyone using Twitter's elephant-bird library? I was using its JsonLoader and got this error:

  WARN com.twitter.elephantbird.pig.load.JsonLoader - Could not json-decode string Unexpected character () at position 0.
  at org.json.simple.parser.Yylex.yylex(Unknown Source)
  at org.json.simple.parser.JSONParser.nextToken(Unknown Source)
  at org.json.simple.parser.JSONParser.parse(Unknown Source)
  at org.json.simple.parser.JSONParser.parse(Unknown Source)

But if I manually gunzip the file to a clear-text JSON file, JsonLoader works fine. Again, this fails:

  raw_json = LOAD 'cc.json.gz' USING com.twitter.elephantbird.pig.load.JsonLoader();

and this works:

  $ gunzip cc.json.gz
  raw_json = LOAD 'cc.json' USING com.twitter.elephantbird.pig.load.JsonLoader();

Any suggestions for this? Or is there any other JSON loader library out there? I can write my own but would rather use one if it already exists. Thanks, Dexin
Re: elephantbird JsonLoader doesn't like gz?
Or is it because I'm using Pig 0.6 where gz format is not supported? I'll run this on aws EMR which only pig 0.6 is supported. I have to use later version of Pig? On Wed, May 18, 2011 at 11:12 AM, Dexin Wang wangde...@gmail.com wrote: Hi, Anyone using Twitter's elephantbird library? I was using its JsonLoader and got this error: WARN com.twitter.elephantbird.pig.load.JsonLoader - Could not json-decode string Unexpected character () at position 0. at org.json.simple.parser.Yylex.yylex(Unknown Source) at org.json.simple.parser.JSONParser.nextToken(Unknown Source) at org.json.simple.parser.JSONParser.parse(Unknown Source) at org.json.simple.parser.JSONParser.parse(Unknown Source) But if I manually gunzip the file to a clear text json file, JsonLoader works fine. Again this fails: raw_json = LOAD 'cc.json.gz' USING com.twitter.elephantbird.pig.load.JsonLoader(); this works: $ gunzip cc.json.gz raw_json = LOAD 'cc.json' USING com.twitter.elephantbird.pig.load.JsonLoader(); Any suggestions for this? Or is there any other json loader library out there? I can write my own but would rather use one if already exists. Thanks, Dexin
Re: elephantbird JsonLoader doesn't like gz?
Turns out it's only a problem if I run it in local mode; running it on the cluster doesn't have this problem. I'm using EB 1.2.5. I wonder how you fixed the problem, since it seems it's not an EB problem. Or are you gunzipping it in the EB load function? On Wed, May 18, 2011 at 8:43 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Which version of EB are you using? I recently fixed this for someone; I believe it's been in every version since 1.2.3. D On Wed, May 18, 2011 at 11:26 AM, Dexin Wang wangde...@gmail.com wrote: Or is it because I'm using Pig 0.6, where the gz format is not supported? I'll run this on AWS EMR, where only Pig 0.6 is supported. Do I have to use a later version of Pig? On Wed, May 18, 2011 at 11:12 AM, Dexin Wang wangde...@gmail.com wrote: Hi, Anyone using Twitter's elephant-bird library? I was using its JsonLoader and got this error: WARN com.twitter.elephantbird.pig.load.JsonLoader - Could not json-decode string Unexpected character () at position 0. at org.json.simple.parser.Yylex.yylex(Unknown Source) at org.json.simple.parser.JSONParser.nextToken(Unknown Source) at org.json.simple.parser.JSONParser.parse(Unknown Source) at org.json.simple.parser.JSONParser.parse(Unknown Source) But if I manually gunzip the file to a clear-text JSON file, JsonLoader works fine. Again, this fails: raw_json = LOAD 'cc.json.gz' USING com.twitter.elephantbird.pig.load.JsonLoader(); and this works: $ gunzip cc.json.gz raw_json = LOAD 'cc.json' USING com.twitter.elephantbird.pig.load.JsonLoader(); Any suggestions for this? Or is there any other JSON loader library out there? I can write my own but would rather use one if it already exists. Thanks, Dexin
Re: reducer throttling?
Thanks for your explanation, Alex. In some cases there isn't even a reduce phase. For example, we have some raw data that, after our custom LOAD function and some filter function, goes directly into the DB. And since we don't have control over the number of mappers, we end up with too many DB writers. That's why I had to add the artificial reduce phase I mentioned earlier, so that we can throttle it down. We could also do what someone else suggested: add a post-processing step that writes the output to HDFS and loads the DB from that. But there are other considerations that make us prefer not to do that if we don't have to.

On Thu, Mar 17, 2011 at 2:16 PM, Alex Rovner alexrov...@gmail.com wrote:

Dexin, you can control the number of reducers by adding the following to your Pig script:

SET default_parallel 29;

Pig will run with 29 reducers with the above statement. As far as the bulk insert goes: we are using MS-SQL as our database, but MySQL would be able to handle the bulk insert the same way. Essentially we direct the output of the job into a temporary folder in order to know the output of this particular run. If you set the number of reducers to 29, you will have 29 files in the temp folder after the job completes. You can then run a bulk-insert SQL command on each of the resulting files, pointing at HDFS either through FUSE (the way we do it), or you can copy the resulting files to a Samba share or NFS and point the SQL server at that location. To bulk insert, you would have to either do this in a post-processing script or write your own storage func that takes care of it. The storage func is tricky, since you will need to implement your own OutputCommitter (see https://issues.apache.org/jira/browse/PIG-1891).

Let me know if you have further questions. Alex

On Thu, Mar 17, 2011 at 5:00 PM, Dexin Wang wangde...@gmail.com wrote: Can you describe a bit more about your bulk insert technique?
And is the way you control the number of reducers also by adding an artificial ORDER or GROUP step? Thanks!

On Thu, Mar 17, 2011 at 1:33 PM, Alex Rovner alexrov...@gmail.com wrote:

We use a bulk-insert technique after the job completes. You can control the size of each bulk insert by controlling the number of reducers.

Sent from my iPhone

On Mar 17, 2011, at 2:03 PM, Dexin Wang wangde...@gmail.com wrote:

We do some processing in Hadoop, then as the last step we write the result to a database. The database is not good at handling hundreds of concurrent connections and fast writes, so we need to throttle down the number of tasks that write to the DB. Since we have no control over the number of mappers, we add an artificial reducer step to achieve that, either with a GROUP or an ORDER, like this:

sorted_data = ORDER data BY f1 PARALLEL 10;
-- then write sorted_data to DB

or

grouped_data = GROUP data BY f1 PARALLEL 10;
data_to_write = FOREACH grouped_data GENERATE $1;

I feel neither is a good approach. They just add unnecessary computing time, especially the first one. And GROUP may run into the too-large-bags issue. Any better suggestions?
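As an aside (not from the thread): the bulk-insert idea — buffer rows and flush them in fixed-size batches, so each writer issues a few large inserts instead of one per record — can be sketched in miniature. BatchWriter is a hypothetical name, and the Consumer sink stands in for a JDBC addBatch/executeBatch call:

```java
import java.util.*;
import java.util.function.Consumer;

public class BatchWriter {
    // Hypothetical sketch: buffer rows and hand them to a sink in
    // batches, the way a bulk-insert storage step batches rows
    // instead of issuing one INSERT per record.
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private final Consumer<List<String>> sink;  // stands in for a JDBC batch execute

    public BatchWriter(int batchSize, Consumer<List<String>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    public void write(String row) {
        buffer.add(row);
        if (buffer.size() >= batchSize) flush();
    }

    public void flush() {
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        BatchWriter w = new BatchWriter(3, b -> batchSizes.add(b.size()));
        for (int i = 0; i < 7; i++) w.write("row" + i);
        w.flush();  // final partial batch
        System.out.println(batchSizes);  // [3, 3, 1]
    }
}
```

The same buffering shape works whether the flush target is a JDBC batch, a file on HDFS for a later bulk-load, or anything else that prefers few large writes over many small ones.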
possibly Pig throttles the number of mappers
Hi, we've seen a strange problem where some Pig jobs run fewer concurrent mappers than the mapper capacity. Specifically, we have a 10-node cluster and each node is configured for 12 mappers, so normally we have 120 mappers running. But some Pig jobs will only run 10 mappers (while nothing else is running), and it actually appears to be 1 mapper per node. We have not noticed the same problem with other, non-Pig Hadoop jobs. Has anyone experienced the same thing, and is there any explanation or remedy? Thanks! Dexin
reducer throttling?
We do some processing in Hadoop, then as the last step we write the result to a database. The database is not good at handling hundreds of concurrent connections and fast writes, so we need to throttle down the number of tasks that write to the DB. Since we have no control over the number of mappers, we add an artificial reducer step to achieve that, either with a GROUP or an ORDER, like this:

sorted_data = ORDER data BY f1 PARALLEL 10;
-- then write sorted_data to DB

or

grouped_data = GROUP data BY f1 PARALLEL 10;
data_to_write = FOREACH grouped_data GENERATE $1;

I feel neither is a good approach. They just add unnecessary computing time, especially the first one. And GROUP may run into the too-large-bags issue. Any better suggestions?
Re: Pig optimization getting in the way?
So I can create a DB connection for each (jdbc_url, table) pair and map each pair to its own connection in the record writer. Is that what you are suggesting? Sounds like a good plan. Thanks.

On Fri, Feb 18, 2011 at 5:31 PM, Thejas M Nair te...@yahoo-inc.com wrote:

As you suspect, both store functions are probably running in the same map or reduce task. This is a result of multi-query optimization. Try

pig -e 'explain -script yourscript.pig'

to see the query plan, and you will be able to verify whether the stores happen in the same map/reduce task. Can you make the DB connection a member of the store function / record writer? You can also use -no_multiquery to prevent multi-query optimization from happening, but that will also result in the MR jobs being executed again for the other output. Thanks, Thejas

On 2/18/11 4:48 PM, Dexin Wang wangde...@gmail.com wrote:

I hope that's the case, but mapred.job.reuse.jvm.num.tasks is 1. However, it does seem to be doing the writes to the two DB tables in the same job, so although it's not reusing the JVM, the two stores are already in one JVM since they're in the same task! And since the DB connection is static/singleton as you mentioned, and the table name (which is the only thing that differs) is not part of the connection URL, they share the same DB connection, and one of them closes the connection when it's done. Hmm, any suggestions for how we can handle this? Thanks.

On Fri, Feb 18, 2011 at 3:38 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Let me guess -- you have a static JDBC connection that you open in myJDBC, and you have JVM reuse turned on.

On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang wangde...@gmail.com wrote:

I ran into a problem that I have spent quite some time on, and I'm starting to think Pig is doing some optimization that makes this hard. This is my pseudo code:

raw = LOAD ...
-- then some crazy stuff: filter, join, group, UDFs, etc.
A = the result of the above operations
STORE A INTO 'dummy' USING myJDBC(write to table1);

This works fine, and I get 4 map-reduce jobs. Then I add this after it:

B = FILTER A BY col1=xyz;
STORE B INTO 'dummy2' USING myJDBC(write to table2);

Basically I do some filtering of A and write it to another table through JDBC. Then the jobs fail with: PSQLException: This statement has been closed. My workaround for now is to add EXEC; before the B line so the two stores write to the DB in sequence. This works, but now it runs the same map-reduce jobs twice - I end up with 8 jobs. I think the reason for the failure without the EXEC line is that Pig tries to do the two STOREs in the same reducer (or mapper, maybe), since B only involves a FILTER, which doesn't require a separate map-reduce job, and then gets confused. Is there a way to make this work without having to duplicate the jobs? Thanks a lot!
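One way to realize the "(jdbc_url, table) pair" idea from this exchange, sketched with a stub in place of java.sql.Connection (class and method names here are hypothetical, not Pig or JDBC API): a shared pool hands out one connection per key, so the store function writing table1 can close its connection without killing table2's.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConnectionRegistry {
    // Stub standing in for java.sql.Connection in this sketch.
    static class Conn {
        final String key;
        boolean closed = false;
        Conn(String key) { this.key = key; }
        void close() { closed = true; }
    }

    // One shared connection per (jdbcUrl, table) pair, instead of a
    // single static connection that two store functions fight over.
    private static final Map<String, Conn> POOL = new ConcurrentHashMap<>();

    static Conn connectionFor(String jdbcUrl, String table) {
        return POOL.computeIfAbsent(jdbcUrl + "|" + table, Conn::new);
    }

    public static void main(String[] args) {
        Conn a = connectionFor("jdbc:postgresql://db/x", "table1");
        Conn b = connectionFor("jdbc:postgresql://db/x", "table2");
        a.close();  // closing table1's connection leaves table2's open
        System.out.println(a == b);    // false
        System.out.println(b.closed);  // false
    }
}
```

The simpler fix mentioned in the thread — making the connection an instance member of each record writer rather than static — avoids the shared state entirely; the keyed pool is only worth it if connections must be shared for resource reasons.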
Pig optimization getting in the way?
I ran into a problem that I have spent quite some time on, and I'm starting to think Pig is doing some optimization that makes this hard. This is my pseudo code:

raw = LOAD ...
-- then some crazy stuff: filter, join, group, UDFs, etc.
A = the result of the above operations
STORE A INTO 'dummy' USING myJDBC(write to table1);

This works fine, and I get 4 map-reduce jobs. Then I add this after it:

B = FILTER A BY col1=xyz;
STORE B INTO 'dummy2' USING myJDBC(write to table2);

Basically I do some filtering of A and write it to another table through JDBC. Then the jobs fail with: PSQLException: This statement has been closed. My workaround for now is to add EXEC; before the B line so the two stores write to the DB in sequence. This works, but now it runs the same map-reduce jobs twice - I end up with 8 jobs. I think the reason for the failure without the EXEC line is that Pig tries to do the two STOREs in the same reducer (or mapper, maybe), since B only involves a FILTER, which doesn't require a separate map-reduce job, and then gets confused. Is there a way to make this work without having to duplicate the jobs? Thanks a lot!
Re: Pig optimization getting in the way?
I hope that's the case, but mapred.job.reuse.jvm.num.tasks is 1. However, it does seem to be doing the writes to the two DB tables in the same job, so although it's not reusing the JVM, the two stores are already in one JVM since they're in the same task! And since the DB connection is static/singleton as you mentioned, and the table name (which is the only thing that differs) is not part of the connection URL, they share the same DB connection, and one of them closes the connection when it's done. Hmm, any suggestions for how we can handle this? Thanks.

On Fri, Feb 18, 2011 at 3:38 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Let me guess -- you have a static JDBC connection that you open in myJDBC, and you have JVM reuse turned on.

On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang wangde...@gmail.com wrote: [original message, quoted in full in the previous post]
Re: Use Filename in Tuple
Similarly, is it possible to insert literal values into a tuple stream? For example, when I invoke my Pig script I already know what the data source is (say, it's from filename_2011-02-03), so I can pass it to Pig using -param, and I want to insert this known file name into the tuple stream. How can I do that? For example, I have:

grunt> A = LOAD 'aa' AS (f1, f2);
grunt> DUMP A;
(aa,bb)
(cc,dd)

I want to do something like:

grunt> B = FOREACH A GENERATE f1, filename-2011-02-03;

Thanks.

On Thu, Feb 3, 2011 at 7:49 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

In Pig 0.6, you can hook into bindTo() and save the file name. In Pig 0.8 you have to find your way to the underlying InputSplit via PigSplit.getWrappedSplit(), cast it to FileSplit, and call getPath() on it... I think. Haven't done this. This will totally break if you have split combination turned on, of course, since Pig can silently move to a different file under you, so you'd have to turn that off. D

On Thu, Feb 3, 2011 at 3:52 PM, Kim Vogt k...@simplegeo.com wrote: Hey, I have a bunch of files where the filename is significant. I'm loading the files by supplying the top-level directory that contains them. Is there a way to capture the filename of each file and append it to the tuples of data in that file? -Kim
Re: Use Filename in Tuple
Wow, I almost got it right. Double quotes fail; single quotes work. Thanks.

On Thu, Feb 3, 2011 at 9:40 PM, Kim Vogt k...@simplegeo.com wrote:

This should work:

grunt> B = FOREACH A GENERATE f1, 'filename-2011-02-03';

or

grunt> B = FOREACH A GENERATE f1, '$paramName';

-Kim

On Thu, Feb 3, 2011 at 8:32 PM, Dexin Wang wangde...@gmail.com wrote: [previous message, quoted in full above]
Re: failed to produce result
Thanks. That URL doesn't tell much:

Job Name: Job387546913066708402.jar
Job File: hdfs://hadoop-name01/hadoop/mapred/system/job_201101260357_3230/job.xml
Job Setup: None
Status: Failed
Started at: Mon Jan 31 15:48:36 CST 2011
Failed at: Mon Jan 31 15:48:36 CST 2011
Failed in: 0sec
Job Cleanup: None

Both the map and reduce phases show 100% complete with 0 tasks (none pending, running, complete, killed, or failed), and the counter table is empty.

On Mon, Jan 31, 2011 at 2:37 PM, Thejas M Nair te...@yahoo-inc.com wrote:

The logs say that the map-reduce job failed. Can you check the log files of the failed map-reduce tasks? You can follow the jobtracker URL in the log message: http://hadoop-name02:50030/jobdetails.jsp?jobid=job_201101260357_3230 -Thejas

On 1/31/11 1:54 PM, Dexin Wang wangde...@gmail.com wrote:

Hi, I found similar problems on the web but didn't find a solution, so I'm asking here. I have a Pig job that has been working fine for a couple of months, and it started failing. The same job still works if run as another account. I narrowed it down a bit and found that the problematic user account can't even do a simple DUMP.
--
grunt> A = LOAD '/user/myuser1/aa' AS (f1, f2);
grunt> DESCRIBE A;
A: {f1: bytearray,f2: bytearray}
grunt> DUMP A;
2011-01-31 15:48:34,141 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: Store(hdfs://hadoop-name01/tmp/temp811847645/tmp1546738024:org.apache.pig.builtin.BinStorage) - 1-10 Operator Key: 1-10)
2011-01-31 15:48:34,142 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2011-01-31 15:48:34,142 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2011-01-31 15:48:34,153 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2011-01-31 15:48:35,562 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2011-01-31 15:48:35,574 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2011-01-31 15:48:35,578 [Thread-23] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2011-01-31 15:48:36,077 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2011-01-31 15:48:36,176 [Thread-23] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2011-01-31 15:48:36,176 [Thread-23] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2011-01-31 15:48:37,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201101260357_3230
2011-01-31 15:48:37,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: http://hadoop-name02:50030/jobdetails.jsp?jobid=job_201101260357_3230
2011-01-31 15:48:41,605 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2011-01-31 15:48:41,605 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map reduce job(s) failed!
2011-01-31 15:48:41,607 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed to produce result in: hdfs://hadoop-namenode01/tmp/temp811847645/tmp1546738024
2011-01-31 15:48:41,607 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2011-01-31 15:48:41,619 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias A
--

While running the same LOAD, DUMP works fine with another user account. We also confirmed there is no disk-space or quota issue on the namenode. Any ideas? This is similar to the issue reported here: http://web.archiveorange.com/archive/v/3inw3wuad4S3zjAz89y5
wild card for all fields in a tuple
Hi, I hope there is a simple answer to this. I have a bunch of rows; for each row I want to add a column that is derived from some existing columns. I have a large number of columns in my input tuples, so I don't want to repeat the names with AS when I GENERATE. Is there an easy way to just append a column to the tuples without having to touch the rest of the tuple in the output? Here's my example:

grunt> DESCRIBE X;
X: {id: chararray,v1: int,v2: int}
grunt> DUMP X;
(a,3,42)
(b,2,4)
(c,7,32)

I can do this:

grunt> Y = FOREACH X GENERATE (v2 - v1) AS diff, id, v1, v2;
grunt> DUMP Y;
(39,a,3,42)
(2,b,2,4)
(25,c,7,32)

But I would prefer not to have to list all the v's; I may have v1, v2, v3, ..., v100. Of course this doesn't work:

grunt> Y = FOREACH X GENERATE (v2 - v1) AS diff, FLATTEN(X);

What can be done to simplify this? And a related question: what is the schema after the FOREACH? I wish I could do a DESCRIBE after a FOREACH. Thanks!!
Re: wild card for all fields in a tuple
Yeah, that works great. Thanks, Jonathan and Alan. I can see that the "all fields in between" feature will be totally useful for some cases.

On Wed, Jan 12, 2011 at 3:33 PM, Alan Gates ga...@yahoo-inc.com wrote:

Jonathan is right, you can do all fields in a tuple with *. I was thinking of doing all fields in between two fields, which you can't do yet. Alan.

On Jan 12, 2011, at 3:18 PM, Alan Gates wrote:

There isn't a way to do that yet. See https://issues.apache.org/jira/browse/PIG-1693 for our plans on adding it in the next release. Alan.

On Jan 12, 2011, at 2:51 PM, Dexin Wang wrote: [original message, quoted in full above]
how to use builtin String functions
I see there are some builtin string functions, but I don't know how to use them. I get this error when I follow the examples:

grunt> REGEX_EXTRACT_ALL('192.168.1.5:8020', '(.*)\:(.*)');
2011-01-12 19:34:23,773 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered IDENTIFIER REGEX_EXTRACT_ALL at line 1, column 1.

Thanks.
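No reply is archived here, but for context: Pig builtins such as REGEX_EXTRACT_ALL can only appear inside an expression (for example in a FOREACH ... GENERATE), not as a standalone statement, which is why the parser rejects the bare call. As for the pattern itself, REGEX_EXTRACT_ALL uses Java regular-expression matching underneath; a quick standalone sketch of what '(.*):(.*)' captures (extractAll here is a hypothetical helper, not the Pig builtin):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractDemo {
    // What REGEX_EXTRACT_ALL('192.168.1.5:8020', '(.*):(.*)') captures,
    // reproduced with plain java.util.regex: match the whole input and
    // return every capture group, or null if the pattern doesn't match.
    static String[] extractAll(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) return null;
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) groups[i] = m.group(i + 1);
        return groups;
    }

    public static void main(String[] args) {
        String[] parts = extractAll("192.168.1.5:8020", "(.*):(.*)");
        System.out.println(parts[0] + " / " + parts[1]);  // 192.168.1.5 / 8020
    }
}
```

Note that the first group is greedy, so with more than one ':' in the input it would capture everything up to the last colon.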
Re: set reducer timeout with pig
It doesn't seem to work. I got:

2010-12-22 21:36:59,120 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Unrecognized set key: mapred.task.timeout

I did this in my Pig script:

SET mapred.task.timeout 180;

What did I do wrong? Thanks.

% pig --version
Apache Pig version 0.7.0+9 (rexported)
compiled Jun 28 2010, 12:53:50

On Tue, Dec 21, 2010 at 2:39 PM, Daniel Dai jiany...@yahoo-inc.com wrote:

True, however there is a bug in 0.7 that we fixed in 0.8: https://issues.apache.org/jira/browse/PIG-1760

Daniel

Ashutosh Chauhan wrote:

Ideally you need not do that. Pig automatically takes care of progress reporting in its operators. Do you have a Pig script that fails because of progress-reporting timeout issues?

Ashutosh

On Tue, Dec 21, 2010 at 13:23, Dexin Wang wangde...@gmail.com wrote:

Hi, how do I change the default timeout for reducers with Pig? I have some reducers that need to take longer than 10 minutes to finish. It is pretty frustrating to see many of them get to 95% complete and then get killed. Thanks. Dexin
increment counters in Pig UDF
Is it possible to increment a counter in a Pig UDF (in a Load/Eval/Store Func)? Since we have access to counters through org.apache.hadoop.mapred.Reporter (http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Counters), the other way to ask this question is: how do we get an instance of Reporter in a UDF? Thanks. Dexin
Eval UDF passing parameters
Hi, this might be a dumb question: is it possible to pass anything other than the input tuple to a UDF eval function? Basically, in my UDF I need to do some user-info lookup. The input will be (userid,f1,f2); with this UDF I want to convert it to something like (userid,age,gender,location,f1,f2), where the UDF does a DB lookup on the userid and returns the user's info (age, gender, etc.). But I don't necessarily want to pass back the same user-info fields every time; sometimes I only want age, and sometimes age, location, and so on. I hope there is a way to tell the UDF which fields I want. What's the best way to achieve this without having to write a separate UDF for every case? Thanks. Dexin
Re: Eval UDF passing parameters
Ah nice. Thank you so much, Zach!

On Tue, Dec 7, 2010 at 11:47 AM, Zach Bailey zach.bai...@dataclip.com wrote:

You can pass parameters via the UDF constructor. For example:

public MyUDF(boolean includeAge, boolean includeGender)

Then you would initialize it like so in your Pig script:

DEFINE MY_UDF_ONLY_AGE com.package.MyUDF(true, false);

and use it like:

data_with_age = FOREACH data GENERATE user_id, MY_UDF_ONLY_AGE(user_id);

HTH, Zach

On Tuesday, December 7, 2010 at 2:44 PM, Dexin Wang wrote: [previous message, quoted in full above]
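A minimal self-contained sketch of the constructor-parameter pattern Zach describes (class, field, and method names are hypothetical, and the Pig EvalFunc plumbing is omitted so it stays runnable on its own). One caveat: arguments passed through DEFINE reach the constructor as strings, so the flags are parsed from String parameters here. lookup() is a stub standing in for the real DB call:

```java
import java.util.*;

public class UserInfoLookup {
    // Hypothetical UDF-style class: constructor flags choose which
    // user-info fields exec() appends, so one class covers every case.
    private final boolean includeAge;
    private final boolean includeGender;

    // DEFINE passes constructor arguments as strings, e.g.
    //   DEFINE ONLY_AGE com.example.UserInfoLookup('true', 'false');
    public UserInfoLookup(String includeAge, String includeGender) {
        this.includeAge = Boolean.parseBoolean(includeAge);
        this.includeGender = Boolean.parseBoolean(includeGender);
    }

    // Stub standing in for the real DB lookup.
    private Map<String, String> lookup(String userId) {
        Map<String, String> info = new HashMap<>();
        info.put("age", "34");
        info.put("gender", "f");
        return info;
    }

    // Simplified stand-in for EvalFunc.exec(Tuple).
    public List<String> exec(String userId) {
        Map<String, String> info = lookup(userId);
        List<String> out = new ArrayList<>();
        out.add(userId);
        if (includeAge) out.add(info.get("age"));
        if (includeGender) out.add(info.get("gender"));
        return out;
    }

    public static void main(String[] args) {
        UserInfoLookup onlyAge = new UserInfoLookup("true", "false");
        System.out.println(onlyAge.exec("u123"));  // [u123, 34]
    }
}
```

In a real UDF the class would extend EvalFunc<Tuple> and exec() would take a Tuple; the constructor-flag idea carries over unchanged.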
pass configuration param to UDF
Hi all, I was reading this: http://pig.apache.org/docs/r0.7.0/udf.html#Passing+Configurations+to+UDFs

It sounds like I can pass some configuration or context to the UDF, but I can't figure out how, even after searching the internet and past discussions quite a bit. In my UDF I can do this:

UDFContext context = UDFContext.getUDFContext();
Properties properties = context.getUDFProperties(this.getClass());

so if the context is set on the front end, supposedly it will be in that Properties object. But how do I set it on the front end, or in whichever other way, to pass it to the UDF? Thanks! Dexin
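No answer is archived here, but the handoff the docs describe can be modeled in miniature. This is a toy stand-in, not the real org.apache.pig.impl.util.UDFContext API: the idea is that the front end writes into a per-UDF-class Properties bag, the context is shipped with the job, and the back end reads the same keys from the bag for its class.

```java
import java.util.*;

public class MiniUdfContext {
    // Toy model of UDFContext: one Properties bag per UDF class,
    // written on the front end and read back on the task side.
    private static final Map<Class<?>, Properties> BAGS = new HashMap<>();

    static Properties getUDFProperties(Class<?> udfClass) {
        return BAGS.computeIfAbsent(udfClass, c -> new Properties());
    }

    // Stand-in for a UDF class.
    static class MyUDF {
        String exec() {
            // back end: read what the front end stored for this class
            return getUDFProperties(MyUDF.class).getProperty("lookup.table");
        }
    }

    public static void main(String[] args) {
        // front end: store a setting under this UDF class's bag
        getUDFProperties(MyUDF.class).setProperty("lookup.table", "users_2011");

        // back end: the same key is visible to exec()
        System.out.println(new MyUDF().exec());  // users_2011
    }
}
```

With the real UDFContext, one common place to write such properties is the UDF's constructor, which runs on the front end during script compilation (as well as on the back end); Pig then serializes the context into the job so the back-end reads see the same values.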