replicated join gets extra job

2013-11-11 Thread Dexin Wang
Hi, I'm running a job like this: raw_large = LOAD 'lots_of_files' AS (...); raw_filtered = FILTER raw_large BY ...; large_table = FOREACH raw_filtered GENERATE f1, f2, f3,; joined_1 = JOIN large_table BY (key1) LEFT, config_table_1 BY (key2) USING 'replicated'; joined_2 = JOIN join1

Re: python version with Jython/Pig

2013-07-18 Thread Dexin Wang
, Cheolsoo On Wed, Jul 17, 2013 at 3:33 PM, Dexin Wang wangde...@gmail.com wrote: When I do Python UDF with Pig, how do we know which version of Python it is using? Is it possible to use a specific version of Python? Specifically my problem is in my UDF, I need to use a function in math

python version with Jython/Pig

2013-07-17 Thread Dexin Wang
When I do Python UDF with Pig, how do we know which version of Python it is using? Is it possible to use a specific version of Python? Specifically my problem is in my UDF, I need to use a function in math module math.erf() which is newly introduced in Python version 2.7. I have Python 2.7
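
A minimal sketch related to the question above, assuming the UDF runs under Jython (which Pig uses for Python UDFs and which typically lags CPython, so `math.erf` may be missing). The helper names `python_version` and `erf_approx` are hypothetical and not from the thread; the fallback uses the Abramowitz and Stegun approximation 7.1.26 rather than the library routine.

```python
import math
import sys


def python_version():
    """Return the version string of the interpreter actually running the UDF."""
    return sys.version


def erf_approx(x):
    """Approximate erf(x) with Abramowitz-Stegun formula 7.1.26.

    The absolute error of this formula is roughly 1.5e-7.
    """
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    t = 1.0 / (1.0 + 0.3275911 * x)
    # Horner evaluation of the degree-5 polynomial in t.
    poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
             - 0.284496736) * t + 0.254829592) * t
    return sign * (1.0 - poly * math.exp(-x * x))


# Use the built-in when the interpreter provides it, otherwise the approximation.
erf = getattr(math, "erf", erf_approx)
```

Returning `python_version()` from a throwaway UDF is one way to see which interpreter version Pig is really using.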

Re: reference tuple field by name in UDF

2013-02-01 Thread Dexin Wang
...@gmail.com wrote: Another way to do it would be to make a helper function that does the following: input.get(getInputSchema().getPosition(alias)); Only available in 0.10 and later (I think getInputSchema is in 0.10, at least...may only be in 0.11) 2013/1/15 Dexin Wang wangde...@gmail.com Hi

reference tuple field by name in UDF

2013-01-15 Thread Dexin Wang
Hi, In my own UDF, is referencing a field by index the only way to access a field? The fields are all named and typed before being passed into the UDF, but it looks like I can only do something like this: String v1 = (String)input.get(0); String v2 = (String)input.get(1); String v3 =

Re: Passing a BAG to Pig UDF constructor?

2012-06-27 Thread Dexin Wang
for it to be used as a scalar What is the right way of doing this? Thanks. On Wed, Jun 27, 2012 at 10:30 AM, Dexin Wang wangde...@gmail.com wrote: That's a good idea (to pass the bag to UDF and initialize it on first UDF invocation). Thanks. Why do you think it is expensive Mridul

Passing a BAG to Pig UDF constructor?

2012-06-26 Thread Dexin Wang
Is it possible to pass a bag to a Pig UDF constructor? Basically in the constructor I want to initialize some hash map so that on every exec operation, I can use the hashmap to do a lookup and find the value I need, and apply some algorithm to it. I realize I could just do a replicated join to

Re: help!!--Does Pig can be use in this way?!

2012-04-04 Thread Dexin Wang
Or if it's simple like that, why not just grep? On Wed, Apr 4, 2012 at 7:07 AM, Corbin Hobus cor...@tynt.com wrote: If you are just finding the age of one person you are much better off using a regular database and SQL, or HBase if you need some kind of quick random access. Hadoop/pig is for

Re: filter out null lines returned by UDF

2012-03-07 Thread Dexin Wang
return an empty bag and let the flatten wipe it out. 2012/3/1 Dexin Wang wangde...@gmail.com Hi, I have a UDF that parses a line and then returns a bag, and sometimes the line is bad so I'm returning null in the UDF. In my pig script, I'd like to filter those nulls like this: raw
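
The suggestion above (return an empty bag instead of null so that FLATTEN drops the row) can be sketched as a Python UDF body. This is only an illustration: the `parse_line` name and the comma-separated `key=value` line format are hypothetical, the original UDF may well have been written in Java, and under Pig the function would additionally need an output schema declaration and registration as a Jython script.

```python
def parse_line(line):
    """Hypothetical parser for lines like 'a=1, b=2'.

    Returns a list of (key, value) tuples, which Pig treats as a bag.
    A bad line yields an empty list (an empty bag) rather than None,
    so FLATTEN produces no output rows for it.
    """
    if line is None or "=" not in line:
        return []  # empty bag for missing or obviously malformed input
    pairs = []
    for part in line.split(","):
        key, sep, value = part.partition("=")
        if not sep:
            return []  # any malformed field marks the whole line as bad
        pairs.append((key.strip(), value.strip()))
    return pairs
```

With this shape, no separate FILTER step for nulls is needed after the FLATTEN.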

filter out null lines returned by UDF

2012-03-01 Thread Dexin Wang
Hi, I have a UDF that parses a line and then returns a bag, and sometimes the line is bad so I'm returning null in the UDF. In my pig script, I'd like to filter those nulls like this: raw = LOAD 'raw_input' AS (line:chararray); parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line)); -- get two

Re: pig 0.9 slower in local mode?

2011-12-21 Thread Dexin Wang
, but I'm afraid hadoop-based local mode will never be quite as fast as the old local-mode... D On Mon, Dec 19, 2011 at 2:23 PM, Dexin Wang wangde...@gmail.com wrote: I recently switched to pig 0.9.1 and noticed it runs slower than previous version (like 0.6 which was only recent version

pig 0.9 slower in local mode?

2011-12-19 Thread Dexin Wang
I recently switched to pig 0.9.1 and noticed it runs slower than previous version (like 0.6 which was only recent version supported on Amazon couple of months ago) in local mode. Haven't tried the timing in hadoop mode yet. I figure it is probably due to some extra debugging or some parameter.

Re: multiple folder loading or passing comma on parameter with Amazon Pig

2011-08-18 Thread Dexin Wang
` with the back ticks, not the single quotes. On Wed, Aug 17, 2011 at 6:18 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Nice job figuring out a fix! You should seriously file a bug with AMR for that. That's kind of ridiculous. D On Wed, Aug 17, 2011 at 6:03 PM, Dexin Wang wangde...@gmail.com wrote: I

conditional and multiple generate inside foreach?

2011-07-22 Thread Dexin Wang
Possible to do conditional and more than one generate inside a foreach? for example, I have tuples like this (names, days_ago) (a,0) (b,1) (c,9) (d,40) b shows up 1 day ago, so it belongs to all of the following: yesterday, last week, last month, and last quarter. So I'd like to turn the above
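
One possible way to get several output rows per input tuple, sketched here as a Python UDF that returns a bag of period labels to be flattened. The function name `periods_for`, the label spellings, and the day cutoffs (1, 7, 30, 90) are all assumptions; the message does not state exact boundaries or how `days_ago = 0` should be treated.

```python
# Hypothetical cutoffs in days; the thread does not give exact boundaries.
PERIODS = [
    ("yesterday", 1),
    ("last_week", 7),
    ("last_month", 30),
    ("last_quarter", 90),
]


def periods_for(days_ago):
    """Return a bag (list of 1-tuples) with every period that covers days_ago."""
    return [(label,) for label, limit in PERIODS if days_ago <= limit]
```

In a Pig script this would be used along the lines of `FOREACH data GENERATE name, FLATTEN(periods_for(days_ago));`, so that `(b,1)` expands to four rows and `(d,40)` to one.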

Re: why the udf can not work

2011-06-18 Thread Dexin Wang
You need to have your class file in this path /home/huyong/test/myudfs/UPPER.class since it's in myudfs directory. On Jun 18, 2011, at 12:33 PM, 勇胡 yongyong...@gmail.com wrote: I tried your command and then it shows me as following: /home/huyong/test/UPPER.class

Re: pig script takes much longer than java MR job

2011-06-17 Thread Dexin Wang
Yeah, that sounds like a lot to dump if it takes 15 minutes to run. That alone can take a long time. I once forgot to comment out a debug line in my UDF. When run with production data, not only was it slow, it blew up the cluster: it simply ran out of log space :) On Jun 17, 2011, at 5:06 PM,

Re: running pig on amazon ec2

2011-06-15 Thread Dexin Wang
heartbeat and make sure your jar is as small as you can get it (there's a lot of unjarring going on in Hadoop) D On Wed, Jun 15, 2011 at 11:14 AM, Dexin Wang wangde...@gmail.com wrote: Tomas, What worked well for me is still to be figured out. Right now, it works but it's too slow. I think one

Re: running pig on amazon ec2

2011-06-14 Thread Dexin Wang
a bit but the fact that running on my laptop is faster tells me this is a separate issue. Thanks! On 06/13/2011 11:54 AM, Dexin Wang wrote: Hi, This is probably not directly a Pig question. Anyone running Pig on amazon EC2 instances? Something's not making sense to me. I ran a Pig script

running pig on amazon ec2

2011-06-13 Thread Dexin Wang
Hi, This is probably not directly a Pig question. Anyone running Pig on amazon EC2 instances? Something's not making sense to me. I ran a Pig script that has about 10 mapred jobs in it on a 16 node cluster using m1.small. It took *13 minutes*. The job reads input from S3 and writes output to S3.

Re: Setting the store file name with date

2011-05-23 Thread Dexin Wang
to rename the results of each job sequentially because my jobs can repeat many times, but their results are different. Thanks again. Renato M. 2011/5/20 Dexin Wang wangde...@gmail.com: Yeah I do that all the time. STORE result INTO 'out-$date'; Or you could run the pig script then after

Re: Setting the store file name with date

2011-05-20 Thread Dexin Wang
Yeah I do that all the time. STORE result INTO 'out-$date'; Or you could run the pig script then after it's done move the result aside. On May 20, 2011, at 6:51 PM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Hi, I have a sequence of jobs which are run daily and usually

elephantbird JsonLoader doesn't like gz?

2011-05-18 Thread Dexin Wang
Hi, Anyone using Twitter's elephantbird library? I was using its JsonLoader and got this error: WARN com.twitter.elephantbird.pig.load.JsonLoader - Could not json-decode string Unexpected character () at position 0. at org.json.simple.parser.Yylex.yylex(Unknown Source) at

Re: elephantbird JsonLoader doesn't like gz?

2011-05-18 Thread Dexin Wang
Or is it because I'm using Pig 0.6, where gz format is not supported? I'll run this on AWS EMR, where only Pig 0.6 is supported. Do I have to use a later version of Pig? On Wed, May 18, 2011 at 11:12 AM, Dexin Wang wangde...@gmail.com wrote: Hi, Anyone using Twitter's elephantbird library? I

Re: elephantbird JsonLoader doesn't like gz?

2011-05-18 Thread Dexin Wang
...@gmail.com wrote: Which version of EB are you using? I recently fixed this for someone, I believe it's been in every version since 1.2.3 D On Wed, May 18, 2011 at 11:26 AM, Dexin Wang wangde...@gmail.com wrote: Or is it because I'm using Pig 0.6 where gz format is not supported? I'll run

Re: reducer throttling?

2011-03-24 Thread Dexin Wang
questions. Alex On Thu, Mar 17, 2011 at 5:00 PM, Dexin Wang wangde...@gmail.com wrote: Can you describe a bit more about your bulk insert technique? And the way you control the number of reducers is also by adding artificial ORDER or GROUP step? Thanks! On Thu, Mar 17, 2011 at 1:33 PM

possibly Pig throttles the number of mappers

2011-03-23 Thread Dexin Wang
Hi, We've seen a strange problem where some Pig jobs would just run fewer mappers concurrently than the mapper capacity. Specifically we have a 10 node cluster and each is configured to have 12 mappers. Normally we have 120 mappers running. But for some Pig jobs it will only have 10 mappers

reducer throttling?

2011-03-17 Thread Dexin Wang
We do some processing in Hadoop and then, as the last step, write the result to a database. The database is not good at handling hundreds of concurrent connections and fast writes, so we need to throttle down the number of tasks that write to the DB. Since we have no control over the number of mappers, we add

Re: Pig optimization getting in the way?

2011-02-22 Thread Dexin Wang
connection a member of the store function/ record writer? You can also use -no_multiquery to prevent multi-query optimization from happening, but that will also result in the MR job being executed again for other output. Thanks, Thejas On 2/18/11 4:48 PM, Dexin Wang wangde...@gmail.com

Pig optimization getting in the way?

2011-02-18 Thread Dexin Wang
I ran into a problem that I have spent quite some time on, and I'm starting to think Pig is probably doing some optimization that makes this hard. This is my pseudocode: raw = LOAD ... then some crazy stuff like filter, join, group, UDF, etc. A = the result from the above operations STORE A INTO

Re: Pig optimization getting in the way?

2011-02-18 Thread Dexin Wang
wrote: Let me guess -- you have a static JDBC connection that you open in myJDBC, and you have jvm reuse turned on. On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang wangde...@gmail.com wrote: I ran into a problem that I have spent quite some time on and start to think it's probably pig's

Re: Use Filename in Tuple

2011-02-03 Thread Dexin Wang
Similarly, is it possible to insert some literal values to a tuple stream? For example, when I invoke my Pig script, I already know what data source is (say, it's from filename_2011-02-03), so I can just pass it to Pig using -param, and I want to insert this known file name to the tuple stream.

Re: Use Filename in Tuple

2011-02-03 Thread Dexin Wang
, Feb 3, 2011 at 8:32 PM, Dexin Wang wangde...@gmail.com wrote: Similarly, is it possible to insert some literal values to a tuple stream? For example, when I invoke my Pig script, I already know what data source is (say, it's from filename_2011-02-03), so I can just pass it to Pig using

Re: failed to produce result

2011-01-31 Thread Dexin Wang
* -Thejas On 1/31/11 1:54 PM, Dexin Wang wangde...@gmail.com wrote: Hi, I found similar problems on the web but didn't find a solution for it so I'm asking here. I have some pig job that has been working fine for couple of months and it started failing. But the same job still works if run

wild card for all fields in a tuple

2011-01-12 Thread Dexin Wang
Hi, Hope there is some simple answer to this. I have bunch of rows, for each row, I want to add a column which is derived from some existing columns. And I have large number of columns in my input tuple so I don't want to repeat the name using AS when I generate. Is there an easy way just to

Re: wild card for all fields in a tuple

2011-01-12 Thread Dexin Wang
fields in between two fields, which you can't do yet. Alan. On Jan 12, 2011, at 3:18 PM, Alan Gates wrote: There isn't a way to do that yet. See https://issues.apache.org/jira/browse/PIG-1693 for our plans on adding it in the next release. Alan. On Jan 12, 2011, at 2:51 PM, Dexin

how to use builtin String functions

2011-01-12 Thread Dexin Wang
I see there are some builtin string functions, but I don't know how to use them. I got this error when I followed the examples: grunt> REGEX_EXTRACT_ALL('192.168.1.5:8020', '(.*)\:(.*)'); 2011-01-12 19:34:23,773 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing.

Re: set reducer timeout with pig

2010-12-22 Thread Dexin Wang
need not to do that. Pig automatically takes care of progress reporting in its operator. Do you have a pig script which fails because of reporting progress timeout issues ? Ashutosh On Tue, Dec 21, 2010 at 13:23, Dexin Wang wangde...@gmail.com wrote: Hi, How do I change the default timeout

increment counters in Pig UDF

2010-12-15 Thread Dexin Wang
Is it possible to increment a counter in Pig UDF (in either Load/Eval/Store Func). Since we have access to counters using the org.apache.hadoop.mapred.Reporter: http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Counters the other way to ask this question is how do we get an

Eval UDF passing parameters

2010-12-07 Thread Dexin Wang
Hi, This might be a dumb question. Is it possible to pass anything other than the input tuple to a UDF Eval function? Basically in my UDF, I need to do some user info lookup. So the input will be: (userid,f1,f2) with this UDF, I want to convert it to something like

Re: Eval UDF passing parameters

2010-12-07 Thread Dexin Wang
: define MY_UDF_ONLY_AGE com.package.MyUDF(true, false) and use it like: data_with_age = FOREACH data GENERATE user_id, MY_UDF_ONLY_AGE(user_id); HTH, Zach On Tuesday, December 7, 2010 at 2:44 PM, Dexin Wang wrote: Hi, This might be a dumb question. Is it possible to pass anything

pass configuration param to UDF

2010-11-23 Thread Dexin Wang
Hi all, I was reading this: http://pig.apache.org/docs/r0.7.0/udf.html#Passing+Configurations+to+UDFs It sounded like I can pass some configuration or context to the UDF but I can't figure out how I would do that after I searched quite a bit on internet and past discussion. In my UDF, I can