Hi,
I'm running a job like this:
raw_large = LOAD 'lots_of_files' AS (...);
raw_filtered = FILTER raw_large BY ...;
large_table = FOREACH raw_filtered GENERATE f1, f2, f3;
joined_1 = JOIN large_table BY (key1) LEFT, config_table_1 BY (key2) USING 'replicated';
joined_2 = JOIN joined_1
Cheolsoo
On Wed, Jul 17, 2013 at 3:33 PM, Dexin Wang wangde...@gmail.com wrote:
When I do Python UDF with Pig, how do we know which version of Python it is
using? Is it possible to use a specific version of Python?
Specifically my problem is in my UDF, I need to use a function in math
module math.erf() which is newly introduced in Python version 2.7. I have
Python 2.7
...@gmail.com wrote:
Another way to do it would be to make a helper function that does the
following:
input.get(getInputSchema().getPosition(alias));
Only available in 0.10 and later (I think getInputSchema is in 0.10, at
least...may only be in 0.11)
2013/1/15 Dexin Wang wangde...@gmail.com
Hi,
In my own UDF, is referencing a field by index the only way to access a field?
The fields are all named and typed before being passed into the UDF, but it
looks like I can only do something like this:
String v1 = (String)input.get(0);
String v2 = (String)input.get(1);
String v3 =
for it to be used as a scalar
What is the right way of doing this? Thanks.
On Wed, Jun 27, 2012 at 10:30 AM, Dexin Wang wangde...@gmail.com wrote:
That's a good idea (to pass the bag to UDF and initialize it on first UDF
invocation). Thanks.
Why do you think it is expensive, Mridul?
Is it possible to pass a bag to a Pig UDF constructor?
Basically in the constructor I want to initialize some hash map so that on
every exec operation, I can use the hashmap to do a lookup and find the
value I need, and apply some algorithm to it.
I realize I could just do a replicated join to
Or if it's simple like that, why not just grep?
On Wed, Apr 4, 2012 at 7:07 AM, Corbin Hobus cor...@tynt.com wrote:
If you are just finding the age of one person, you are much better off
using a regular database and SQL, or HBase if you need some kind of quick
random access.
Hadoop/Pig is for
return an empty bag and let the flatten wipe it out.
2012/3/1 Dexin Wang wangde...@gmail.com
Hi,
I have a UDF that parses a line and then returns a bag, and sometimes the
line is bad, so I'm returning null from the UDF. In my Pig script, I'd like to
filter those nulls like this:
filter those nulls like this:
raw = LOAD 'raw_input' AS (line:chararray);
parsed = FOREACH raw GENERATE FLATTEN(MyUDF(line)); -- get two
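As the reply upthread suggests, another option is to have the UDF return an empty bag instead of null, so FLATTEN wipes those rows out on its own. If the UDF keeps returning null, the nulls can be filtered before flattening; a sketch (the alias `b` and intermediate relation names are invented for illustration):

```pig
raw    = LOAD 'raw_input' AS (line:chararray);
bags   = FOREACH raw GENERATE MyUDF(line) AS b;            -- keep the bag intact first
good   = FILTER bags BY b IS NOT NULL AND NOT IsEmpty(b);  -- drop the bad lines
parsed = FOREACH good GENERATE FLATTEN(b);                 -- now flatten safely
```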
, but I'm afraid
hadoop-based local mode will never be quite as fast as the old
local-mode...
D
On Mon, Dec 19, 2011 at 2:23 PM, Dexin Wang wangde...@gmail.com wrote:
I recently switched to Pig 0.9.1 and noticed it runs slower than a previous
version (like 0.6, which was the only recent version supported on Amazon a
couple of months ago) in local mode. Haven't tried the timing in Hadoop mode yet.
I figure it is probably due to some extra debugging or some parameter.
` with the back ticks, not the single
quotes.
On Wed, Aug 17, 2011 at 6:18 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:
Nice job figuring out a fix!
You should seriously file a bug with AMR for that. That's kind of
ridiculous.
D
On Wed, Aug 17, 2011 at 6:03 PM, Dexin Wang wangde...@gmail.com wrote:
I
Is it possible to do a conditional and more than one GENERATE inside a FOREACH?
for example, I have tuples like this (names, days_ago)
(a,0)
(b,1)
(c,9)
(d,40)
b shows up 1 day ago, so it belongs to all of the following: yesterday, last
week, last month, and last quarter. So I'd like to turn the above
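One way to expand each row into all the buckets it belongs to, without a UDF, is a FILTER per bucket plus a UNION. A sketch; the day thresholds (1/7/30/90) and relation names are assumptions for illustration:

```pig
events     = LOAD 'events' AS (name:chararray, days_ago:int);
yesterday  = FOREACH (FILTER events BY days_ago <= 1)  GENERATE name, 'yesterday'    AS bucket;
last_week  = FOREACH (FILTER events BY days_ago <= 7)  GENERATE name, 'last_week'    AS bucket;
last_month = FOREACH (FILTER events BY days_ago <= 30) GENERATE name, 'last_month'   AS bucket;
last_qtr   = FOREACH (FILTER events BY days_ago <= 90) GENERATE name, 'last_quarter' AS bucket;
buckets    = UNION yesterday, last_week, last_month, last_qtr;
-- (a,0) lands in all four buckets; (d,40) only in last_quarter
```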
You need to have your class file in this path
/home/huyong/test/myudfs/UPPER.class
since it's in myudfs directory.
On Jun 18, 2011, at 12:33 PM, 勇胡 yongyong...@gmail.com wrote:
I tried your command and then it shows me as following:
/home/huyong/test/UPPER.class
Yeah, sounds like a lot to dump if it takes 15 minutes to run. That alone can
take a long time.
I once forgot to comment out a debug line in my UDF. When run with
production data, not only was it slow, it blew up the cluster - it simply ran
out of log space :)
On Jun 17, 2011, at 5:06 PM,
heartbeat and make sure your jar is as small
as you can get it (there's a lot of unjarring going on in Hadoop)
D
On Wed, Jun 15, 2011 at 11:14 AM, Dexin Wang wangde...@gmail.com wrote:
Tomas,
What works well for me is still to be figured out. Right now it works, but
it's too slow. I think one
a bit but the fact that
running on my laptop is faster tells me this is a separate issue.
Thanks!
On 06/13/2011 11:54 AM, Dexin Wang wrote:
Hi,
This is probably not directly a Pig question.
Anyone running Pig on amazon EC2 instances? Something's not making sense to
me. I ran a Pig script that has about 10 mapred jobs in it on a 16 node
cluster using m1.small. It took *13 minutes*. The job reads input from S3
and writes output to S3.
to rename the results of each job sequentially because my jobs can
repeat many times, but their results are different.
Thanks again.
Renato M.
2011/5/20 Dexin Wang wangde...@gmail.com:
Yeah I do that all the time.
STORE result INTO 'out-$date';
Or you could run the pig script then after it's done move the result aside.
On May 20, 2011, at 6:51 PM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote:
Hi, I have a sequence of jobs which are run daily and usually
Hi,
Anyone using Twitter's elephantbird library? I was using its JsonLoader and
got this error:
WARN com.twitter.elephantbird.pig.load.JsonLoader - Could not json-decode
string
Unexpected character () at position 0.
at org.json.simple.parser.Yylex.yylex(Unknown Source)
at
Or is it because I'm using Pig 0.6, where gz format is not supported? I'll
run this on AWS EMR, where only Pig 0.6 is supported. Do I have to use a later
version of Pig?
On Wed, May 18, 2011 at 11:12 AM, Dexin Wang wangde...@gmail.com wrote:
...@gmail.com wrote:
Which version of EB are you using? I recently fixed this for someone,
I believe it's been in every version since 1.2.3
D
On Wed, May 18, 2011 at 11:26 AM, Dexin Wang wangde...@gmail.com wrote:
questions.
Alex
On Thu, Mar 17, 2011 at 5:00 PM, Dexin Wang wangde...@gmail.com wrote:
Can you describe a bit more about your bulk insert technique? And is the way
you control the number of reducers also by adding an artificial ORDER or
GROUP step?
Thanks!
On Thu, Mar 17, 2011 at 1:33 PM
Hi,
We've seen a strange problem where some Pig jobs would just run fewer
mappers concurrently than the mapper capacity. Specifically we have a 10
node cluster and each is configured to have 12 mappers. Normally we have 120
mappers running. But for some Pig jobs it will only have 10 mappers
We do some processing in Hadoop, then as the last step we write the result
to a database. The database is not good at handling hundreds of concurrent
connections and fast writes, so we need to throttle down the number of tasks
that write to the DB. Since we have no control over the number of mappers, we add
connection a member of the store function/record writer?
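The artificial-reduce trick discussed in this thread can be sketched like this: force the final stage through a GROUP with a small PARALLEL, so only that many reduce tasks (and hence DB connections) exist at once. `MyDbStorage`, the key, and the parallelism of 4 are placeholders:

```pig
-- Cap the writer tasks at 4 by routing the final stage through a reduce.
grouped  = GROUP results BY db_key PARALLEL 4;
to_store = FOREACH grouped GENERATE FLATTEN(results);
STORE to_store INTO 'db_out' USING MyDbStorage();
```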
You can also use -no_multiquery to prevent multi-query optimization from
happening, but that will also result in the MR job being executed again for
other output.
Thanks,
Thejas
On 2/18/11 4:48 PM, Dexin Wang wangde...@gmail.com
I ran into a problem that I have spent quite some time on, and I'm starting to
think Pig is doing some optimization that makes this hard.
This is my pseudo code:
raw = LOAD ...
then some crazy stuff like
filter
join
group
UDF
etc
A = the result from above operation
STORE A INTO
wrote:
Let me guess -- you have a static JDBC connection that you open in myJDBC,
and you have jvm reuse turned on.
On Fri, Feb 18, 2011 at 1:41 PM, Dexin Wang wangde...@gmail.com wrote:
Similarly, is it possible to insert some literal values to a tuple stream?
For example, when I invoke my Pig script, I already know what data source is
(say, it's from filename_2011-02-03), so I can just pass it to Pig using
-param, and I want to insert this known file name to the tuple stream.
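Passing the known value with -param and projecting it as a constant does exactly this; a sketch, with the $SOURCE parameter name and the schema invented for illustration:

```pig
-- invoked as: pig -param SOURCE=filename_2011-02-03 script.pig
raw    = LOAD 'input' AS (f1:chararray, f2:int);
tagged = FOREACH raw GENERATE '$SOURCE' AS source, f1, f2;
```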
On Feb 3, 2011 at 8:32 PM, Dexin Wang wangde...@gmail.com wrote:
-Thejas
On 1/31/11 1:54 PM, Dexin Wang wangde...@gmail.com wrote:
Hi,
I found similar problems on the web but didn't find a solution for it so
I'm
asking here.
I have a Pig job that has been working fine for a couple of months, and it
started failing. But the same job still works if run
Hi,
Hope there is a simple answer to this. I have a bunch of rows, and for each
row I want to add a column derived from some existing columns. I have a large
number of columns in my input tuple, so I don't want to repeat the names
using AS when I generate. Is there an easy way just to
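The star projection keeps every existing column without renaming anything, so only the new derived column needs an AS; a sketch, with a small invented schema standing in for the wide tuple:

```pig
rows    = LOAD 'input' AS (a:int, b:int, c:chararray);
widened = FOREACH rows GENERATE *, a + b AS derived;  -- '*' carries all fields through unchanged
```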
fields in between two fields, which you can't do yet.
Alan.
On Jan 12, 2011, at 3:18 PM, Alan Gates wrote:
There isn't a way to do that yet. See
https://issues.apache.org/jira/browse/PIG-1693
for our plans on adding it in the next release.
Alan.
On Jan 12, 2011, at 2:51 PM, Dexin
I see there are some builtin string functions, but I don't know how to use
them. I got this error when I follow the examples:
grunt> REGEX_EXTRACT_ALL('192.168.1.5:8020', '(.*)\:(.*)');
2011-01-12 19:34:23,773 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000: Error during parsing.
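The parse error is because the builtin string functions are eval functions: they apply to fields inside a FOREACH ... GENERATE (or a FILTER), rather than being invoked bare at the grunt prompt. A sketch, with the relation and file name invented:

```pig
addrs = LOAD 'addrs.txt' AS (addr:chararray);   -- lines like 192.168.1.5:8020
parts = FOREACH addrs GENERATE REGEX_EXTRACT_ALL(addr, '(.*):(.*)');
-- REGEX_EXTRACT_ALL yields a tuple of the captured groups, e.g. (192.168.1.5,8020)
```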
You need not do that. Pig automatically takes care of progress reporting in
its operators. Do you have a Pig script which fails because of a
progress-reporting timeout?
Ashutosh
On Tue, Dec 21, 2010 at 13:23, Dexin Wang wangde...@gmail.com wrote:
Hi,
How do I change the default timeout
Is it possible to increment a counter in a Pig UDF (in either a
Load/Eval/Store Func)?
Since we have access to counters using the
org.apache.hadoop.mapred.Reporter:
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Counters
the other way to ask this question is how do we get an
Hi,
This might be a dumb question. Is it possible to pass anything other than
the input tuple to a UDF Eval function?
Basically in my UDF, I need to do some user info lookup. So the input will
be:
(userid,f1,f2)
with this UDF, I want to convert it to something like
:
define MY_UDF_ONLY_AGE com.package.MyUDF(true, false)
and use it like:
data_with_age = FOREACH data GENERATE user_id, MY_UDF_ONLY_AGE(user_id);
HTH,
Zach
On Tuesday, December 7, 2010 at 2:44 PM, Dexin Wang wrote:
Hi,
This might be a dumb question. Is it possible to pass anything
Hi all,
I was reading this:
http://pig.apache.org/docs/r0.7.0/udf.html#Passing+Configurations+to+UDFs
It sounded like I can pass some configuration or context to the UDF but I
can't figure out how I would do that after I searched quite a bit on
internet and past discussion.
In my UDF, I can