Hi Sarath,
I had a similar stack trace (see below). The problem was intermittent and only
happened when I was running on large data sets (100s of GB). I resolved the
problem by changing a GROUP BY key from a tuple of (chararray, chararray, long)
to a chararray. To do that I made the group key a
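For reference, the change looked roughly like this (the relation and field
names here are invented, not the real ones):
-- before: grouped = group recs by (f1, f2, f3);
-- after, with the key collapsed to one chararray (CONCAT takes two
-- arguments at a time, hence the nesting):
keyed = foreach recs generate CONCAT(CONCAT(f1, f2), (chararray)f3) as gkey, val;
grouped = group keyed by gkey;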
Hi Pig community,
I am running a pig process using a python UDF, and getting a failure that is
hard to debug. The relevant parts of the script are:
REGISTER [...]clustercentroid_udfs.py using jython as UDFS;
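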
[... definition of cluster_vals ...]
grouped = group cluster_vals by (clusters::clust
The way to debug this is to try to dump the relations named by the aliases
leading up to the failure. Try to dump rec_count. Try to dump group_data.
... You will find one that works, which will also show the first alias
definition that causes the failure.
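For example (limiting first keeps the dump manageable):
probe = limit rec_count 10;
dump probe;
describe rec_count;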
I guess that the problem is in your
Dear Pig users,
Can Pig combine sorting and unique-ing into a single job? Doing this
--define Components, then
Sorted_0 = order Components by block_id parallel $par;
Sorted = DISTINCT Sorted_0;
causes one more MR job to be launched than simply doing this:
--define Components, then
Sorted = order
This was hard for me to get when I started using pig, and it still annoys me
after 1.5 years' experience with pig. In mathematics and logic, quantifiers
(like "for each", "there exist") bind variables that occur in their scope:
(for each x)(there exists y) [y > x]
The (for each x) binds x in (th
As far as I can tell, the python UDF I proposed is working fine. pig passes a
bag to python as a list of tuples, and the implementation of random.sample
indexes into that list rather than iterating over it.
I suppose if the bag were very large then this would not work, or would
consume too much memory as the argument to
Thanks Mehmet! I tried that and it seems to work on a small test case. I'm also
experimenting now with your other suggestion, a UDF.
I will probably use something like this, which seems less tricky and does not
rely on a sort:
#!/usr/bin/python
import random
@outputSchema('id_bag: {items: (item
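Fleshed out, the script might look like the sketch below; the rest of the
schema and the 1/3 sampling fraction are my guesses, not tested code:
#!/usr/bin/python
import random
@outputSchema('id_bag: {items: (item:chararray, weight:double)}')
def sample_bag(bag):
    # Pig hands the inner bag to jython as a list of tuples, so
    # random.sample can pick elements by index without a full scan.
    if bag is None:
        return None
    k = max(1, len(bag) // 3)  # roughly 1/3 of the elements
    return random.sample(bag, k)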
Hi Pig users,
Is there an easy/efficient way to sample an inner bag? For example, with input
in a relation like
(id1,att1,{(a,0.01),(b,0.02),(x,0.999749968742)})
(id1,att2,{(a,0.03),(b,0.04),(x,0.998749217772)})
(id2,att1,{(b,0.05),(c,0.06),(x,0.996945334509)})
I’d like to sample 1/3 the elemen
Here is how to use rank and join for this problem:
sh cat xxx
1,2,3,4,5
1,2,4,5,7
1,5,7,8,9
sh cat yyy
10,11
10,12
10,13
a= load 'xxx' using PigStorage(',');
b= load 'yyy' using PigStorage(',');
a2 = rank a;
b2 = rank b;
c = join a2 by $0, b2 by $0;
c2 = order c by $6;
c3 = foreach c2 generat
You said "The .py code takes input from sys.stdin and outputs to sys.stdout" so
I infer you are talking about streaming, not a python UDF. In that case, rather
than streaming through your python script P.py, instead stream through a shell
script S.sh. The shell script can untar shipped or cached
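On the pig side the wiring might look like this (the script and archive
names are hypothetical):
-- S.sh would untar the shipped archive and then exec P.py, passing
-- stdin/stdout through untouched.
define my_stream `S.sh` ship('S.sh', 'env.tar.gz');
processed = stream raw through my_stream as (line:chararray);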
http://pig.apache.org/docs/r0.12.0/basic.html#order-by says
"Pig currently supports ordering on fields with simple types or by
tuple designator (*). You cannot order on fields with complex types or by
expressions."
I think "you cannot order ... by expressions" means the behavior you see
This is one way to get employee_id and email:
A = load 'xxx.xml' using
org.apache.pig.piggybank.storage.XMLLoader('employee') as (x:chararray);
B = foreach A generate REPLACE(x,'[\\n]','') as x;
C = foreach B generate
REGEX_EXTRACT_ALL(x,'.*(?:<employee_id>)([^<]*).*(?:<email>)([^<]*).*');
dump C;
But it
Your example had newlines in the element. The regular expression .*
does not match newlines. One way to remove newlines is REPLACE(x,'[\\n]','').
If the text ranges you are interested in do not contain newlines, for example
if you are interested in one element but do not care about its relation to
other
Ajay's suggestion will work for elements like the ones in your example,
that occur all on one line. If you want to get a whole element that spans
more than one line, you will not be able to get it with matching
(.*) since that will not match a newline character.
You can remove newline character
Part of the problem might be that the regexp has
(.*)
but you need
(.*)
Using regexps to parse XML is awfully brittle. An alternative is to use a UDF
that calls out to an XML parser. I use ElementTree from python UDFs.
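A minimal sketch of that approach, reusing the employee_id/email example
from earlier in this digest (treat the schema as illustrative only):
import xml.etree.ElementTree as ET
@outputSchema('rec: (employee_id:chararray, email:chararray)')
def parse_employee(xml_str):
    # Parse one 'employee' element as produced by XMLLoader.
    if xml_str is None:
        return None
    root = ET.fromstring(xml_str)
    return (root.findtext('employee_id'), root.findtext('email'))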
Will Dowling
From: Muni mahesh [m
http://www.slideshare.net/Hadoop_Summit/pig-programming-is-fun (Daniel Dai and
Thejas Nair) indicates how to use the nltk library from inside pig. nltk has
methods to compute various string distance functions, including Levenshtein.
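For example, a jython UDF along these lines (assuming nltk is installed and
importable on the task nodes; an untested sketch):
from nltk.metrics.distance import edit_distance
@outputSchema('dist:int')
def levenshtein(a, b):
    # Standard Levenshtein edit distance between two chararrays.
    if a is None or b is None:
        return None
    return edit_distance(a, b)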
William F Dowling
Senior Technologist
Thomson Reuters
-Original Message-
> pig -f $ROOT_DIR/pig0.pig -param inputDatePig=$inputDate -param StartDate
> =$SDate -param EndDate=$EDate
What happens if you take out the space in
StartDate =$SDate
so that the command is
pig -f $ROOT_DIR/pig0.pig -param inputDatePig=$inputDate -param
StartDate=$SDate -param EndDat
Thanks Johnny for your reply. Working backwards: I am using MRv1. I did try
the logging suggestion you made, but did not get any other info.
So I did it the old-fashioned way with code bisection, commenting out swaths of
my script to localize the error. It turned out it was this statement:
%d
Dear pig users,
What does it mean when pig [Cloudera Pig version 0.10.0-cdh4.1.2] reports
2013-03-25 14:46:31,186 [main] INFO org.apache.pig.Main - Logging error
messages to: /proj/ac/acComponents/blocker/pig_1364237191181.log
but that file is not created? I think there are errors in my pig sc
Maybe use FILTER(): http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#FILTER
William F Dowling
Senior Technologist
Thomson Reuters
-Original Message-
From: yogesh.kuma...@wipro.com [mailto:yogesh.kuma...@wipro.com]
Sent: Wednesday, September 19, 2012 4:26 PM
To: user@pig.apache.org
Su
I just use XMLLoader to break the input xml into records, then stream that
through an xml parser to pull out what I need into the fields of a relation for
subsequent pig processing. Like
-- The analyze_src_recs.py script reads input xml from stdin, and writes to
-- stdout for each relevant p
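The wiring looks roughly like this (it mirrors the fuller define that
appears elsewhere in this digest; the paths and output schema are
placeholders):
define analyze_src `analyze_src_recs.py`
input (stdin)
output (stdout USING PigStreaming(','))
ship ('$scriptDir/analyze_src_recs.py');
SrcLines = load '$src_xml'
using org.apache.pig.piggybank.storage.XMLLoader('REC')
as (doc:chararray);
SrcRecs = stream SrcLines through analyze_src as (rec_type:chararray);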
Nested DISTINCT is a killer. See
https://mail-archives.apache.org/mod_mbox/pig-user/201201.mbox/%3ccakne9z5unw03bk2qbyxnxbsevmlcbvy1se8_7tsdmdmedhk...@mail.gmail.com%3E
for a discussion of a simple workaround that worked for me.
William F Dowling
Senior Technologist
Thomson Reuters
-Origi
I'm using org.apache.pig.piggybank.storage.XMLLoader from piggybank and that's
working well for me. I do something like this:
-- The analyze_src_recs.py script reads XML from stdin, and writes to
-- stdout comma-separated lines rec_type,...
--
define analyze_src `analyze_src_recs.py`
i
Thanks Jonathan and Prashant. The immediate cause of the problem I had (failing
without erroring out) was slightly different formatting between the small and
large input sets. Duh.
When I fixed that, I did indeed get OOM due to the nested distinct. I tried the
workaround you suggested Jonathan
I have a small pig script that outputs the top 500 of a simple computed
relation. It works fine on a small data set but fails on a larger (45 GB) data
set. I don’t see errors in the hadoop logs (but I may be looking in the wrong
places). On the large data set the pig log shows
Input(s):
Success
Here is a grunt session showing a comment line being ignored correctly:
grunt> a = load 'foo' as (
>> -- a = load 'bar' as
>> a: int);
grunt>
grunt> describe a;
a: {a: int}
But when I end the comment with an open parenthesis the behavior is different:
the grunt> prompt doesn’t appear until I add an
If
-param input=s3n://foo/bar/baz/*/ blah.pig
is part of a command line, you'd have to add quotes:
-param 'input=s3n://foo/bar/baz/*/' blah.pig
to inhibit your shell from trying to interpret the *.
William F Dowling
Senior Technologist
Thomson Reuters
0 +1 215 823 3853
-Original Message
I do this:
define analyze_unif `analyze_unif_recs.py`
input (stdin)
output (stdout USING PigStreaming(','))
ship ('$scriptDir/analyze_unif_recs.py');
UnifLines = load '$unif_xml'
using org.apache.pig.piggybank.storage.XMLLoader('REC')
as (doc:chararray);
UnifXmlByDocId =
The function ‘rmf’ for forced removal is no longer mentioned in the pig user
docs as of 0.9.0, except as a reserved word. It was documented in 0.8.1. I
wonder -- is this function deprecated? Will it go away in a future version of
pig? I have found rmf useful since I can’t find a w
You could use two rounds of the outer join/filter by null idiom. For example
after the first round you would get allTermsMinusNonNumbers like this:
grunt> sh cat allTerms
aa
bb
cc
11
22
33
grunt> sh cat nonNumbers
cc
grunt> allTerms = load 'allTerms' as (term:chararray);
grunt> nonNumbers = load
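Assuming nonNumbers is loaded as (nn:chararray), the first round finishes
like this (a sketch):
grunt> joined = join allTerms by term left outer, nonNumbers by nn;
grunt> unmatched = filter joined by nonNumbers::nn is null;
grunt> allTermsMinusNonNumbers = foreach unmatched generate allTerms::term as term;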
Thank you Thejas! Turning off the combiner let the job go to completion. Next
I can try the two-level approach to see what the performance penalty was. Kind
regards,
Will
William F Dowling
Sr Technical Specialist, Software Engineering
Thomson Reuters
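P.S. For anyone finding this in the archives: assuming the property Thejas
pointed to is pig.exec.nocombiner, the switch is a one-liner at the top of
the script:
set pig.exec.nocombiner 'true';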
-Original Message-
From: Thejas
I have a pig script that is working well for small test data sets but fails on
a run over realistic-sized data. Logs show
INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- job job_201106061024_0331 has failed!
…
job_201106061024_0331 CitedItemsGrpByDo
I do that kind of streaming on hdfs files using Hadoop streaming, outside of
pig. I assume you could do it from inside pig too, but haven’t tested.
William F Dowling
Sr Technical Specialist, Software Engineering
Thomson Reuters
0 +1 215 823 3853
From: Moore, Michael A. [mailto:michael.m
Can you stream it through
grep -v '^#'
?
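In pig that would be something like this (a sketch; the alias and file names
are made up):
define strip_comments `grep -v '^#'`;
lines = load 'yourfile' as (line:chararray);
nocomments = stream lines through strip_comments as (line:chararray);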
William F Dowling
Sr Technical Specialist, Software Engineering
Thomson Reuters
0 +1 215 823 3853
From: Moore, Michael A. [mailto:michael.mo...@jhuapl.edu]
Sent: Tuesday, June 07, 2011 3:04 PM
To: user@pig.apache.org
Subject: Loading Files
Thanks Jonathan. I've seen other references to using -D... on the command
line, but I haven't had success with it. I tried
pig -param a=b -Dmapred.job.name=whatever myscript.pig
and the script failed and I got a usage message
Apache Pig version 0.8.1 (r1094835)
compiled Apr 18 2011, 19:26:5
Thanks Eric and Mark.
Now I see that job.name is documented in
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#set (duh). It also says
there "All Pig and Hadoop properties can be set."
Trying to figure out what exactly those properties are (is there a list
someplace?) I looked at my job
When I run a pig job the hadoop job tracker gui (the one on port 50030) shows
‘PigLatin:myscript.pig’ as the name of the job. How can I configure that to
show a different name than the name of the script?
Thanks in advance,
Will
William F Dowling
I saw this somewhere. 'Anti-join' doesn't seem very descriptive to me, but that
is what it was called.
Anti-join (set difference) idiom in pig:
A = load 'input1' as (x, y);
B = load 'input2' as (u, v);
C = cogroup A by x, B by u;
D = filter C by IsEmpty(B);
E = foreach D generate flatten(A);
W
In case anyone comes across this ...
This problem went away when I fixed a define ... ship(...)
to make sure that the file I was shipping was accessible from the running
environment on the non-local
cluster.
William F Dowling
Sr Technical Specialist, Software Engineering
Thomson Reuters
0 +1 2
I have a pig script that is tested and working in local mode. But when I try
to run it in mapreduce mode on a non-local hadoop cluster I get an error with
this stack trace:
ERROR 2999: Unexpected internal error. java.lang.String cannot be cast to
org.apache.pig.data.Tuple
java.lang.ClassCastE
Hi list,
I am using pig 0.8.0, about to try a python UDF. But I can’t get the examples
from
http://pig.apache.org/docs/r0.8.0/udf.html#Python+UDFs
to work, so, presumably I have some setup problem. Using hints from some other
posts on this list, I think my CLASSPATH is OK; jython itself at
Clarification: the error results after
dump AA;
The foreach{...} definition itself is not throwing the error.
William F Dowling
Sr Technical Specialist, Software Engineering
Thomson Reuters
0 +1 215 823 3853
[...]
AA = foreach TCGroupedByFuid {
FA = TCRaw.(NewCitationRel::citingdocid,
I have a relation built by grouping the join (TCRaw) of a pair of basic
relations (SrcFuid and NewCitationRel):
grunt> describe TCGroupedByFuid;
TCGroupedByFuid: {
group: (SrcFuid::citingdocid: int,
SrcFuid::col: chararray,
SrcFuid::seq: int),
TCRaw: {SrcFuid::citingdocid:
Hi Thejas,
Thanks again for your help. When I omit the SrcFuid "qualifier" and use the
form you suggest, I get this error (that was actually the reason I tried
SrcFuid. to start with.)
Pig Stack Trace
---
ERROR 1025: Found more than one match: SrcFuid::citingdocid,
NewCitationRel::
I'm a newbie, so fair warning.
Try loading each record into a single-element tuple, so each tuple is just the
text of one line. Then stream that relation through a UDF that reads and
parses the data into standard \t or ',' separated fields. That should be no
more than a couple lines of py
-Original Message-
From: Xiaomeng Wan [mailto:shawn...@gmail.com]
Sent: Tuesday, April 05, 2011 6:54 PM
To: user@pig.apache.org
Subject: Re: Internal error 2999 - misuse of CONCAT? misuse of GROUP?
concat only takes two fields at a time. use concat(field1,
concat(field2, field3))
Shawn
-
>Do you need the group-key to be concatenated? If not, you can just group on
>all the three columns -
>TCGroupedByFuid = group TCRaw by (SrcFuid.citingdocid,
>SrcFuid.col,
>SrcFuid.seq);
Hi Thejas,
I had tried that
I am a new pig user and have run into “Internal error 2999”.
2011-04-05 15:59:57,445 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR
2999: Unexpected internal error. null
Details at logfile:
/proj/CitationSystem/backend/hadoop/testbed-hold/pig_1302033581143.log
That shows:
Pig Stack Tr