RE: Array Out Of Bounds Exception in Nested foreach

2016-05-06 Thread william.dowling
Hi Sarath, I had a similar stack trace (see below). The problem was intermittent and only happened when I was running on large data sets (100s of GB). I resolved the problem by changing a GROUP BY key from a tuple of (chararray, chararray, long) to a chararray. To do that I made the group key a

python UDF invocation or memory problem

2015-12-10 Thread william.dowling
Hi Pig community, I am running a pig process using a python UDF, and getting a failure that is hard to debug. The relevant parts of the script are: REGISTER [...]clustercentroid_udfs.py using jython as UDFS ; [... definition of cluster_vals ...] grouped = group cluster_vals by (clusters::clust

RE: Getting this exception in pig

2015-09-01 Thread william.dowling
The way to debug this is to try to dump the relations named by the aliases leading up to the failure. Try to dump rec_count. Try to dump group_data. ... You will find one that works, which will also show the first alias definition that causes the failure. I guess that the problem is in your

"order by" and "distinct" in one job?

2015-06-03 Thread william.dowling
Dear Pig users, Can Pig combine sorting and unique-ing into a single job? Doing this --define Components, then Sorted_0 = order Components by block_id parallel $par; Sorted = DISTINCT Sorted_0; causes one more MR job to be launched than simply doing this: --define Components, then Sorted = order

RE: Problem in understanding UDF COUNT

2014-07-21 Thread william.dowling
This was hard for me to get when I started using pig, and it still annoys me after 1.5 years' experience with pig. In mathematics and logic, quantifiers (like "for each", "there exists") bind variables that occur in their scope: (for each x)(there exists y) [y > x] The (for each x) binds x in (th

RE: How to sample an inner bag?

2014-05-29 Thread william.dowling
As far as I can tell, the python UDF I proposed is working fine. pig passes a bag to python as a list of tuples. The implementation of random.sample is not iterating over the input list. I suppose if the bag were very huge then this would not work, or consume too much memory as the argument to

RE: How to sample an inner bag?

2014-05-28 Thread william.dowling
Thanks Mehmet! I tried that and it seems to work on a small test case. I'm also experimenting now with your other suggestion, a UDF. I will probably use something like this, which seems less tricky and does not rely on a sort: #!/usr/bin/python import random @outputSchema('id_bag: {items: (item
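The UDF quoted above is cut off mid-definition; a minimal standalone sketch of the same idea follows. The schema string and field names (`item`, `weight`) are illustrative, and `@outputSchema` — which Pig's jython runtime normally injects — is stubbed so the sketch runs outside Pig:

```python
import random

# Stub for Pig's @outputSchema decorator so the sketch runs standalone;
# inside a real jython UDF, Pig provides this.
def outputSchema(schema):
    def decorator(fn):
        fn.output_schema = schema
        return fn
    return decorator

@outputSchema('id_bag: {items: (item:chararray, weight:double)}')
def sample_bag(bag, fraction=1.0 / 3):
    """Return a random sample of roughly `fraction` of the bag's tuples.

    Pig passes an inner bag to jython as a list of tuples, so
    random.sample works on it directly -- no sort needed.
    """
    if not bag:
        return bag
    k = max(1, int(len(bag) * fraction))
    return random.sample(list(bag), k)
```

As the later follow-up notes, this holds the whole bag in memory as a list, so it is only suitable when inner bags are of modest size.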

How to sample an inner bag?

2014-05-27 Thread william.dowling
Hi Pig users, Is there an easy/efficient way to sample an inner bag? For example, with input in a relation like (id1,att1,{(a,0.01),(b,0.02),(x,0.999749968742)}) (id1,att2,{(a,0.03),(b,0.04),(x,0.998749217772)}) (id2,att1,{(b,0.05),(c,0.06),(x,0.996945334509)}) I’d like to sample 1/3 the elemen

RE: Any way to join two aliases without using CROSS

2014-03-25 Thread william.dowling
Here is how to use rank and join for this problem: sh cat xxx 1,2,3,4,5 1,2,4,5,7 1,5,7,8,9 sh cat yyy 10,11 10,12 10,13 a = load 'xxx' using PigStorage(','); b = load 'yyy' using PigStorage(','); a2 = rank a; b2 = rank b; c = join a2 by $0, b2 by $0; c2 = order c by $6; c3 = foreach c2 generat
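The semantics of that rank-then-join idiom — pairing the i-th row of one relation with the i-th row of the other, rather than computing a full CROSS — can be sketched in plain Python (illustrative only; Pig's RANK numbers rows from 1, which the sketch mirrors):

```python
def pair_by_rank(a_rows, b_rows):
    """Pair row i of a with row i of b, mimicking Pig's
    rank-each-relation-then-join-on-$0 idiom."""
    a_ranked = {i: row for i, row in enumerate(a_rows, start=1)}
    b_ranked = {i: row for i, row in enumerate(b_rows, start=1)}
    shared = sorted(set(a_ranked) & set(b_ranked))  # inner join on rank
    return [a_ranked[i] + b_ranked[i] for i in shared]
```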

RE: Need example of python code with dependency files

2013-11-06 Thread william.dowling
You said "The .py code takes input from sys.stdin and outputs to sys.stdout" so I infer you are talking about streaming, not a python UDF. In that case, rather than streaming through your python script P.py, instead stream through a shell script S.sh. The shell script can untar shipped or cached

RE: ORDER BY a map value fails with a syntax error - pig bug?

2013-10-29 Thread william.dowling
http://pig.apache.org/docs/r0.12.0/basic.html#order-by says "Pig currently supports ordering on fields with simple types or by tuple designator (*). You cannot order on fields with complex types or by expressions." I think "you cannot order ... by expressions" means the behavior you see

RE: Converting xml to csv

2013-09-17 Thread william.dowling
This is one way to get employee_id and email: A = load 'xxx.xml' using org.apache.pig.piggybank.storage.XMLLoader('employee') as (x:chararray); B = foreach A generate REPLACE(x,'[\\n]','') as x; C = foreach B generate REGEX_EXTRACT_ALL(x,'.*(?:)([^<]*).*(?:)([^<]*).*'); dump C; But it

RE: Converting xml to csv

2013-09-16 Thread william.dowling
Your example had newlines in the element. The regular expression .* does not match newlines. One way to remove newlines is REPLACE(x,'[\\n]',''). If the text ranges you are interested in do not contain newlines, for example if you are interested in but do not care about its relation to other

RE: Converting xml to csv

2013-09-13 Thread william.dowling
Ajay's suggestion will work for elements like in your example, that occur all on one line. If you want to get the whole element, and that spans more than one line, you will not be able to get it with matching (.*) since that will not match a newline character. You can remove newline character

RE: can't parse the values using XML loader

2013-08-21 Thread william.dowling
Part of the problem might be that the regexp has (.*) but you need (.*) Using regexps to parse XML is awfully brittle. An alternative is to use a UDF that calls out to an XML parser. I use ElementTree from python UDFs. Will Dowling From: Muni mahesh [m
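A minimal sketch of the ElementTree approach mentioned above — parse each XML record properly instead of regex-matching the raw markup. The tag names here are hypothetical placeholders, not from the thread:

```python
import xml.etree.ElementTree as ET

def extract_fields(record_xml, tags):
    """Parse one XML record and return the text of the named child
    elements -- far more robust than a regex over the raw markup,
    since it is unaffected by attribute order, whitespace, or newlines."""
    root = ET.fromstring(record_xml)
    return tuple(root.findtext(tag, default='') for tag in tags)
```

This is the kind of helper that can be called from a python UDF, with Pig handing each record in as a chararray.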

RE: fuzzy logic through pig programming

2013-06-27 Thread william.dowling
http://www.slideshare.net/Hadoop_Summit/pig-programming-is-fun (Daniel Dai and Thejas Nair) indicates how to use the nltk library from inside pig. nltk has methods to compute various string distance functions, including Levenshtein.
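For reference, the Levenshtein metric that nltk's edit-distance function computes is the classic dynamic program, sketched here without the nltk dependency:

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn s into t (two-row DP, O(len(s)*len(t)))."""
    if len(s) < len(t):
        s, t = t, s  # iterate over the longer string
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]
```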

RE: Passing multiple parameters to a PIG script

2013-06-20 Thread william.dowling
> pig -f $ROOT_DIR/pig0.pig -param inputDatePig=$inputDate -param StartDate > =$SDate -param EndDate=$EDate What happens if you take out the space in StartDate =$SDate so that the command is pig -f $ROOT_DIR/pig0.pig -param inputDatePig=$inputDate -param StartDate=$SDate -param EndDat

RE: missing error log

2013-03-25 Thread william.dowling
Thanks Johnny for your reply. Working backwards: I am using MRv1. I did try the logging suggestion you made, but did not get any other info. So I did it the old-fashioned way with code bisection, commenting out swaths of my script to localize the error. It turned out it was this statement: %d

missing error log

2013-03-25 Thread william.dowling
Dear pig users, What does it mean when pig [Cloudera Pig version 0.10.0-cdh4.1.2] reports 2013-03-25 14:46:31,186 [main] INFO org.apache.pig.Main - Logging error messages to: /proj/ac/acComponents/blocker/pig_1364237191181.log but that file is not created? I think there are errors in my pig sc

RE: How Select the top records

2012-09-19 Thread william.dowling
Maybe use FILTER(): http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#FILTER

RE: Parsing XML using PIG

2012-04-20 Thread william.dowling
I just use XMLLoader to break the input xml into records, then stream that through an xml parser to pull out what I need into the fields of a relation for subsequent pig processing. Like -- The analyze_src_recs.py script reads input xml from stdin, and writes to -- stdout for each relevant p
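The stream script described above — one XML record per input line (newlines already stripped by the XMLLoader step), comma-separated fields out — could look roughly like this. The field names (`rec_type`, `id`) are illustrative, not from the thread:

```python
import sys
import xml.etree.ElementTree as ET

def analyze_records(lines):
    """Parse each line as one XML record and emit a CSV line per record.

    Sketch of an analyze_src_recs.py-style script used with Pig's
    STREAM ... THROUGH; field names here are placeholders.
    """
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = ET.fromstring(line)
        rec_type = rec.get('type', '')
        rec_id = rec.findtext('id', default='')
        out.append('%s,%s' % (rec_type, rec_id))
    return out

# When run under Pig streaming, read stdin and write stdout:
# sys.stdout.write('\n'.join(analyze_records(sys.stdin)) + '\n')
```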

RE: 0.9.1 out of memory problem

2012-01-18 Thread william.dowling
Nested DISTINCT is a killer. See https://mail-archives.apache.org/mod_mbox/pig-user/201201.mbox/%3ccakne9z5unw03bk2qbyxnxbsevmlcbvy1se8_7tsdmdmedhk...@mail.gmail.com%3E for a discussion of a simple workaround that worked for me.

RE: Custom Loaders that use Input Streams for reading data?

2012-01-13 Thread william.dowling
I'm using org.apache.pig.piggybank.storage.XMLLoader from piggybank and that's working well for me. I do something like this: -- The analyze_src_recs.py script reads XML from stdin, and writes to -- stdout comma-separated lines rec_type,... -- define analyze_src `analyze_src_recs.py` i

RE: ORDER ... LIMIT failing on large data

2012-01-06 Thread william.dowling
Thanks Jonathan and Prashant. The immediate cause of the problem I had (failing without erroring out) was slightly different formatting between the small and large input sets. Duh. When I fixed that, I did indeed get OOM due to the nested distinct. I tried the workaround you suggested Jonathan

ORDER ... LIMIT failing on large data

2012-01-05 Thread william.dowling
I have a small pig script that outputs the top 500 of a simple computed relation. It works fine on a small data set but fails on a larger (45 GB) data set. I don’t see errors in the hadoop logs (but I may be looking in the wrong places). On the large data set the pig log shows Input(s): Success
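The top-500 computation the script performs (ORDER ... then LIMIT 500) is equivalent to a bounded top-N selection, which can be expressed in one pass rather than a full sort — a sketch, with the score assumed to be the last field of each row:

```python
import heapq

def top_n(rows, n=500, key=lambda r: r[-1]):
    """Keep only the n largest rows by `key`, like Pig's
    ORDER ... DESC followed by LIMIT n. heapq.nlargest does this in a
    single pass with O(n) memory, much as Pig pushes the LIMIT into
    the ORDER job."""
    return heapq.nlargest(n, rows, key=key)
```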

grunt mishandles open parenthesis in a comment

2012-01-05 Thread william.dowling
Here is grunt session showing a comment line being ignored correctly: grunt> a = load 'foo' as ( >> -- a = load 'bar' as >> a: int); grunt> grunt> describe a; a: {a: int} But when I end the comment with an open parenthesis the behavior is different: the grunt> prompt doesn’t appear til I add an

RE: Possible Pig 9.1 globing bug in parameter substitution

2011-12-15 Thread william.dowling
If -param input=s3n://foo/bar/baz/*/ blah.pig is part of a command line, you'd have to add quotes: -param 'input=s3n://foo/bar/baz/*/' blah.pig to inhibit your shell from trying to interpret the *.

RE: reading xml file within a UDF

2011-09-14 Thread william.dowling
I do this: define analyze_unif `analyze_unif_recs.py` input (stdin) output (stdout USING PigStreaming(',')) ship ('$scriptDir/analyze_unif_recs.py'); UnifLines = load '$unif_xml' using org.apache.pig.piggybank.storage.XMLLoader('REC') as (doc:chararray); UnifXmlByDocId =

rmf for forced rm

2011-08-12 Thread william.dowling
As of 0.9.0, the function ‘rmf’ for forced removal is no longer mentioned in the user docs for pig, except as a reserved word. It was documented in 0.8.1. I wonder -- is this function deprecated? Will it go away in a future version of pig? I have found rmf useful since I can’t find a w

RE: Manually build tuple from three group relations

2011-07-07 Thread william.dowling
You could use two rounds of the outer join/filter by null idiom. For example after the first round you would get allTermsMinusNonNumbers like this: grunt> sh cat allTerms aa bb cc 11 22 33 grunt> sh cat nonNumbers cc grunt> allTerms = load 'allTerms' as (term:chararray); grunt> nonNumbers = load

RE: workaround for java.lang.OutOfMemoryError: Java heap space?

2011-06-10 Thread william.dowling
Thank you Thejas! Turning off the combiner let the job go to completion. Next I can try the two-level approach to see what the performance penalty was. Kind regards, Will

workaround for java.lang.OutOfMemoryError: Java heap space?

2011-06-10 Thread william.dowling
I have a pig script that is working well for small test data sets but fails on a run over realistic-sized data. Logs show INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201106061024_0331 has failed! … job_201106061024_0331 CitedItemsGrpByDo

RE: Loading Files with Comment Lines

2011-06-07 Thread william.dowling
I do that kind of streaming on hdfs files using Hadoop streaming, outside of pig. I assume you could do it from inside pig too, but haven’t tested.

RE: Loading Files with Comment Lines

2011-06-07 Thread william.dowling
Can you stream it through grep -v '^#' ?
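The same comment-stripping step, if done in a python stream script instead of grep, is a one-line filter:

```python
def drop_comments(lines):
    """Pass through every line that does not start with '#' --
    what `grep -v '^#'` does in the suggested stream step."""
    return [ln for ln in lines if not ln.startswith('#')]
```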

RE: Set visible name of a running pig job

2011-05-26 Thread william.dowling
Thanks Jonathan. I've seen other references to using -D... on the command line, but I haven't had success with it. I tried pig -param a=b -Dmapred.job.name=whatever myscript.pig and the script failed and I got a usage message Apache Pig version 0.8.1 (r1094835) compiled Apr 18 2011, 19:26:5

RE: Set visible name of a running pig job

2011-05-26 Thread william.dowling
Thanks Eric and Mark. Now I see that job.name is documented in http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#set (duh). It also says there "All Pig and Hadoop properties can be set." Trying to figure out what exactly those properties are (is there a list someplace?) I looked at my job

Set visible name of a running pig job

2011-05-26 Thread william.dowling
When I run a pig job the hadoop job tracker gui (the one on port 50030) shows ‘PigLatin:myscript.pig’ as the name of the job. How can I configure that to show a different name than the name of the script? Thanks in advance, Will William F Dowling

RE: Set difference in Pig

2011-05-12 Thread william.dowling
I saw this somewhere. 'Anti-join' doesn't seem very descriptive to me, but that is what it was called. Anti-join (set difference) idiom in pig: A = load 'input1' as (x, y); B = load 'input2' as (u, v); C = cogroup A by x, B by u; D = filter C by IsEmpty(B); E = foreach D generate flatten(A); W
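What the cogroup-plus-IsEmpty(B) filter computes is, in plain terms: keep the tuples of A whose key matches no tuple of B. A small Python sketch of that semantics (keys assumed to be the first field of each tuple):

```python
def anti_join(a_rows, b_rows, a_key=0, b_key=0):
    """Set-difference / anti-join: return the rows of A whose key
    value appears in no row of B."""
    b_keys = {row[b_key] for row in b_rows}
    return [row for row in a_rows if row[a_key] not in b_keys]
```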

RE: ERROR: String cannot be cast to org.apache.pig.data.Tuple

2011-05-06 Thread william.dowling
In case anyone comes across this ... This problem went away when I fixed a define ... ship(...) to make sure that the file I was shipping was accessible from the running environment on the non-local cluster.

ERROR: String cannot be cast to org.apache.pig.data.Tuple

2011-05-06 Thread william.dowling
I have a pig script that is tested and working in local mode. But when I try to run it in mapreduce mode on a non-local hadoop cluster I get an error with this stack trace: ERROR 2999: Unexpected internal error. java.lang.String cannot be cast to org.apache.pig.data.Tuple java.lang.ClassCastE

ERROR 2999 when trying sample python script UDFs from UDF manual

2011-05-02 Thread william.dowling
Hi list, I am using pig 0.8.0, about to try a python UDF. But I can’t get the examples from http://pig.apache.org/docs/r0.8.0/udf.html#Python+UDFs to work, so, presumably I have some setup problem. Using hints from some other posts on this list, I think my CLASSPATH is OK; jython itself at

RE: Projecting on a pair of columns inside FOREACH() gives error 2213

2011-04-07 Thread william.dowling
Clarification: the error results after dump AA; The foreach{...} definition itself is not throwing the error. [...] AA = foreach TCGroupedByFuid { FA = TCRaw.(NewCitationRel::citingdocid,

Projecting on a pair of columns inside FOREACH() gives error 2213

2011-04-07 Thread william.dowling
I have a relation built by grouping the join (TCRaw) of a pair of basic relations (SrcFuid and NewCitationRel): grunt> describe TCGroupedByFuid; TCGroupedByFuid: { group: (SrcFuid::citingdocid: int, SrcFuid::col:chararray, SrcFuid::seq: int), TCRaw: {SrcFuid::citingdocid:

RE: Internal error 2999 - misuse of CONCAT? misuse of GROUP?

2011-04-06 Thread william.dowling
Hi Thejas, Thanks again for your help. When I omit the SrcFuid "qualifier" and use the form you suggest, I get this error (that was actually the reason I tried SrcFuid. to start with.) Pig Stack Trace --- ERROR 1025: Found more than one match: SrcFuid::citingdocid, NewCitationRel::

RE: Processing fixed length records with Pig

2011-04-06 Thread william.dowling
I'm a newbie, so fair warning. Try loading each record into a single-element tuple, so each tuple is just the text of one line. Then stream that relation through a UDF that reads and parses the data into standard \t or ',' separated fields. That should be no more than a couple lines of py
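The parsing step for one fixed-length record really is only a couple of lines — a sketch, with made-up column widths standing in for the real record layout:

```python
def parse_fixed(line, widths, sep='\t'):
    """Split one fixed-length record into fields by column widths and
    join them with a separator, as the suggested stream-through UDF
    would. The widths are placeholders for the actual record layout."""
    fields, pos = [], 0
    for w in widths:
        fields.append(line[pos:pos + w].rstrip())  # drop pad spaces
        pos += w
    return sep.join(fields)
```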

RE: Internal error 2999 - misuse of CONCAT? misuse of GROUP?

2011-04-06 Thread william.dowling
-Original Message- From: Xiaomeng Wan [mailto:shawn...@gmail.com] Sent: Tuesday, April 05, 2011 6:54 PM To: user@pig.apache.org Subject: Re: Internal error 2999 - misuse of CONCAT? misuse of GROUP? concat only takes two fields at a time. use concat(field1, concat(field2, field3)) Shawn -

RE: Internal error 2999 - misuse of CONCAT? misuse of GROUP?

2011-04-06 Thread william.dowling
>Do you need the group-key to be concatenated ? If not, you can just group on >all the three columns - >TCGroupedByFuid = group TCRaw by (SrcFuid.citingdocid, SrcFuid.col, SrcFuid.seq); Hi Thejas, I had tried that

Internal error 2999 - misuse of CONCAT? misuse of GROUP?

2011-04-05 Thread william.dowling
I am a new pig user and have run into “Internal error 2999”. 2011-04-05 15:59:57,445 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null Details at logfile: /proj/CitationSystem/backend/hadoop/testbed-hold/pig_1302033581143.log That shows: Pig Stack Tr