Pig error message enhancement

2012-07-03 Thread Haitao Yao
Hi all, I encountered an exception like this: ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. null at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1641) at org.apache.pig.PigServe

Re: Best Practice: store depending on data content

2012-07-03 Thread Dmitriy Ryaboy
Imagine increasing the number of datasets by a couple orders of magnitude. "ls" stops being a good browsing tool pretty quickly. Then, add the need to manage quotas and retention policies for different data producers, to find resources across multiple teams, to have a web ui for easy metadata searc

Re: One file with sorted results.

2012-07-03 Thread sonia gehlot
Thanks Alan, I will try this. -Sonia On Tue, Jul 3, 2012 at 7:56 AM, Alan Gates wrote: > You can set different parallel levels at different parts of your script by > attaching parallel to the different operations. For example: > > Y = join W by a, X by b parallel 100; > Z = order Y by a paral

Re: for UDF, figure out whether it's on a task tracker?

2012-07-03 Thread Yang
I see, thanks Jonathan. On Tue, Jul 3, 2012 at 10:01 AM, Jonathan Coveney wrote: > UDFs are instantiated at job construction time a couple of times in order > to inspect various properties about them. This is subideal, but alas. I > generally lazily initialize in exec, as that is only called on

Re: for UDF, figure out whether it's on a task tracker?

2012-07-03 Thread Jonathan Coveney
UDFs are instantiated at job construction time a couple of times in order to inspect various properties about them. This is subideal, but alas. I generally lazily initialize in exec, as that is only called on the mapper/reducer. The lifecycle of UDFs can be a bit confusing in this way. 2012/7/3
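
As a rough Pig Latin sketch of where that advice lands in a script (the jar, class name, and constructor argument below are placeholders invented for illustration, not anything from the thread):

    REGISTER 'my-udfs.jar';

    -- The class behind this DEFINE is instantiated a few times on the client
    -- while Pig inspects it (schema, properties), so its constructor should
    -- stay cheap; heavyweight setup belongs in exec(), which only runs inside
    -- the map and reduce tasks.
    DEFINE MyUdf com.example.pig.MyUdf('some-config');

    data = LOAD 'input' AS (f:chararray);
    out  = FOREACH data GENERATE MyUdf(f) AS g;
    STORE out INTO 'output';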

Re: How to find ranges between a field in a set of records

2012-07-03 Thread Jonathan Coveney
There is not a way to do this in straight Pig, but it is easy with a UDF (ideally an accumulative UDF, though if there are <100 records per key it doesn't really matter). You'll do a nested sort in a foreach block, then pass the dates to the UDF. The docs should have an example of this. 2012/7/2 Bob
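
A minimal sketch of that pattern, assuming a hypothetical DaysBetween UDF and invented jar, class, and field names:

    REGISTER 'my-udfs.jar';
    DEFINE DaysBetween com.example.pig.DaysBetween();

    records = LOAD 'input' AS (player_id:chararray, date:chararray, other_stuff:chararray);
    grouped = GROUP records BY player_id;

    -- Nested sort inside the FOREACH block, then pass the ordered dates to the UDF.
    gaps = FOREACH grouped {
        ordered = ORDER records BY date ASC;
        GENERATE group AS player_id, DaysBetween(ordered.date) AS day_gaps;
    };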

Re: What is the best way to do counting in pig?

2012-07-03 Thread Jonathan Coveney
Instead of doing "dump relation," do "explain relation" (then run identically) and paste the output here. It will show whether the combiner is being used. 2012/7/3 Ruslan Al-Fakikh > Hi, > > As it was said, COUNT is algebraic and should be fast, because it > forces combiner. You should make sure
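
For reference, a minimal sketch of what that looks like (relation and field names are placeholders):

    raw     = LOAD 'logs' AS (url:chararray);
    grouped = GROUP raw BY url;
    counts  = FOREACH grouped GENERATE group, COUNT(raw);

    -- EXPLAIN prints the logical, physical, and MapReduce plans instead of
    -- running the job; a combine plan in the MapReduce section indicates
    -- that the combiner is being applied.
    EXPLAIN counts;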

Re: Pig 0.10.0 and Hadoop 0.23

2012-07-03 Thread Johannes Schwenk
Thanks. Well, I did that and rebuilt all Pig jars with -Dhadoopversion=23 but am still getting errors - this time around it's errors with opening an iterator on alias b again, sigh... I have singled out one of the similarly failing tests and pasted the output: http://pastebin.com/reKjcX0p The t

Re: One file with sorted results.

2012-07-03 Thread Alan Gates
You can set different parallel levels in different parts of your script by attaching parallel to the different operations. For example: Y = join W by a, X by b parallel 100; Z = order Y by a parallel 1; store Z into 'onefile'; If your output is big, I would suggest trying out ordering in paralle
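
For readability, the same fragment on its own lines:

    -- The join gets many reducers; the final ORDER runs with parallel 1 so a
    -- single reducer writes one sorted output file.
    Y = join W by a, X by b parallel 100;
    Z = order Y by a parallel 1;
    store Z into 'onefile';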

How to find ranges between a field in a set of records

2012-07-03 Thread Bob Briski
Hi, I need to determine the number of days between dates on a running list of records. The records associated with each key will be few (less than 100), so I should be able to do it in one reducer. The data would look something like this: Say the headers are: player_id, date, other_stuff values

Re: Best Practice: store depending on data content

2012-07-03 Thread Markus Resch
In our case we have /result/CustomerId1 /result/CustomerId2 /result/CustomerId3 /result/CustomerId4 [...] As we have a _lot_ of customers ;) we don't want to add an extra line of code to each script. I think MultiStorage is perfect for our use case, but we need to extend it for Avro usage. B
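
For reference, this is roughly how the stock piggybank MultiStorage is wired up today (paths and the split-field index are illustrative; the Avro-capable variant discussed here would presumably keep a similar shape):

    REGISTER 'piggybank.jar';

    -- Field 0 holds the customer id; MultiStorage writes each distinct value
    -- of that field into its own subdirectory under /result, e.g.
    -- /result/CustomerId1/..., /result/CustomerId2/...
    results = LOAD 'input' AS (customer_id:chararray, payload:chararray);
    STORE results INTO '/result'
        USING org.apache.pig.piggybank.storage.MultiStorage('/result', '0');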

Re: What is the best way to do counting in pig?

2012-07-03 Thread Ruslan Al-Fakikh
Hi, As it was said, COUNT is algebraic and should be fast, because it forces the combiner. You should make sure that the combiner is really used here. It can be disabled in some situations. I've encountered such situations many times where a job got far too heavy because no combiner was applied. Ruslan On Tue
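
A minimal sketch of the combiner-friendly shape, assuming a plain group-and-count (names are placeholders); keeping the FOREACH to a simple projection plus COUNT on the grouped bag is what lets Pig push the algebraic function into the combiner, while extra nested processing in that FOREACH can prevent it:

    raw     = LOAD 'events' AS (user:chararray, url:chararray);
    grouped = GROUP raw BY user;

    -- COUNT is applied directly to the grouped bag, so Pig can compute
    -- partial counts in the combiner and sum them in the reducer.
    counts  = FOREACH grouped GENERATE group AS user, COUNT(raw) AS n;

    STORE counts INTO 'counts';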

Re: Best Practice: store depending on data content

2012-07-03 Thread Ruslan Al-Fakikh
Dmitriy, In our organization we use file paths for this purpose, like this: /incoming/datasetA, /incoming/datasetB, /reports/datasetC, etc. On Mon, Jul 2, 2012 at 9:37 PM, Dmitriy Ryaboy wrote: > "It would give me the list of datasets in one place accessible from all > tools," > > And that's exactly