Hi all,
I encountered an Exception like this:
ERROR org.apache.pig.tools.grunt.Grunt -
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during
parsing. null
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1641)
at org.apache.pig.PigServe
Imagine increasing the number of datasets by a couple orders of
magnitude. "ls" stops being a good browsing tool pretty quickly.
Then, add the need to manage quotas and retention policies for
different data producers, to find resources across multiple teams, to
have a web ui for easy metadata searc
Thanks Alan,
I will try this.
-Sonia
On Tue, Jul 3, 2012 at 7:56 AM, Alan Gates wrote:
> You can set different parallel levels at different parts of your script by
> attaching parallel to the different operations. For example:
>
> Y = join W by a, X by b parallel 100;
> Z = order Y by a paral
I see, thanks Jonathan.
On Tue, Jul 3, 2012 at 10:01 AM, Jonathan Coveney wrote:
> UDFs are instantiated at job construction time a couple of times in order
> to inspect various properties about them. This is suboptimal, but alas. I
> generally lazily initialize in exec, as that is only called on
UDFs are instantiated at job construction time a couple of times in order
to inspect various properties about them. This is suboptimal, but alas. I
generally lazily initialize in exec, as that is only called on the
mapper/reducer. The lifecycle of UDFs can be a bit confusing in this way.
2012/7/3
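The lazy-initialization pattern described above can be sketched in plain Python. All names here (LookupUDF, _load_table) are hypothetical stand-ins, not Pig's actual EvalFunc API; the point is only that the constructor stays cheap while the expensive setup happens on first exec.

```python
class LookupUDF:
    """Defers expensive setup until the first exec() call, so the
    instances Pig creates at job-construction time stay cheap."""

    def __init__(self):
        # Called several times during plan construction; nothing heavy here.
        self.table = None

    def _load_table(self):
        # Placeholder for expensive setup, e.g. reading a side file.
        return {"a": 1, "b": 2}

    def exec(self, key):
        # Only runs on the mapper/reducer, so initialize lazily here.
        if self.table is None:
            self.table = self._load_table()
        return self.table.get(key)
```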
There is not a way to do this in straight Pig, but it is easy with a UDF
(ideally an accumulative UDF, though with <100 records per key it
doesn't really matter). You'll do a nested sort in a foreach block, then
pass the dates to the UDF. The docs should have an example of this.
2012/7/2 Bob
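The UDF half of the approach above can be sketched like this: given a bag of dates already sorted ascending by a nested ORDER BY inside a FOREACH, return the gap in days between consecutive records. The function name days_between is hypothetical, not an existing Pig builtin, and this is plain Python rather than a registered UDF.

```python
from datetime import date

def days_between(sorted_dates):
    """sorted_dates: list of (year, month, day) tuples, ascending."""
    ds = [date(*d) for d in sorted_dates]
    # Gap in days between each record and the one before it.
    return [(b - a).days for a, b in zip(ds, ds[1:])]
```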
Instead of doing "dump relation," do "explain relation" (then run it
identically) and paste the output here. It will show whether the combiner
is being used,
2012/7/3 Ruslan Al-Fakikh
> Hi,
>
> As was said, COUNT is algebraic and should be fast, because it
> forces the combiner. You should make sure
Thanks.
Well, I did that and rebuilt all the Pig jars with -Dhadoopversion=23, but
I am still getting errors - this time it's errors with opening an
iterator on alias b again, sigh...
I have singled out one of the similarly failing tests and pasted the output:
http://pastebin.com/reKjcX0p
The t
You can set different parallel levels at different parts of your script by
attaching parallel to the different operations. For example:
Y = join W by a, X by b parallel 100;
Z = order Y by a parallel 1;
store Z into 'onefile';
If your output is big I would suggest trying out ordering in paralle
Hi,
I need to determine the number of days between dates on a running list
of records. The records associated with each key will be small (fewer
than 100), so I should be able to do it in one reducer. The data would
look something like this:
Say the headers are:
player_id, date, other_stuff
values
In our case we have
/result/CustomerId1
/result/CustomerId2
/result/CustomerId3
/result/CustomerId4
[...]
As we have a _lot_ of customers ;) we don't want to add an extra line
of code to each script.
I think the MultiStorage is perfect for our use case but we need to
extend it for avro usage.
B
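What a MultiStorage-style writer does can be roughed out as routing each record into an output path derived from one of its fields (here the customer id). This is plain Python for illustration only, not the actual MultiStorage or Avro code; the function name and record layout are assumptions.

```python
from collections import defaultdict

def partition_by_customer(records, base="/result"):
    """Group records into per-customer output paths, mirroring the
    /result/CustomerIdN layout described above."""
    buckets = defaultdict(list)
    for rec in records:
        customer_id = rec[0]  # the field used to split the output
        buckets["%s/%s" % (base, customer_id)].append(rec)
    return dict(buckets)
```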
Hi,
As was said, COUNT is algebraic and should be fast, because it
forces the combiner. You should make sure that the combiner is really
used here. It can be disabled in some situations. I've run into such
situations many times, where a job was far too heavy because no
combiner was applied.
Ruslan
On Tue
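Why an algebraic COUNT lets the combiner help can be sketched in three stages: partial counts are computed map-side and merged before the shuffle, so much less data crosses the network. The stage names below mirror Pig's Algebraic interface (Initial/Intermed/Final), but this is plain Python, not the actual implementation.

```python
def count_initial(bag):
    # Per-split partial count, computed on the map side.
    return len(bag)

def count_intermed(partials):
    # The combiner merges partial counts before the shuffle.
    return sum(partials)

def count_final(partials):
    # The reducer merges whatever partials it receives into the answer.
    return sum(partials)
```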
Dmitriy,
In our organization we use file paths for this purpose like this:
/incoming/datasetA
/incoming/datasetB
/reports/datasetC
etc
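With that convention, "listing the datasets" is just listing directories under the agreed roots. A small sketch, using local paths for illustration; on HDFS the equivalent would be hadoop fs -ls over the same layout.

```python
import os

def list_datasets(root):
    """Return dataset names under a root like /incoming or /reports."""
    try:
        return sorted(os.listdir(root))
    except OSError:
        # Root doesn't exist yet: no datasets registered under it.
        return []
```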
On Mon, Jul 2, 2012 at 9:37 PM, Dmitriy Ryaboy wrote:
> "It would give me the list of datasets in one place accessible from all
> tools,"
>
> And that's exactly