Thanks, Raghu. Maybe another benefit of the UDF route is that it could
support the accumulator interface.
Since both approaches would use the HBase client API directly, there's no
Pig-specific benefit to using a loader, right?
Norbert
On Tue, May 29, 2012 at 8:37 PM, Raghu Angadi wrote:
> I w
I would still use a UDF; it is a lot more flexible.
Passing a large number of ids to the loader is part of the problem.
Your UDF would take a bag of ids and return bag{(session, events:bag{})}
You can pass the bag of ids in various ways :
- load ids as a relation, group all to put all of them in
There is a GSoC project to move Grunt to ANTLR, which may make it possible (if it
is desirable) to move more of these commands into macros.
2012/5/29 Alan Gates
> It's not an intended feature, but it is a side effect of the way macros
> are implemented. Pig actually has a couple of parsers in it. One
Hi Nikhil,
Can you paste your script here or pastebin?
The warning message says you are trying to access a field that does not
exist. An easy way to debug would be to make sure you have records flowing
out of each Pig statement. You can use the LIMIT operator to dump 10 records or
so and troubleshoot
Hello,
I am trying to run Pig in Hadoop mode with 2 clusters. I have installed
Hadoop 1.0.3 and Pig 0.10.
When I run Pig statements like "foreach", or if I use "MAX" or "AVG", I get the
following error:
WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered
It's not an intended feature, but it is a side effect of the way macros are
implemented. Pig actually has a couple of parsers in it. One parses Pig Latin,
the other is used by Grunt, the shell. Grunt does not know Pig Latin, but it
knows to pass it on to the Pig Latin parser. Pig Latin knows
If you do a grouping, the ordering changes. What you want to do is:
D = FOREACH C GENERATE COUNT($1) as countd;
D1 = GROUP D ALL;
D2 = FOREACH D1 {
    ord = ORDER $1 BY $0 DESC;
    GENERATE MyCustomEvalFunc(ord);
}
Keep in mind that you'll be ordering all of your data on one reducer, but
this isn't
Hi,
I've noticed that I seem to be losing the ordering of my relation after
passing the result of an ORDER BY to an EVAL function.
For example:
D = FOREACH C GENERATE COUNT($1) as countd;
E = ORDER D BY $0 DESC;
D1 = GROUP E ALL;
D2 = FOREACH D1 GENERATE MyCustomEvalFunc($1);
When inspecting th
Generally, sorting is the way to go. It's going to be difficult to get
around doing some sort of processing in order to make it easier to evaluate
equality.
If you want something generally O(n) instead of O(n log n), you could
calculate the hashCode for every tuple then SUM it (which is algebraic)
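
To make the hash-and-sum idea concrete, here is a minimal sketch in plain Python rather than Pig. The helper names (tuple_hash, order_independent_checksum) are made up for illustration, and SHA-256 over repr() stands in for whatever hashCode your tuples provide. Because addition is commutative, the sum does not depend on tuple order, which is also why it can be computed algebraically. Note that matching checksums only make equality very likely; they do not prove it.

```python
import hashlib


def tuple_hash(t):
    """Stable 64-bit hash of one tuple (hypothetical helper, not a Pig API)."""
    digest = hashlib.sha256(repr(t).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def order_independent_checksum(tuples):
    """Return (count, sum of per-tuple hashes mod 2**64) in a single O(n) pass."""
    total = 0
    count = 0
    for t in tuples:
        # Addition is commutative, so the order of the tuples does not matter.
        total = (total + tuple_hash(t)) % (1 << 64)
        count += 1
    return count, total


# Same tuples in a different order give the same checksum.
expected = [("a", 1), ("b", 2), ("c", 3)]
actual = [("c", 3), ("a", 1), ("b", 2)]
same = order_independent_checksum(expected) == order_independent_checksum(actual)
```

Comparing the count alongside the sum is a cheap extra check; if both match, the two outputs are very probably the same multiset of tuples.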
You should throw that on github, and then we could put it on
https://cwiki.apache.org/confluence/display/PIG/PigTools
2012/5/29 Johannes Schwenk
> For those who are writing pig scripts in kate, I have written a basic
> syntax highlighting file which can be found here:
>
> http://pastebin.com/dFR
We're analyzing sessions using Pig and HBase, and this session data is
currently stored in a single HBase table, where the rowkey is a
sessionid-eventid combo (tall table). I'm trying to optimize the
"extract-all-events-for-a-given-session" step of our workflow.
This could be a simple JOIN. But th
For those who are writing pig scripts in kate, I have written a basic
syntax highlighting file which can be found here:
http://pastebin.com/dFR71BVx
Installation:
# mkdir -p ~/.kde/share/apps/katepart/syntax/
# cp pig.xml ~/.kde/share/apps/katepart/syntax/
Have fun,
Johannes Schwenk
--
Softwaree
Hello all,
I'd like to verify output from a pig script that does not sort its
results prior to output. Thus the order of the tuples in the output is
non-deterministic. I would rather not add sorting to my script, because
I am potentially dealing with a lot of data here. As I have found
PigLatin do