Pig is a higher-level language on top of Hadoop (much like Sawzall
is for Google's MapReduce) that makes Hadoop easier to use.
It has an SQL-like syntax, and it breaks a script into separate
MapReduce jobs and chains them together. From an execution point of
view, Pig scripts are as simple to run as a shell script, and they
use very few operators/commands.
Some of its commands are join, group, cogroup, load, etc.
For example, the following Pig script takes a log file in the
format <txid>,<txt>,<user> and outputs a user-term-frequency file
in the following format: <user>\t<term>\t<cnt>
-- Load the comma-separated transaction log.
raw = LOAD 'tx_log.csv' USING PigStorage(',') AS
      (transactionid:chararray, txt:chararray, user:chararray);
-- Emit one (user, term) row per token in txt.
tokenized = FOREACH raw GENERATE user, FLATTEN(TOKENIZE(txt)) AS attribute;
-- Group identical (user, term) pairs and count each group.
grouped = GROUP tokenized BY (user, attribute);
user_term_freq = FOREACH grouped GENERATE FLATTEN(group), COUNT(tokenized);
STORE user_term_freq INTO 'user_term_freq.txt';
At runtime, Pig breaks the script into several map and reduce
tasks. It reads the Hadoop configuration from the hadoop-site.xml
on its classpath.
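To make the map/reduce split concrete, here is a rough Python sketch (not Pig, and not what Pig actually generates) of how the grouping and counting steps of the script above could decompose into a map phase, a shuffle, and a reduce phase. The function names and the sample records are made up for illustration, and tokenization is simplified to whitespace splitting, which is narrower than Pig's TOKENIZE.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit ((user, term), 1) for every term in each <txid>,<txt>,<user> record."""
    for line in lines:
        # Assumes exactly three comma-separated fields, i.e. no commas in <txt>.
        txid, txt, user = line.strip().split(",")
        for term in txt.split():
            yield (user, term), 1


def shuffle(pairs):
    """Collect all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts per (user, term) key and format one tab-separated row each."""
    for (user, term), ones in sorted(groups.items()):
        yield f"{user}\t{term}\t{sum(ones)}"


# Hypothetical sample input, three log records.
sample = [
    "t1,apple banana apple,alice",
    "t2,banana,bob",
    "t3,apple,alice",
]

rows = list(reduce_phase(shuffle(map_phase(sample))))
for row in rows:
    print(row)
```

On a real cluster the map and reduce functions would run on different nodes and the shuffle would be handled by Hadoop; the sketch only shows which part of the work lands in which phase.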
-Prasen
On Wed, Feb 11, 2009 at 4:54 PM, Sean Owen <[email protected]> wrote:
Needs to go somewhere like trunk/core/src/pig/main, right, versus
/java/?
I also see no harm in adding it, other than that it would remain
pretty isolated, right? It isn't part of the build, can't be
integrated with the other code, etc. Does it add value to package it
with the project, then?
Perhaps I misunderstand what Pig can do or how it can relate to
Java?
On Wed, Feb 11, 2009 at 11:13 AM, Grant Ingersoll <[email protected]> wrote:
Hmm, hadn't really thought about it, but I see no reason why we
wouldn't accept it and add it. I think our source tree can
definitely handle it.
I'd propose it go somewhere under:
trunk/core/src/main/pig/plsi
I'm not familiar with Pig, but I can learn, and I know others are.
Is it a single file?
See http://cwiki.apache.org/MAHOUT/howtocontribute.html for
instructions on contributing. Basically, just attach the file(s) to
a JIRA issue.
On Feb 11, 2009, at 2:18 AM, prasenjit mukherjee wrote:
Hi,
I have implemented Hofmann's PLSI/EM algorithm in Pig, which I
would like to contribute back to the community for further scrutiny
and improvement. Let me know whether Mahout is the appropriate forum
or whether it should go to the Pig project.
I haven't seen any non-Java contributions to Mahout yet, which
raises the question: is Mahout Java-only?
-Thanks,
Prasen
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search