Re: Pig mascot questions

2021-07-27 Thread Alan Gates
On Tue, Jul 27, 2021 at 6:30 PM Cat Lee Ball 
wrote:

> Hi everyone,
>
> I've been wondering and wanted to ask about the Apache Pig mascot:
>
>   -
> https://svn.apache.org/repos/asf/comdev/project-logos/originals/pig.svg
>
>
> In particular:
>
>   - Does anyone know if there's any history how this mascot came to be?
>   - What is the pig's name? Pronouns?
>   - Who drew the pig?
>   - Is the pig under any particular license?
>
Pig was originally developed at Yahoo, and then donated to Apache.  As far
as I recall the logo was drawn by someone in the Yahoo graphic design team
and donated as part of the original code grant.  I have no idea who the
original artist was.

I don't recall the pig in the logo ever having a name nor any particular
pronoun being specified.

I believe Apache's general approach on logos is that they are trademarked
along with the software project name, even if the trademark is not
registered.

>
> And more generally,
>
>   - How was it decided to call this software "Apache Pig"?
>
Quoting from O'Reilly's _Programming Pig_ "The story goes that the
researchers working on the project initially referred to it simply as 'the
language'.  Eventually they needed to call it something.  Off the top of
his head, one researcher suggested Pig, and the name stuck.  It is quirky
yet memorable and easy to spell.  While some have hinted that the name
sounds coy or silly, it has provided us with an entertaining nomenclature,
such as Pig Latin for the language, Grunt for the shell, and PiggyBank for
the CPAN-like shared repository."

>
>
> I recently added the pig to the Wikipedia list of computing mascots, and
> was
> curious to learn more about it.
>
>   - https://en.wikipedia.org/wiki/List_of_computing_mascots#P
>
>
> Thanks,
> Cat
>

Alan.


Re: Co Group vs Join in pig

2016-09-29 Thread Alan Gates
Filters can be pushed above co-group, the same as they can above join, if 
that’s what you’re asking.

The number of map reduce jobs depends.  A cogroup will always result in one 
job.  Some joins result in multiple jobs (skew joins), some in map only jobs 
(fragment-replicate).
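
For illustration, a rough sketch of those variants (relation, field, and file
names here are made up):

users  = load 'users' as (id:chararray, name:chararray);
clicks = load 'clicks' as (user_id:chararray, url:chararray);

-- a filter following a cogroup; as noted above, Pig can push such filters above the cogroup
C = cogroup clicks by user_id, users by id;
D = filter C by group != 'excluded_user';

-- fragment-replicate join: map only, the last (small) input is loaded into memory
J1 = join clicks by user_id, users by id using 'replicated';

-- skew join: adds a sampling pass, so it may compile into more than one job
J2 = join clicks by user_id, users by id using 'skewed';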

Alan.

> On Sep 28, 2016, at 17:06, Kashif Hussain  wrote:
> 
> Will a co group with filter be equivalent to join ?
> I mean will pig optimize the former to achieve performance equivalent to
> latter ? I assume that single map reduce job will be spawned in both cases.
> 
> On Wed, Sep 28, 2016 at 11:14 PM, Alan Gates  wrote:
> 
>> Cogroup is only the first half of join.  It collects the records with the
>> matching key together.  It does not do the cross product of records with
>> matching keys.
>> 
>> If you are going to do a join (that is, you want to produce the matching
>> records) join is usually better as there are a number of join optimizations
>> available (skew join, fragment/replicate) which aren’t there for cogroup.
>> But if you don’t need to actually instantiate the records, cogroup can be
>> faster.  For example, say you just wanted to count the number of matching
>> records, then doing a cogroup and passing the resulting bags to COUNT would
>> give you your answer.
>> 
>> Alan.
>> 
>>> On Sep 28, 2016, at 07:15, Kashif Hussain  wrote:
>>> 
>>> Hi,
>>> 
>>> I want to know in which cases co group can perform better than join ?
>>> What is the advantage of co group ?
>>> 
>>> Regards,
>>> Kashif
>> 
>> 



Re: Co Group vs Join in pig

2016-09-28 Thread Alan Gates
Cogroup is only the first half of join.  It collects the records with the 
matching key together.  It does not do the cross product of records with 
matching keys.

If you are going to do a join (that is, you want to produce the matching 
records) join is usually better as there are a number of join optimizations 
available (skew join, fragment/replicate) which aren’t there for cogroup.  But 
if you don’t need to actually instantiate the records, cogroup can be faster.  
For example, say you just wanted to count the number of matching records, then 
doing a cogroup and passing the resulting bags to COUNT would give you your 
answer.
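
A rough sketch of that counting case (relation and field names are invented for
illustration):

clicks = load 'clicks' as (user_id:chararray, url:chararray);
users  = load 'users' as (id:chararray, name:chararray);
G = cogroup clicks by user_id, users by id;
-- count matching records per key without materializing the cross product
matches = foreach G generate group, COUNT(clicks) as num_clicks, COUNT(users) as num_users;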

Alan.

> On Sep 28, 2016, at 07:15, Kashif Hussain  wrote:
> 
> Hi,
> 
> I want to know in which cases co group can perform better than join ?
> What is the advantage of co group ?
> 
> Regards,
> Kashif



Re: How Tez work in Hive and Pig

2016-08-18 Thread Alan Gates

> On Aug 12, 2016, at 01:38, darion.yaphet  wrote:
> 
> Hi team :
> 
> 
> We are using Tez as our execution engine for Hive and Pig.  I'm very curious
> about how Hive and Pig use it to execute their plans.
> 
> 
> Is there some design document or implementation detail about it?  Thanks :)

https://cwiki.apache.org/confluence/display/PIG/Pig+on+Tez
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez

Alan.

Re: Optimally assigning reducers

2016-07-06 Thread Alan Gates
My first guess is that your join has significant skew in the keys, so many are 
getting assigned to a single reducer.  Have you tried the skew join 
algorithm[1]?

Alan.

1. https://pig.apache.org/docs/r0.16.0/perf.html#skewed-joins
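A minimal sketch of the skew join syntax from the docs above (file names, field
names, and the parallelism value are placeholders):

A = load 'left_input' as (k:chararray, v:int);
B = load 'right_input' as (k:chararray, w:int);
-- 'skewed' adds a sampling job and spreads heavily skewed keys across multiple reducers
C = join A by k, B by k using 'skewed' parallel 40;
store C into 'joined';
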
> On Jul 6, 2016, at 08:55, Nigam, Vibhor  wrote:
> 
> Hi
> 
> I am facing a problem in my pig script. It has a simple inner join and a 
> grouping. However after around 70% of the script gets processed all the 
> reduction process gets assigned to one reducer, which in turn increases the 
> complete time of the script heavily.
> 
> I need to use this script for automating the process which under given 
> circumstances seems problematic. Kindly, let me know how can I overcome this 
> and assign reducers optimally
> 
> Best Regards
> Vibhor Nigam
> Product Engineer III
> TnP, Comcast
> 1717 Arch Street, Philadelphia
> 
> 



Re: How does Pig Pass Data from First Job and its next Job

2015-09-15 Thread Alan Gates
Pig writes the data to disk in its own format.  Given that in the cluster you 
don't know which machines tasks will run on, storing it directly in memory is 
not feasible.  You could use something like HDFS's in-memory files (which Pig 
doesn't use yet) or Spark's RDDs for this.


Alan.


Argho Chatterjee 
September 15, 2015 at 6:12
Hi All,

As we all know, Apache Pig is a data flow language. If I write a Pig script
and Pig decides to split the work into two or more jobs to execute the task
at hand, how does Pig store the data that it passes from job 1 to job 2?

I read the Pig documentation, which says:

"Pig allocates a fix amount of memory to store bags and spills to disk as
soon as the memory limit is reached. This is very similar to how Hadoop
decides when to spill data accumulated by the combiner."

(url: http://pig.apache.org/docs/r0.9.1/perf.html#memory-management)

So does Pig have a writer which stores the output of an intermediate job in
memory/RAM for better performance (spilling to disk if required), and a
reader which reads that data directly from memory to pass it to the next job
for processing?

In MapReduce, we write the entire data to disk and then read it again for
the next job to start.

Does Pig have an upper hand here, by implementing readers and writers which
write to RAM/memory (spill if required) and read from RAM (and disk if
required) for better performance?

Kindly share your expertise/views on the highlighted comment from the Pig
documentation as to what it actually means.

Thanks in Advance,

Cheers :))



Re: Query | Join Internals

2015-07-30 Thread Alan Gates
Here's the original design doc: 
https://wiki.apache.org/pig/PigSkewedJoinSpec


Alan.


Gagan Juneja 
July 29, 2015 at 21:30
Any help?

Regards,
Gagan


Gagan Juneja 
July 14, 2015 at 4:56
Hi Team,

We are using Pig intensively in our various projects. We are doing
optimizations, and for that we wanted to know how join works. We have
moved to skewed joins for some of our use cases.

In many places the documentation mentions that in a join, data is
streamed for the second table. I was trying to work out how this fits into
the map reduce paradigm.

1. Can anyone please clarify how join happens in Pig?
2. What is the meaning of streaming here? Are we loading the files directly
in the reducers?


Regards,
Gagan



Re: PigMix extension

2015-07-15 Thread Alan Gates
The initial goal of PigMix was definitely to give the project a way to 
measure itself against MapReduce and across different releases.  So that 
falls into your synthetic category.

That said, if adding a field enables extending the benchmark into new 
territory and makes it more useful, then that seems like a clear win.


Alan.


Keren Ouaknine 
July 14, 2015 at 12:44
Hi,

I am working on expanding the PigMix benchmark.
I am interested to add queries matching more realistic use cases, such as
finding what are the highest revenue of a page or what is the burst of
activity for a specific page. Additionally, I would like to add OLTP-like
queries such as finding other users from the same neighborhood looking 
at a

specific page.

The current PigMix table does not have an id for a page access (see 
details
on page_views here 
).

Therefore I cannot run the above queries.

I am wondering why was this field omitted from the schema of page_views?
It seems a fundamental field for all aggregation queries on page_views.

I see two options: either there is another use case that this schema
targets (what is it?) or the benchmark's goal is not to target real use
cases and is merely oriented towards a synthetic performance and
measurement goal.

Any ideas?

Thank you,
Keren

​PS: I sent this email to both the devs and users' mailing list, not to
spam us :) but because these queries are both a users and a development
concern. ​




Re: pig 0.11.x release download

2015-04-15 Thread Alan Gates

https://archive.apache.org/dist/pig/

Alan.


Alex Nastetsky 
April 15, 2015 at 3:45
Does anyone know where I can get a 0.11.x release of Pig?

This site has 2 links -- one to releases 0.8 and later, and to 0.7 and
earlier:
https://pig.apache.org/releases.html#Download

But the first link only has releases starting with 0.12. I don't see
anything between 0.7 and 0.12.

Thanks.



Re: REGISTER ... with or without quotes

2015-04-09 Thread Alan Gates
Though I'm tempted to say the O'Reilly book is always right, the official 
stance on this is the one in the Pig documentation on the website.

Alan.


Michael Howard 
April 9, 2015 at 9:43
Q: When using the REGISTER statement to register .jar files containing
UDFs, should the jar file name be in quotes or not?

The O'Reilly book Programming Pig consistently uses single quotes around
.jar path names with the REGISTER command.

The documentation for 0.14.0 (and 0.7.0) says:
Do not place the name in quotes.

In my testing, it works either with or without the .jar path in quotes.

Q: Is there an "official" stance on this?


Michael



Re: Help with HCatLoader against remote Hive2

2015-03-23 Thread Alan Gates
HiveMetaStore is definitely meant to be hit remotely.  Your URI should 
be thrift://your.host.com:9083.
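
For reference, the client-side hive-site.xml entry would look something like
this (the host name is a placeholder):

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://your.host.com:9083</value>
  </property>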


Alan.


Adam Silberstein 
March 22, 2015 at 17:40
Hi,
Having some trouble getting hcatloader to work.

My script is this:
A = LOAD 'testTable' USING org.apache.hive.hcatalog.pig.HCatLoader();
DUMP A;

I got it working using PigServer on the node where Hive is running. That
hive-site.xml contains this property:

hive.metastore.uris
thrift://hadoop:9083


HiveMetaStore is running at 9083 and I seem to be able to hit it. Things
work.

When I run from a remote location using the same exact code, classpath, and
config, when I try to run against 9083 I get connection refused. I think
that's expected since I don't think HiveMetaStore is meant to be hit
remotely.

I tried changing 9083 to 1, where HiveServer2 is running. In this case
my request hangs and I see a thrif error in the hive-server2.log (see
below).

I'm guessing this is not the proper way to get pig to make calls against
HiveServer2. My question: is there a way to use Pig remotely in this way?
Or do I need to have Hive running on the edge node where I run my Pig 
jobs?


Thanks,
Adam


java.lang.RuntimeException:
org.apache.thrift.transport.TTransportException: Invalid status -128
at
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:227)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: Invalid status
-128
at
org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:230)
at
org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:184)
at
org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
at
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:262)
at
org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
... 4 more



Re: Pig Meetup at LinkedIn 3/14

2014-01-16 Thread Alan Gates
Mark from LinkedIn is working on getting something set up.  He’ll post details 
on the meetup page once he has them.

Alan.

On Jan 14, 2014, at 8:44 PM, Joao Salcedo  wrote:

> HI Alan,
> 
> Is it possible to record the meeting or do some  WebEx for people living in
> other cities or countries like me.
> 
> cheers,
> 
> Joao
> 
> 
> On Wed, Jan 15, 2014 at 3:39 PM, Alan Gates  wrote:
> 
>> A Pig Meetup is scheduled for March 14th.  Planned talks include Pig on
>> Tez, Pig on Storm, Intel Graph Builder, PigPen (MR for Clojure) and
>> Accumulo Storage.  See below for details.  You can sign up meetup.com,
>> under Pig User Group.
>> 
>> Alan.
>> 
>> 
>> Begin forwarded message:
>> 
>> From: Pig user group 
>> Subject: Invitation: Pig User Meetup
>> Date: January 14, 2014 at 4:28:30 PM PST
>> To: ga...@apache.org
>> 
>> 
>> NEW MEETUP
>> Pig User Meetup
>> Pig user group
>> Added by Alan Gates
>> Friday, March 14, 2014
>> 2:00 PM
>> LinkedIn
>> 2025 Stierlin Ct
>> Unite Room (2nd Floor)
>> Mountain View, CA 94043
>> I'm going
>> Change your RSVP
>> Tentative lineup for this meetup:
>> Pig on Tez
>> Pig on Storm
>> Intel Graph Builder
>> Pig Pen (MR for Clojure)
>> Accumulo Storage
>> LEARN MORE: http://www.meetup.com/PigUser/events/160604192/
>> 
>> 
>> 




Pig Meetup at LinkedIn 3/14

2014-01-14 Thread Alan Gates
A Pig Meetup is scheduled for March 14th.  Planned talks include Pig on Tez, 
Pig on Storm, Intel Graph Builder, PigPen (MR for Clojure) and Accumulo 
Storage.  See below for details.  You can sign up meetup.com, under Pig User 
Group.

Alan.


Begin forwarded message:

> From: Pig user group 
> Subject: Invitation: Pig User Meetup
> Date: January 14, 2014 at 4:28:30 PM PST
> To: ga...@apache.org
> 
>  
>   
>  
> NEW MEETUP
> Pig User Meetup
> Pig user group
> Added by Alan Gates
> Friday, March 14, 2014
> 2:00 PM
> LinkedIn
> 2025 Stierlin Ct
> Unite Room (2nd Floor)
> Mountain View, CA 94043
> I'm going
> Change your RSVP
> Tentative lineup for this meetup:
> Pig on Tez
> Pig on Storm
> Intel Graph Builder
> Pig Pen (MR for Clojure)
> Accumulo Storage




Re: Does Pig support HCatalogStorer table with buckets

2013-12-09 Thread Alan Gates
No.  HCat explicitly checks whether a table is bucketed and, if so, disables 
storing to it, to avoid writing to the table in a destructive way.

Alan.

On Dec 6, 2013, at 3:45 PM, Araceli Henley wrote:

> Hi
> 
> 
> :
> 
> QUESTION:
> 
> :
> 
> Can anyone confirm if HCatalogStore works with a hive table that was
> declared with buckets?
> 
> 
> :
> 
> DETAILS:
> 
> :
> 
> 
> I have a table in hive that was created with buckets. But when I tried to
> load the data with HCatalogStorer it fails with the following error.
> 
> 
> Store into a partition with bucket definition from Pig/Mapreduce is not
> supported.
> 
> 
> I have a table declaration in hive:
> 
> 
> ..
> 
>   PARTITIONED BY(dtStr STRING)
> 
>   CLUSTERED BY(sessionid) SORTED BY(timestr) INTO 32 BUCKETS
> 
>   ROW FORMAT DELIMITED
> 
>   FIELDS TERMINATED BY '1'
> 
>   COLLECTION ITEMS TERMINATED BY '2'
> 
>   MAP KEYS TERMINATED BY '3'
> 
>   STORED AS ORC;
> 
> 
> From pig, I load the data with HCatStorer:
> 
> 
> STORE sessnz_all INTO '$DB.allPocData' USING
> org.apache.hcatalog.pig.HCatStorer();
> 
> 
> 
> Details at logfile:
> /home/araceli/src/bigdata/projects/cisco_webanalytics_poc/src/server/pig/scripts/pig_1386373152479.log
> 
> [araceli@greenhost03 scripts]$ pig -version
> 
> Apache Pig version 0.11.2-mapr (rexported)
> 
> compiled Aug 27 2013, 13:50:32
> 
> [araceli@greenhost03 scripts]$ hive -version
> 
> 
> Logging initialized using configuration in
> jar:file:/opt/mapr/hive/hive-0.11/lib/hive-common-0.11-mapr.jar!/hive-log4j.properties
> 
> Hive history
> 
> I have a table declaration in hive:
> 
> 
> ..
> 
>   PARTITIONED BY(dtStr STRING)
> 
>   CLUSTERED BY(sessionid) SORTED BY(timestr) INTO 32 BUCKETS
> 
>   ROW FORMAT DELIMITED
> 
>   FIELDS TERMINATED BY '1'
> 
>   COLLECTION ITEMS TERMINATED BY '2'
> 
>   MAP KEYS TERMINATED BY '3'
> 
>   STORED AS ORC;
> 
> 
> From pig, I load the data with HCatStorer:
> 
> 
> STORE sessnz_all INTO '$DB.allPocData' USING
> org.apache.hcatalog.pig.HCatStorer();
> 
> 
> 
> Details at logfile:
> /home/araceli/src/bigdata/projects/cisco_webanalytics_poc/src/server/pig/scripts/pig_1386373152479.log
> 
> [araceli@greenhost03 scripts]$ pig -version
> 
> Apache Pig version 0.11.2-mapr (rexported)
> 
> compiled Aug 27 2013, 13:50:32
> 
> [araceli@greenhost03 scripts]$ hive -version
> 
> 
> Logging initialized using configuration in
> jar:file:/opt/mapr/hive/hive-0.11/lib/hive-common-0.11-mapr.jar!/hive-log4j.properties
> 
> Hive history




Re: Bag of tuples

2013-11-06 Thread Alan Gates
Do you mean you want to find the top 5 per input record?  Also, what is your 
ordering criterion?  Just sort by id?  Something like this should order all 
tuples in each bag by id and then produce the top 5.  My syntax may be a little 
off as I'm working offline and don't have the manual in front of me, but this 
should be the general idea.

A = load 'yourinput' as (b:bag{t:(id, f1, f2, f3)});
B = foreach A {
    B1 = order b by id;  -- order the bag's tuples on the id
    B2 = limit B1 5;
    generate flatten(B2);
};

Alan.

On Nov 5, 2013, at 9:52 AM, Sameer Tilak wrote:

> Hi Pig experts,
> Sorry to post so many questions, I have one more question on doing some 
> analytics on bag of tuples.
> 
> My input has the following format:
> 
> {(id1,x,y,z), (id2, a, b, c), (id3,x,a)}  /* User 1 info */
> {(id10,x,y,z), (id9, a, b, c), (id1,x,a)} /* User 2 info */
> {(id8,x,y,z), (id4, a, b, c), (id2,x,a)} /* User 3 info */
> {(id6,x,y,z), (id6, a, b, c), (id9,x,a)} /* User 4 info */
> 
> I can change my UDF to give more simple output. However, I want to find out 
> if something like this can be done easily:
> I would like to find out top 5 ids (field 1 in a tuple) among all the users. 
> Note that each user has a bag and the first field of each tuple in that bag 
> is id. 
> 
> How difficult will it be to filter based on fields of tuples and do analytics 
> across the entire user base.
> 




Re: support for distributed cache archives

2013-11-04 Thread Alan Gates
I don't see why we couldn't.  Step one would be to file a JIRA.  After that, if 
you have the time and inclination feel free to provide a patch for it.

Alan.

On Nov 1, 2013, at 10:31 PM, Jim Donofrio wrote:

> Any thoughts on this?
> 
> On 10/22/2013 10:36 AM, Jim Donofrio wrote:
>> JobControlCompiler.setupDistributedCache only calls
>> DistributedCache.addCacheFile. Can you add support for adding archives
>> in the distributed cache by calling DistributedCache.addCacheArchive
>> based on a set of typical file extensions or by adding a
>> getCacheArchives() method to EvalFunc?
>> 
>> Thanks




Re: convert rows to columns in Pig

2013-10-21 Thread Alan Gates
I think the following will do what you want:

A = load 'input';
B = group A all;
C = foreach B generate flatten(BagToTuple(A));

Note that this will collect all records into one bag and produce one output 
record.  That won't scale well, and may not be what you want.

Alan.

On Oct 18, 2013, at 8:37 PM, soniya B wrote:

> any one can guide me on this problem?
> 
> 
> On Fri, Oct 18, 2013 at 1:15 PM, soniya B  wrote:
> 
>> I think it will work for to change columns to rows. I am looking to change
>> rows to columns.
>> 
>> 
>> On Fri, Oct 18, 2013 at 12:14 PM, ajay kumar wrote:
>> 
>>> try this,
>>> 
>>> A = load 'input';
>>> B = foreach A generate FLATTEN(TOBAG(*));
>>> 
>>> 
>>> On Fri, Oct 18, 2013 at 12:04 PM, soniya B 
>>> wrote:
>>> 
 Hi,
 
 How to convert rows to columns in pig latin?
 
 example:
 
 input file
 
 A   100 3
 B   200 4
 C   400 6
 
 required output
 
 ABC
 
 100 200400
 
>>> 
>>> 
>>> 
>>> --
>>> *Thanks & Regards,*
>>> *S. Ajay Kumar
>>> +91-9966159106*
>>> 
>> 
>> 




Re: number of M/R jobs for a Pig Script

2013-10-15 Thread Alan Gates
Pig handles doing multiple group bys on the same input, often in a single MR 
job.  So:

A = load 'file';
B = group A by $0;
C = foreach B generate group, COUNT(A);
store C into 'output1';
D = group A by $1;
E = foreach D generate group, COUNT(A);
store D into 'output2';

This can be done in a single MR job.  Is that what you're looking for?

Alan.

On Oct 15, 2013, at 12:12 PM, ey-chih chow wrote:

> What I really want to know is,in Pig, how can I read an input data set only
> once and generate multiple instances with distinct keys for each data point
> and do a group-by?
> 
> Best regards,
> 
> Ey-Chih Chow
> 
> 
> On Tue, Oct 15, 2013 at 10:16 AM, Pradeep Gollakota 
> wrote:
> 
>> I'm not aware of anyway to do that. I think you're also missing the spirit
>> of Pig. Pig is meant to be a data workflow language. Describe a workflow
>> for your data using PigLatin and Pig will then compile your script to
>> MapReduce jobs. The number of MapReduce jobs that it generates is the
>> smallest number of jobs (based on the optimizers) that Pig thinks it needs
>> to complete the workflow.
>> 
>> Why do you want to control the number of MR jobs?
>> 
>> 
>> On Tue, Oct 15, 2013 at 10:07 AM, ey-chih chow  wrote:
>> 
>>> Thanks everybody.  Is there anyway we can programmatically control the
>>> number of M-R jobs that a Pig script will generate, similar to write M-R
>>> jobs in Java?
>>> 
>>> Best regards,
>>> 
>>> Ey-Chih Chow
>>> 
>>> 
>>> On Tue, Oct 15, 2013 at 6:14 AM, Shahab Yunus >>> wrote:
>>> 
 And Geert's comment about using external-to-Pig approach reminds me
>> that,
 then you have Netflix's PigLipstick too. Nice visual tool for actual
 execution and stores job history as well.
 
 Regards,
 Shahab
 
 
 On Tue, Oct 15, 2013 at 8:51 AM, Geert Van Landeghem <
>> g...@foundation.be
> wrote:
 
> You can also use ambrose to monitor execution of your pig script at
> runtime. Remark: from pig-0.11 on.
> 
> It show you the DAG of MR jobs and which are currently being
>> executed.
>>> As
> long as pig-ambrose is connected to the execution of your script
 (workflow)
> you can replay the workflow.
> 
> --
> kind regards,
> Geert
> 
> 
> 
> 
> On 15-okt.-2013, at 14:43, Shahab Yunus 
>>> wrote:
> 
>> Have you tried using ILLUSTRATE and EXPLAIN command? As far as I
>>> know,
 I
>> don't think they give you the exact number as it depends on the
>>> actual
> data
>> but I believe you can interpret it/extrapolate it from the
>>> information
>> provided by these commands.
>> 
>> Regards,
>> Shahab
>> 
>> 
>> On Tue, Oct 15, 2013 at 3:57 AM, ey-chih chow 
 wrote:
>> 
>>> Hi,
>>> 
>>> I have a Pig script that has two group-by statements on the the
>>> input
> data
>>> set.  Is there anybody knows how many M-R jobs the script will
 generate?
>>> Thanks.
>>> 
>>> Best regards,
>>> 
>>> Ey-Chih Chow
>>> 
> 
> 
 
>>> 
>> 




Re: Accessig paritcular folder

2013-10-04 Thread Alan Gates
For any Pig loader that reads files from HDFS, filenames are passed directly to 
HDFS.  This means HDFS-style globs are supported, so the answer to your question 
depends on the version of HDFS you have.  For your version of Hadoop, take a 
look at the documentation for FileSystem.globStatus; it will explain the 
supported globs.  Based on the documentation from Hadoop 1.2, I believe the 
following pattern would work:

A = load 'data/shard?/d1_1/[^_d2*]'

Alan.

On Oct 2, 2013, at 12:59 PM, jamal sasha wrote:

> Hi,
>I have data in this one folder like following:
> 
> data---shard1---d1_1
>|  |_d2_1
>Lshard2---d1_1
>|  |_d2_2
>Lshard3---d1_1
>|  |_d2_3
>Lshard4---d1_1
>   |_d2_4
> 
> 
> Now, I want to search something in d1 (and excluding all the d2's) in it.
> How do i do this in pig?
> Thanks




Re: piglipstick

2013-09-06 Thread Alan Gates
On the Hive side, the Netflix team recently told me they are working on 
"honey", an equivalent thing for Hive.  I believe a prototype is in their 
github.

Alan.

On Sep 5, 2013, at 11:40 PM, ajay kumar wrote:

> Hi all,
> any one worked on piglipstick?
> 
> Please share some info about piglipstick like how to install, how it works
> etc.
> 
> can i apply lipstick to hive also?
> 
> 
> 
> 
> 
> -- 
> *Thanks & Regards,*
> *S. Ajay Kumar
> +91-9966159106*




Re: Grunt Shell hangs on Cygwin.

2013-08-08 Thread Alan Gates
Yes, no cygwin tools are required.  You will need the hadoop version from 
branch-1-win as well as pig trunk to make this work.

Alan.

On Aug 5, 2013, at 10:03 PM, Darpan R wrote:

> Thanks Alan I am not sure if I quite understand this.
> Do you mean directly from Windows command prompt?
> 
> Regards,
> Darpan
> 
> On 6 August 2013 02:34, Alan Gates  wrote:
> 
>> You might try running Pig trunk without cygwin.  Much work has been done
>> lately to make Pig work directly on windows.
>> 
>> Alan.
>> 
>> On Aug 4, 2013, at 9:49 PM, Darpan R wrote:
>> 
>>> Thanks Sudhir,
>>> I tried running scripts , it takes a long time to start pig and stop (
>>> setup/cleanup) .
>>> Please keep us updated if any one is able to find the work around for the
>>> same.
>>> 
>>> Thanks & Regards,
>>> Darpan
>>> 
>>> On 3 August 2013 10:17, Sudhir N  wrote:
>>> 
>>>> I have same problem, I could not find a solution, seems grunt doesn't
>> work
>>>> on Cygwin, I stopped trying.. I run scripts
>>>> 
>>>> 
>>>> Sudhir N
>>>> 
>>>> -Original Message-
>>>> From: Darpan R [mailto:darpa...@gmail.com]
>>>> Sent: Friday, August 02, 2013 6:59 PM
>>>> To: user@pig.apache.org
>>>> Subject: Grunt Shell hangs on Cygwin.
>>>> 
>>>> Hi Guys,
>>>> 
>>>> I am running Hadoop on local mode on my windows 7 machine (32 Bit).
>>>> I've installed HIVE/PIG/Hadoop/Java6 all on the C: drive.
>>>> I am using Cygwin version : 2.819.
>>>> PIG Version I tried with 0.11 and 0.10 (both I am facing issue) Hadoop
>>>> Version : 1.1.2 Hive Version : 0.10 Java version : 1.6 minor version 34
>>>> 
>>>> I've mounted c: on the cygwin. I am able to run hadoop commands from the
>>>> cygwin terminal for example : fs -ls etc. I am also able to start grunt
>> and
>>>> hive shells.
>>>> But the real problem is :
>>>> Any command I enter on grunt shell ( example : fs -ls or records =
>>>> LOAD.
>>>> ) I do not see any output, it kind of hangs.
>>>> Similarly with the hive prompt if I give the command as show tables ; I
>> do
>>>> not see any output just cursor keeps on blinking! Any keyboard inputs
>> and
>>>> gives NOTHING.
>>>> System appears to be doing NOTHING.
>>>> To me everything looks fine but definitely something is going wrong :-)
>>>> 
>>>> 
>>>> Here are my classpath and environment variables from .bashrc file:
>>>> 
>>>> export JAVA_HOME=/c/Java/jdk1.6.0_34
>>>> export HADOOP_HOME=/c/Hadoop
>>>> export PIG_HOME=/c/PIG
>>>> export HIVE_HOME=/c/Hive
>>>> export HADOOP_BIN=$HADOOP_HOME/bin/hadoop
>>>> export PATH=$PATH:/c/Java/jdk1.6.0_34/bin
>>>> export PATH=$PATH:$HADOOP_HOME/bin
>>>> export PATH=$PATH:$HIVE_HOME/bin
>>>> export PATH=$PATH:$PIG_HOME/bin
>>>> 
>>>> I am not sure if I am missing something. Any help will be highly
>>>> appreciated.
>>>> 
>>>> Thanks & Regards,
>>>> -Darpan
>>>> 
>>>> 
>> 
>> 



Re: Grunt Shell hangs on Cygwin.

2013-08-05 Thread Alan Gates
You might try running Pig trunk without cygwin.  Much work has been done lately 
to make Pig work directly on windows.

Alan.

On Aug 4, 2013, at 9:49 PM, Darpan R wrote:

> Thanks Sudhir,
> I tried running scripts , it takes a long time to start pig and stop (
> setup/cleanup) .
> Please keep us updated if any one is able to find the work around for the
> same.
> 
> Thanks & Regards,
> Darpan
> 
> On 3 August 2013 10:17, Sudhir N  wrote:
> 
>> I have same problem, I could not find a solution, seems grunt doesn't work
>> on Cygwin, I stopped trying.. I run scripts
>> 
>> 
>> Sudhir N
>> 
>> -Original Message-
>> From: Darpan R [mailto:darpa...@gmail.com]
>> Sent: Friday, August 02, 2013 6:59 PM
>> To: user@pig.apache.org
>> Subject: Grunt Shell hangs on Cygwin.
>> 
>> Hi Guys,
>> 
>> I am running Hadoop on local mode on my windows 7 machine (32 Bit).
>> I've installed HIVE/PIG/Hadoop/Java6 all on the C: drive.
>> I am using Cygwin version : 2.819.
>> PIG Version I tried with 0.11 and 0.10 (both I am facing issue) Hadoop
>> Version : 1.1.2 Hive Version : 0.10 Java version : 1.6 minor version 34
>> 
>> I've mounted c: on the cygwin. I am able to run hadoop commands from the
>> cygwin terminal for example : fs -ls etc. I am also able to start grunt and
>> hive shells.
>> But the real problem is :
>> Any command I enter on grunt shell ( example : fs -ls or records =
>> LOAD.
>> ) I do not see any output, it kind of hangs.
>> Similarly with the hive prompt if I give the command as show tables ; I do
>> not see any output just cursor keeps on blinking! Any keyboard inputs and
>> gives NOTHING.
>> System appears to be doing NOTHING.
>> To me everything looks fine but definitely something is going wrong :-)
>> 
>> 
>> Here are my classpath and environment variables from .bashrc file:
>> 
>> export JAVA_HOME=/c/Java/jdk1.6.0_34
>> export HADOOP_HOME=/c/Hadoop
>> export PIG_HOME=/c/PIG
>> export HIVE_HOME=/c/Hive
>> export HADOOP_BIN=$HADOOP_HOME/bin/hadoop
>> export PATH=$PATH:/c/Java/jdk1.6.0_34/bin
>> export PATH=$PATH:$HADOOP_HOME/bin
>> export PATH=$PATH:$HIVE_HOME/bin
>> export PATH=$PATH:$PIG_HOME/bin
>> 
>> I am not sure if I am missing something. Any help will be highly
>> appreciated.
>> 
>> Thanks & Regards,
>> -Darpan
>> 
>> 



Re: Execute multiple PIG scripts parallely

2013-07-22 Thread Alan Gates
If you write your scripts as one large Pig script, Pig will execute them in 
parallel.  You can avoid confusing your individual scripts with each other by 
writing one master script that uses imports (see 
http://pig.apache.org/docs/r0.11.1/cont.html#import-macros ).  You just need to 
make sure your various scripts don't share variable names, as imports don't 
maintain a namespace.
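
A minimal sketch of the import approach (file, macro, and field names are
invented for illustration):

-- metrics_macros.pig: shared macro definitions
define group_and_count(rel, key) returns counted {
    g = group $rel by $key;
    $counted = foreach g generate group, COUNT($rel);
};

-- master.pig: one script, so Pig can run the independent pipelines in parallel
import 'metrics_macros.pig';
users  = load 'users' as (name:chararray, age:int, zip:chararray);
by_age = group_and_count(users, age);
by_zip = group_and_count(users, zip);
store by_age into 'out_by_age';
store by_zip into 'out_by_zip';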

Alan.

On Jul 22, 2013, at 3:34 AM, Bhavesh Shah wrote:

> Hello All,
> 
> 
> 
> I have multiple PIG Script with and currently I am executing it in sequential 
> manner using command 
> 
> pig -x mapreduce /path/to/Script/Script1.pig && /path/to/Script/Script2.pig 
> && /path/to/Script/Script3.pig
> 
> 
> 
> But now I am looking for executing those scripts in parallel as all are 
> independent of each other. I searched for it but not getting exactly.
> 
> 
> 
> So is there any way through which I can execute my all scripts parallely?
> 
> 
> 
> 
> 
> Thanks,
> 
> Bhavesh Shah
> 



Re: something about builtin.TOP

2013-07-22 Thread Alan Gates
Agreed.  Please file a JIRA on this.

Alan.

On Jul 22, 2013, at 1:57 AM, Qian, Chen(AWF) wrote:

> Hi all,
> 
> builtin.TOP() function can't ignore NULL value, it'll lead to NULL Pointer 
> error.
> 
> That doesn't make sense
> 
> Best,
> Ned
> 



Re: DISTINCT and paritioner

2013-07-18 Thread Alan Gates
You're correct.  It looks like an optimization was put in to make distinct use 
a special partitioner which prevents the user from setting the partitioner.  
Could you file a JIRA against the docs so we can get that fixed?
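
For what it's worth, the documented PARTITION BY syntax looks like the sketch
below (the partitioner class name is a placeholder; the class must be a
subclass of org.apache.hadoop.mapreduce.Partitioner):

A = load 'input' as (k:chararray, v:int);
B = group A by k partition by org.example.MyPartitioner parallel 10;     -- honored
C = distinct A partition by org.example.MyPartitioner parallel 10;       -- parses, but per the above the custom partitioner is currently ignored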

Alan.

On Jul 17, 2013, at 11:27 AM, William Oberman wrote:

> The docs say DISTINCT can take a custom partitioner.  How does that work?
> What is "K" and "V"?
> I'm having some doubts the docs are correct.  I wrote a test partitioner
> that does a System.out of K and V.  I then wrote simple scripts to do JOIN,
> GROUP and DISTINCT.  For JOIN and GROUP I see my system.outs(*).  For
> DISTINCT, I see nothing
> 
> Using 0.11.1.
> 
> will



Re: Which Pig Version with Hadoop 0.22

2013-07-17 Thread Alan Gates
We have never produced a release that works with Hadoop 0.22.  There were some 
patches for it, see https://issues.apache.org/jira/browse/PIG-2277  You might 
be able to build your own version.

Alan.
On Jul 17, 2013, at 10:41 AM, vivek thakre wrote:

> Hello All,
> 
> Which Apache Pig Release would work wiht Hadoop 0.22 release?
> 
> Thank you



Re: question about syntax for nested evaluations using bincond

2013-07-15 Thread Alan Gates
No, both are equally correct.  == has higher precedence than ?:

Alan.

On Jul 5, 2013, at 1:39 PM, mark meyer wrote:

> hello,
> 
> i am new to pig and have a question regarding the syntax arrangement for 
> nested evaluations using bincond.
> 
> both of these seem to work and produce identical results.
> 
> is one syntax "more" correct?
> 
> C = foreach B generate 
>id,
>name,
>role,
>((location == 'P02' ? 'IL': (
>(location == 'P06' ? 'FL': (
>(location == 'P09' ? 'WI': (
>(location == 'P11' ? 'CA': id as location;
> 
> 
> C = foreach B generate 
>id,
>name,
>role,
>((location == 'P02') ? 'IL': (
>(location == 'P06') ? 'FL': (
>(location == 'P09') ? 'WI': (
>(location == 'P11') ? 'CA': id as location;
> 
> 
> thx
> mark
> 
> 



Re: join with 2 skewed tables - a suggestion

2013-06-19 Thread Alan Gates

On Jun 17, 2013, at 7:24 AM, Ido Hadanny wrote:

> Hey,
> 
> We noticed that the current skewed join supports only 1 skewed table, and
> assumes that the second table isn't skewed.
> Please review this suggestion for a 2 skewed tables design:
> 
>   - Sample both tables
>   - for each skewed key (with many records in at least one table), build a
>   surrogate key in a GFCross style - e.g. if for this key there are 3M keys
>   from the left table and 7M from the right table, and there are 100 reducers
>   available, build GFCross with dimensions of sqrt(100*3/7) and sqrt(100*7/3)
> 
> What do you say? Is this a necessary enhancement request? Or is it safe to
> assume that only one table will be skewed in each join?

When we built the original skewed join we chose to worry about it only in the 
case of 1 table being skewed for two reasons:

1) It made joins of skewed tables (even two skewed tables) possible.  Previously 
it was possible to have a join where neither table could fit all instances of a 
given key in memory (as the default hash join implementation requires), and thus 
the join could not be done.  With this implementation you are guaranteed that 
you can split key instances for one of the inputs and thus complete the join.  
If the data is skewed on both sides the join will still be slow, as you point out.
2) It addressed most of our use cases.

Obviously being able to handle cases where both sides are skewed more 
efficiently will be very valuable.  If you're thinking of contributing in this 
area I encourage you to file a JIRA with your proposal.

Alan.
> 
> Thanks, Dudu and Ido
> 
> -- 
> Sent from my androido



Fwd: DesignLounge @ HadoopSummit

2013-06-12 Thread Alan Gates


Begin forwarded message:

> From: Eric Baldeschwieler 
> Date: June 11, 2013 10:46:25 AM PDT
> To: "common-...@hadoop.apache.org" 
> Subject: DesignLounge @ HadoopSummit
> Reply-To: common-...@hadoop.apache.org
> 
> Hi Folks,
> 
> We thought we'd try something new at Hadoop Summit this year to build upon 
> two pieces of feedback I've heard a lot this year:
> 
> Apache project developers would like to take advantage of the Hadoop summit 
> to meet with their peers to work on specific technical details of their 
> projects
> That they want to do this during the summit, not before it starts or at 
> night. I've been told BoFs and other such traditional formats have not 
> historically worked for them, because they end up being about educating users 
> about their projects, not actually working with their peers on how to make 
> their projects better.
> So we are creating a space in the summit - marked in the event guide as 
> DesignLounge - concurrent with the presentation tracks where Apache Project 
> contributors can meet with their peers to plan the future of their project or 
> work through various technical issues near and dear to their hearts.
> 
> We're going to provide white boards and message boards and let folks take it 
> from there in an unconference style.  We think there will be room for about 4 
> groups to meet at once.  Interested? Let me know what you think.  Send me any 
> ideas for how we can make this work best for you.
> 
> The room will be 231A and B at the Hadoop Summit and will run from 10:30am to 
> 5:00pm on Day 1 (26th June), and we can also run from 10:30am to 5:00pm on 
> Day 2 (27th June) if we have a lot of topics that folk want to cover.
> 
> Some of the early topics some folks told me they hope can be covered:
> 
> Hadoop Core security proposals.  There are a couple of detailed proposals 
> circulating.  Let's get together and hash out the differences.
> Accumulo 1.6 features
> The Hive vectorization project.  Discussion of the design and how to phase it 
> in incrementally with minimum complexity.
> Finishing Yarn - what things need to get done NOW to make Yarn more effective
> If you are a project lead for one of the Apache projects, look at the 
> schedule below and suggest a few slots when you think it would be best for 
> your project to meet.  I'll try to work out a schedule where no more than 2 
> projects are using the lounge at once.  
> 
> Day 1, 26th June: 10:30am - 12:30pm, 1:45pm - 3:30pm, 3:45pm - 5:00pm
> 
> Day 2, 27th June: 10:30am - 12:30pm, 1:45pm - 3:30pm, 3:45pm - 5:00pm
> 
> It will be up to you, the hadoop contributors, from there.
> 
> Look forward to seeing you all at the summit,
> 
> E14
> 
> PS Please forward to the other -dev lists.  This event is for folks on the 
> -dev lists.
> 



Re: Single Output file from STORE command

2013-05-28 Thread Alan Gates
Nothing that uses MapReduce as its underlying execution engine creates a single 
file when running multiple reducers, because MapReduce itself doesn't.  The real 
question is: if you want to keep the output on Hadoop, why worry about whether 
it's a single file?  Most applications on Hadoop will take a directory as input 
and read all the files contained in it.
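
For example, a downstream Pig job can simply point at the output directory (the
path here is a placeholder); all of the part files under it will be read:

next = load 'path/to/output_dir' using PigStorage('\t');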

Alan.

On May 24, 2013, at 12:11 PM, Mix Nin wrote:

> STORE command produces multiple output files. I want a single output file
> and I tried using command as below
> 
> STORE (foreach (group NoNullData all) generate flatten($1))  into '';
> 
> This command produces one single file but at the same time forces to use
> single reducer which kills performance.
> 
> How do I overcome the scenario?
> 
> Normally   STORE command produces multiple output files, apart from that I
> see another file
> "_SUCCESS" in output directory. I ma generating metadata file  ( using
> PigStorage('\t', '-schema') ) in output directory
> 
> I thought of using  getmerge as follows
> 
> *hadoop* fs -*getmerge*
> 
> But this requires
> 1)eliminating files other than data files in HDFS directory
> 2)It creates a single file in local directory but not in HDFS directory
> 3)I need to again move file from local directory to HDFS directory which
> may  take additional time , depending on size of single file
> 4)I need to agin place the files which I eliminated in Step 1
> 
> 
> Is there an efficient way for my problem?
> 
> Thanks



Fwd: Hadoop In Seoul 2013 Conference Calls For Speakers

2013-05-21 Thread Alan Gates


Begin forwarded message:

> From: "Edward J. Yoon" 
> Date: May 21, 2013 1:29:06 AM PDT
> To: gene...@hadoop.apache.org
> Subject: Hadoop In Seoul 2013 Conference Calls For Speakers
> Reply-To: gene...@hadoop.apache.org
> 
> Hi,
> 
> I'm planning the Hadoop In Seoul 2013 Open Conference with some
> organizations, such as the Korea government IT agency and OSS
> Association. We're looking for people who have interested in sharing
> about Hadoop Internals or their experience of developing applications
> with Hadoop ecosystem.
> 
> This is your great opportunity to share your insights, advanced
> technical knowledge and experience of Apache Hadoop with the Korean
> dev group. If you have an interesting presentation idea, we want to
> hear from you!
> 
> Conference topics:
> 
> * Internals
> * Ecosystem
> * Practical know-how and Case studies
> 
> We have created a Call for Speakers form to invite potential speakers
> to participate: http://dev.hadoop.co.kr/
> 
> The call for speakers will be closed on July 15th, and conference will
> be held on Saturday Aug 01-02 or Sep 05-06, 2013 (tentative schedule)
> in Seoul.
> 
> If you have any questions about this or need help about schedule and
> expenses for flights, Please feel free to contact:
> edwardy...@apache.org. If possible, we'd like to invite one+ active
> OSS committers in Apache Hadoop ecosystem.
> 
> Thanks ;)
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



Re: PIG: Transform based on value in field

2013-05-14 Thread Alan Gates
B = foreach A generate a1, (a2 == 0 ? a2 + 1 : a2) as a2, a3;

Alan.

On May 14, 2013, at 9:10 AM, Ashish Gupta wrote:

> I want to something like this
> 
> B = FOREACH A GENERATE a1, *if a2 = 0: a2=a2+1 else a2*, a3)
> 
> how to do " if a2 = 0: a2=a2+1 else a2" in PIG
> 
> (or it could be "if a2 matches < some regex>: a2+0 else a2")
> 
> 
> I am using Pig 0.10



Re: Pig Unique Counts on Multiple Subsets of a Large Input

2013-05-06 Thread Alan Gates
In the script you gave I'd be surprised if it's spending time in the map phase, 
as the map should be very simple.  It's the reduce phase I'd expect to be very 
expensive, because your mapping UDFs prevent Pig from using the algebraic nature 
of COUNT (that is, it has to ship all of the records to the reducers, not just 
the counts).  If your input is large this will be expensive.  What happens if 
you switch your script to:

A = load ...
B = foreach A generate dimA, dimB, udf.newUserIdForCategory1(userId, activity) as userId1, ...
C = group B by (dimA, dimB);
D = foreach C generate flatten(group), COUNT(B.userId1), ...

When you said it was taking a long time in the map phase were you trying 
something like the above?  If so I'd check how long your UDF is taking.  Unless 
you're reading tons of data on a very small cluster the above should be very 
fast.  It definitely should not reread the input for each UDF.

Other things to check:
What's your parallel count set at?  That is, how many reducers are you running?
How many waves of maps does this create?  That is, what's the number of maps 
this produces divided by the number of slots you get on your cluster to run it?

Alan.

On May 5, 2013, at 8:11 PM, Thomas Edison wrote:

> Hi there,
> 
> I have a huge input on an HDFS and I would like to use Pig to calculate
> several unique metrics. To help explain the problem more easily, I assume
> the input file has the following schema:
> 
> userId:chararray, dimensionA_key:chararray, dimensionB_key:chararray,
> dimensionC_key:chararray, activity:chararray, ...
> 
> Each record represent an activity performed by that userId.
> 
> Based on the value in the activity field, this activity record will be
> mapped to 1 or more categories. There are about 10 categories in total.
> 
> Now I need to count the number of unique users for different dimension
> combinations (i.e. A, B, C, A+B, A+C, B+C, A+B+C) for each activity
> category.
> 
> What would be the best practices to perform such calculation?
> 
> I have tried several ways. Although I can get the results I want, it takes
> a very long time (i.e. days). What I found is most of the time is spent on
> the map phase. It looks like the script tries to load the huge input file
> every time it tries to calculate one unique count. Is there a way to
> improve this behavior?
> 
> I also tried something similar to below, but it looks like it reaches the
> memory cap for a single reducer and just stuck at the last reducer step.
> 
> source = load ... as (userId:chararray, dimensionA_key:chararray,
> dimensionB_key:chararray, dimensionC_key:chararray,
> activity:chararray, ...);
> a = group source by (dimensionA_key, dimensionB_key);
> b = foreach a {
>userId1 = udf.newUserIdForCategory1(userId, activity);
>-- this udf returns the original user id if the activity should be
> mapped to Category1 and None otherwise
>userId2 = udf.newUserIdForCategory2(userId, activity);
>userId3 = udf.newUserIdForCategory3(userId, activity);
>...
>userId10 = udf.newUserIdForCategory10(userId, activity);
>generate FLATTEN(group), COUNT(userId1), COUNT(userId2),
> COUNT(userId3), ..., COUNT(userId10);
> }
> store b ...;
> 
> Thanks.
> 
> T.E.



Re: Hbase Hex Values

2013-05-06 Thread Alan Gates
I am not aware of any built-in or Piggybank UDF that converts Hex to Int, but 
it would be a welcome contribution if you wanted to write it.
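
A rough sketch of what such a UDF might look like (the class name and behavior
are illustrative only, not an existing built-in):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Converts a hex string such as "0x1A" or "1A" to an Integer; returns null on bad input.
public class HexToInt extends EvalFunc<Integer> {
    @Override
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String hex = input.get(0).toString().trim();
        if (hex.startsWith("0x") || hex.startsWith("0X")) {
            hex = hex.substring(2);
        }
        try {
            return Integer.parseInt(hex, 16);
        } catch (NumberFormatException e) {
            return null;  // treat unparseable values as null rather than failing the task
        }
    }
}

You would then REGISTER the jar containing it and call it in a foreach, e.g.
B = foreach A generate HexToInt(hexcolumn);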

Alan.

On May 5, 2013, at 8:14 PM, John Meek wrote:

> Hey all,
> 
> If I need to load a Hbase table with Hex values into Pig, does that require a 
> specific UDF? IS there any inbuilt function in Pig? I searched the 
> documentation but cannot find anything that lets me convert Hex to Int.
> 
> 
> JM



Fwd: CfP 2013 Workshop on Middleware for HPC and Big Data Systems (MHPC'13)

2013-04-25 Thread Alan Gates


Begin forwarded message:

> From: MHPC 2013 
> Date: April 24, 2013 10:23:55 AM PDT
> To: u...@hadoop.apache.org
> Subject: Fwd: CfP 2013 Workshop on Middleware for HPC and Big Data Systems 
> (MHPC'13)
> Reply-To: u...@hadoop.apache.org
> 
> 
> we apologize if you receive multiple copies of this message
> ===
> 
> CALL FOR PAPERS
> 
> 2013 Workshop on
> 
> Middleware for HPC and Big Data Systems
> 
> MHPC '13
> 
> as part of Euro-Par 2013, Aachen, Germany
> 
> ===
> 
> Date: August 27, 2013
> 
> Workshop URL: http://m-hpc.org
> 
> Springer LNCS
> 
> SUBMISSION DEADLINE:
> 
> May 31, 2013 - LNCS Full paper submission (rolling abstract submission)
> June 28, 2013 - Lightning Talk abstracts
> 
> 
> SCOPE
> 
> Extremely large, diverse, and complex data sets are generated from
> scientific applications, the Internet, social media and other applications.
> Data may be physically distributed and shared by an ever larger community. 
> Collecting, aggregating, storing and analyzing large data volumes 
> presents major challenges. Processing such amounts of data efficiently
> has been an issue to scientific discovery and technological
> advancement. In addition, making the data accessible, understandable and
> interoperable includes unsolved problems. Novel middleware architectures,
> algorithms, and application development frameworks are required.
> 
> In this workshop we are particularly interested in original work at the
> intersection of HPC and Big Data with regard to middleware handling
> and optimizations. Scope is existing and proposed middleware for HPC
> and big data, including analytics libraries and frameworks. 
> 
> The goal of this workshop is to bring together software architects, 
> middleware and framework developers, data-intensive application developers
> as well as users from the scientific and engineering community to exchange
> their experience in processing large datasets and to report their scientific
> achievement and innovative ideas. The workshop also offers a dedicated forum
> for these researchers to access the state of the art, to discuss problems
> and requirements, to identify gaps in current and planned designs, and to
> collaborate in strategies for scalable data-intensive computing.
> 
> The workshop will be one day in length, composed of 20 min paper
> presentations, each followed by 10 min discussion sections.
> Presentations may be accompanied by interactive demonstrations.
> 
> 
> TOPICS 
> 
> Topics of interest include, but are not limited to: 
> 
> - Middleware including: Hadoop, Apache Drill, YARN, Spark/Shark, Hive, Pig, 
> Sqoop,
> HBase, HDFS, S4, CIEL, Oozie, Impala, Storm and Hyrack
> - Data intensive middleware architecture
> - Libraries/Frameworks including: Apache Mahout, Giraph, UIMA and GraphLab
> - NG Databases including Apache Cassandra, MongoDB and CouchDB/Couchbase
> - Schedulers including Cascading
> - Middleware for optimized data locality/in-place data processing
> - Data handling middleware for deployment in virtualized HPC environments
> - Parallelization and distributed processing architectures at the middleware 
> level
> - Integration with cloud middleware and application servers
> - Runtime environments and system level support for data-intensive computing
> - Skeletons and patterns
> - Checkpointing
> - Programming models and languages
> - Big Data ETL 
> - Stream processing middleware
> - In-memory databases for HPC
> - Scalability and interoperability
> - Large-scale data storage and distributed file systems
> - Content-centric addressing and networking 
> - Execution engines, languages and environments including CIEL/Skywriting
> - Performance analysis, evaluation of data-intensive middleware
> - In-depth analysis and performance optimizations in existing data-handling
> middleware, focusing on indexing/fast storing or retrieval between compute
> and storage nodes
> - Highly scalable middleware optimized for minimum communication
> - Use cases and experience for popular Big Data middleware
> - Middleware security, privacy and trust architectures
> 
> DATES
> 
> Papers:
> Rolling abstract submission
> May 31, 2013 - Full paper submission 
> July 8, 2013 - Acceptance notification
> October 3, 2013 - Camera-ready version due 
> 
> Lightning Talks: 
> June 28, 2013 - Deadline for lightning talk abstracts
> July 15, 2013 - Lightning talk notification 
> 
> August 27, 2013 - Workshop Date 
> 
> 
> TPC
> 
> CHAIR 
> 
>   Michael Alexander (chair), TU Wien, Austria 
>   Anastassios Nanos (co-chair), NTUA, Greece
>   Jie Tao (co-chair), Karlsruhe Institute of Technology, Germany
>   Lizhe Wang (co-chair), Chinese Academy of Sciences, China
>   Gianluigi Zanetti (co-chair), CRS4, Italy
> 
> PROGRAM COMMITTEE 
> 
> Amitanand Aiyer, Facebook, USA
> Costas Bekas, IBM, Switzerland
> Jakob Blomer, CERN, Switzerland

Re: long parse time

2013-03-29 Thread Alan Gates
What version of Pig are you using?  Unreasonably long parse times were an issue 
in Pig 0.9 and 0.10; I believe those issues were fixed in Pig 0.11.

Alan.

On Mar 28, 2013, at 12:51 PM, Patrick Salami wrote:

> We have some very long pig scripts that run several times per day. We
> believe that the script parsing process takes very long (about 1h). During
> this time, the pig command just hangs before any output is displayed (I am
> assuming this is the parsing phase). My question is, can this process be
> optimized by somehow serializing the intermediate parsed script to disk
> after the parsing phase is complete so that we don't have to go through the
> parsing process each time the script is run (so long as the script itself
> does not change)? That way, we could then load and run the parsed
> representation of the script rather than re-parsing it for each run. Since
> this is probably not a readily-available feature, could someone please
> point me to the right place in the code where this intermediate output can
> be intercepted?
> 
> Thanks!



Re: Reaching source code

2013-03-14 Thread Alan Gates
You can use explain to show you the plan Pig will use to execute your script.  
This won't show you the exact Java code.  If you want to find out exactly what 
Java code is running for a particular operator the easiest thing to do is 
probably run the query in local mode and attach a debugger.
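
As a quick illustration of the first part (the aliases and input path are made up):

A = load 'input' as (user:chararray, cnt:int);
B = foreach A generate user, cnt * 2;
explain B;    -- prints the logical, physical, and MapReduce plans for B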

Alan.

On Mar 14, 2013, at 6:50 AM, Milind Vaidya wrote:

> Hi
> 
> I need to know what java (or otherwise ) code executes for a pig script I
> am running ("GENERATE" command to be specific).
> 
> Is there any ways to do that ?
> 
> Thanks



Re: How Pig generates DAG

2013-02-25 Thread Alan Gates
In the Pig code base check out 
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java
 and 
src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java

These are the classes that control the generation of MapReduce jobs.

Alan.

On Feb 25, 2013, at 1:39 PM, Preeti Gupta wrote:

>> a set of MapReduce jobs
> 
> On Feb 25, 2013, at 1:35 PM, Alan Gates  wrote:
> 
>> Pig generates several DAGs (a logical plan, a physical plan, a set of 
>> MapReduce jobs).  Which one are you interested in?  
>> 
>> Alan.
>> 
>> On Feb 25, 2013, at 12:02 PM, Preeti Gupta wrote:
>> 
>>> Hi,
>>> 
>>> I need to do some modifications here and need to know how Pig generates 
>>> DAG. Can someone throw some light on this?
>>> 
>>> regards
>>> 
>>> preeti
>> 
> 



Re: How Pig generates DAG

2013-02-25 Thread Alan Gates
Pig generates several DAGs (a logical plan, a physical plan, a set of MapReduce 
jobs).  Which one are you interested in?  

Alan.

On Feb 25, 2013, at 12:02 PM, Preeti Gupta wrote:

> Hi,
> 
> I need to do some modifications here and need to know how Pig generates DAG. 
> Can someone throw some light on this?
> 
> regards
> 
> preeti



Re: Just started

2013-02-24 Thread Alan Gates
For books, check out 
http://www.amazon.com/Programming-Pig-Alan-Gates/dp/1449302645/ref=sr_1_1?ie=UTF8&qid=1361724828&sr=8-1&keywords=programming+pig
 

There's also pretty good docs on pig.apache.org under the documentation tab.

Alan.

On Feb 24, 2013, at 8:44 AM, William Kang wrote:

> Hi All,
> I just got started with Pig. It looks very interesting. It might offer
> me a great alternative to my SQL-like tools.
> 
> Would you please give me some suggestions on a few good books or
> tutorials for me to continue?
> 
> Many thanks.
> 
> 
> William



Re: Reduce Tasks

2013-02-01 Thread Alan Gates
Setting mapred.reduce.tasks won't work, as Pig overrides it.  See 
http://pig.apache.org/docs/r0.10.0/perf.html#parallel for info on how to set 
the number of reducers in Pig.
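
In short, something like this (200 is just an example value):

SET default_parallel 200;              -- script-wide default for reduce-side operators
B = GROUP A BY userid PARALLEL 200;    -- or per operator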

Alan.

On Feb 1, 2013, at 4:53 PM, Mohit Anchlia wrote:

> Just slightly different problem I tried setting SET mapred.reduce.tasks to
> 200 in pig but still more tasks were launched for that job. Is there any
> other way to set the parameter?
> 
> On Fri, Feb 1, 2013 at 3:15 PM, Harsha  wrote:
> 
>> 
>> its the total number of reducers not active reducers.
>> If you specify lower number  each reducer gets more data to process.
>> --
>> Harsha
>> 
>> 
>> On Friday, February 1, 2013 at 2:54 PM, Mohit Anchlia wrote:
>> 
>>> Thanks! Is there a downside of reducing number of reducers? I am trying
>> to
>>> alleviate high CPU.
>>> 
>>> With low reducers using parallel clause does it mean that more data is
>>> processed by each reducer or does it mean how many reducers can be active
>>> at one time
>>> 
>>> On Fri, Feb 1, 2013 at 2:44 PM, Harsha > har...@defun.org)> wrote:
>>> 
 Mohit,
 you can use PARALLEL clause to specify reduce tasks. More info here
 
>> http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features
 
 --
 Harsha
 
 
 On Friday, February 1, 2013 at 2:42 PM, Mohit Anchlia wrote:
 
> Is there a way to specify max number of reduce tasks that a job
>> should
 span
> in pig script without having to restart the cluster?
 
 
>>> 
>>> 
>>> 
>> 
>> 
>> 



Re: Run a job async

2013-01-24 Thread Alan Gates
You might want to look at webhcat's code.  It produces a servlet that it embeds 
in a jetty server.  You may be able to copy paste this to get what you want.

The code of interest is in the hcat repository under webhcat/svr.

Alan.

On Jan 24, 2013, at 9:42 AM, Prashant Kommireddi wrote:

> Thanks Alan. We are trying to plug Pig into our existing app server.
> We have already done this for Java MR. The difficulty we are facing is
> with the fact that we can use JobClient.submitJob and jobtracker's job
> end notification to run jobs async, whereas PigServer.executeBatch
> blocks until pig job is complete.
> 
> Sent from my iPhone
> 
> On Jan 24, 2013, at 9:31 AM, Alan Gates  wrote:
> 
>> If you're looking for an app server for Pig I'd take a look at a couple of 
>> other projects already out there that can do this:
>> 
>> 1) webhcat (fka Templeton, now part of the HCatalog project).  It provides a 
>> REST API that launches Pig, Hive, or MR jobs and allows you to manage them, 
>> get results, etc.  It's in HCatalog 0.5, which is in the release candidate 
>> state.  You can go to 
>> http://people.apache.org/~travis/hcatalog-0.5.0-incubating-candidate-1/ and 
>> pick up the release candidate.
>> 
>> 2) Oozie.  Oozie's a workflow engine for Hadoop, but it also supports 
>> submission of single Pig or MR jobs via REST.  It may be a little 
>> heavyweight for what you want but it works.
>> 
>> Alan.
>> 
>> On Jan 23, 2013, at 9:22 PM, Prashant Kommireddi wrote:
>> 
>>> Both. Think of it as an app server handling all of these requests.
>>> 
>>> Sent from my iPhone
>>> 
>>> On Jan 23, 2013, at 9:09 PM, Jonathan Coveney  wrote:
>>> 
>>>> Thousands of requests, or thousands of Pig jobs? Or both?
>>>> 
>>>> 
>>>> 2013/1/23 Prashant Kommireddi 
>>>> 
>>>>> Did not want to have several threads launched for this. We might have
>>>>> thousands of requests coming in, and the app is doing a lot more than only
>>>>> Pig.
>>>>> 
>>>>> On Wed, Jan 23, 2013 at 5:44 PM, Jonathan Coveney >>>>> wrote:
>>>>> 
>>>>>> start a separate Process which runs Pig?
>>>>>> 
>>>>>> 
>>>>>> 2013/1/23 Prashant Kommireddi 
>>>>>> 
>>>>>>> Hey guys,
>>>>>>> 
>>>>>>> I am trying to do the following:
>>>>>>> 
>>>>>>> 1. Launch a pig job asynchronously via Java program
>>>>>>> 2. Get a notification once the job is complete (something similar to
>>>>>>> Hadoop callback with a servlet)
>>>>>>> 
>>>>>>> I looked at PigServer.executeBatch() and it seems to be waiting until
>>>>> job
>>>>>>> completes.This is not what I would like my app to do.
>>>>>>> 
>>>>>>> Any ideas?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>> 
>>>>> 
>> 



Re: Run a job async

2013-01-24 Thread Alan Gates
If you're looking for an app server for Pig I'd take a look at a couple of 
other projects already out there that can do this:

1) webhcat (fka Templeton, now part of the HCatalog project).  It provides a 
REST API that launches Pig, Hive, or MR jobs and allows you to manage them, get 
results, etc.  It's in HCatalog 0.5, which is in the release candidate state.  
You can go to 
http://people.apache.org/~travis/hcatalog-0.5.0-incubating-candidate-1/ and 
pick up the release candidate.

2) Oozie.  Oozie's a workflow engine for Hadoop, but it also supports 
submission of single Pig or MR jobs via REST.  It may be a little heavyweight 
for what you want but it works.

Alan.

On Jan 23, 2013, at 9:22 PM, Prashant Kommireddi wrote:

> Both. Think of it as an app server handling all of these requests.
> 
> Sent from my iPhone
> 
> On Jan 23, 2013, at 9:09 PM, Jonathan Coveney  wrote:
> 
>> Thousands of requests, or thousands of Pig jobs? Or both?
>> 
>> 
>> 2013/1/23 Prashant Kommireddi 
>> 
>>> Did not want to have several threads launched for this. We might have
>>> thousands of requests coming in, and the app is doing a lot more than only
>>> Pig.
>>> 
>>> On Wed, Jan 23, 2013 at 5:44 PM, Jonathan Coveney >>> wrote:
>>> 
 start a separate Process which runs Pig?
 
 
 2013/1/23 Prashant Kommireddi 
 
> Hey guys,
> 
> I am trying to do the following:
> 
>  1. Launch a pig job asynchronously via Java program
>  2. Get a notification once the job is complete (something similar to
>  Hadoop callback with a servlet)
> 
> I looked at PigServer.executeBatch() and it seems to be waiting until
>>> job
> completes.This is not what I would like my app to do.
> 
> Any ideas?
> 
> Thanks,
> 
 
>>> 



Re: Hard-coded inline relations

2013-01-24 Thread Alan Gates
I agree this would be useful for debugging, but I'd go about it a different 
way.  Rather than add new syntax as you propose, it seems we could easily 
create an inline loader, so your script would look something like:

A = load '{(Hello), (World)}' using InlineLoader();
dump A;

Alan.

On Jan 18, 2013, at 10:49 AM, Michael Malak wrote:

> I'm new to Pig, and it looks like there is no provision to declare relations 
> inline in a Pig script (without LOADing from an external file)?
> 
> Based on
> http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#Constants
> I would have thought the following would constitute "Hello World" for Pig:
> 
> A = {('Hello'),('World')};
> DUMP A;
> 
> But I get a syntax error.  The ability to inline relations would be useful 
> for debugging.  Is this limitation by design, or is it just not implemented 
> yet?
> 



Re: Pig error

2013-01-15 Thread Alan Gates
Could you share your script or a script that gets this error message?

Alan.

On Jan 14, 2013, at 2:19 PM, Phanish Lakkarasu wrote:

> Hi all,
> 
> When I am using the JOIN operator in Pig, I am getting the following error:
> 
> Pig joins inner plans can only have one output leaf?
> 
> Can any one tell me, why does this occur.
> 
> Regards
> Abhi



Re: JsonLoader schema field order shouldn't matter

2013-01-08 Thread Alan Gates
I would open a new JIRA, since 1914 is focussed on building an alternative that 
discovers schema, while you are wanting to improve the existing one.

Alan.

On Jan 7, 2013, at 5:02 PM, Tim Sell wrote:

> This seems like a bug to me. It makes it risky to work with JSON data
> generated by something other than Pig since the ordering might change.
> What do you think?
> 
> I didn't see a bug for it in Jira, so would this (still open) one be
> the place to mention it? Or should I make a new one?
> https://issues.apache.org/jira/browse/PIG-1914
> 
> ~T
> 
> 
> On 7 January 2013 20:24, Alan Gates  wrote:
>> Currently the JsonLoader does assume ordering of the fields.  It does not do 
>> any name matching against the given schema to find the right field.
>> 
>> Alan.
>> 
>> On Jan 7, 2013, at 11:56 AM, Tim Sell wrote:
>> 
>>> When using JsonLoader with Pig 0.10.0
>>> 
>>> if I have an input.json file that looks like this:
>>> 
>>> {"date": "2007-08-25", "id": 16}
>>> {"date": "2007-09-08", "id": 17}
>>> {"date": "2007-09-15", "id": 18}
>>> 
>>> And I use
>>> 
>>> a = LOAD 'input.json' USING JsonLoader('id:int,date:chararray');
>>> DUMP a;
>>> 
>>> I get errors when it tries to force the date fields into an integer.
>>> 
>>> Shouldn't this work independent of the ordering of the schema fields?
>>> Json writers generally don't make guarantees about the ordering.
>>> 
>>> One alternative (though annoying) would be to use elephant bird
>>> instead, but I can't get that to compile against hadoop 2.0.0 and Pig
>>> 0.10.0.
>>> 
>>> ~Tim
>> 



Re: JsonLoader schema field order shouldn't matter

2013-01-07 Thread Alan Gates
Currently the JsonLoader does assume ordering of the fields.  It does not do 
any name matching against the given schema to find the right field.

Alan.

On Jan 7, 2013, at 11:56 AM, Tim Sell wrote:

> When using JsonLoader with Pig 0.10.0
> 
> if I have an input.json file that looks like this:
> 
> {"date": "2007-08-25", "id": 16}
> {"date": "2007-09-08", "id": 17}
> {"date": "2007-09-15", "id": 18}
> 
> And I use
> 
> a = LOAD 'input.json' USING JsonLoader('id:int,date:chararray');
> DUMP a;
> 
> I get errors when it tries to force the date fields into an integer.
> 
> Shouldn't this work independent of the ordering of the schema fields?
> Json writers generally don't make guarantees about the ordering.
> 
> One alternative (though annoying) would be to use elephant bird
> instead, but I can't get that to compile against hadoop 2.0.0 and Pig
> 0.10.0.
> 
> ~Tim



Re: Multiple input file

2012-12-22 Thread Alan Gates
Yes.  See http://pig.apache.org/docs/r0.10.0/basic.html#load for a discussion 
of how to use globs in file paths.
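
For example (the paths here are hypothetical):

A = load '/data/logs/2012-12-*/part-*';
B = load '/data/{clicks,impressions}/2012-12-21/*' using PigStorage(',');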

Alan.
 
On Dec 21, 2012, at 10:38 PM, Mohit Anchlia wrote:

> Is it possible to load multiple files in the same load command? I have
> files in different path that I need to load, is that possible?



Re: pig ship tar files

2012-12-20 Thread Alan Gates
See http://pig.apache.org/docs/r0.10.0/basic.html#define-udfs especially the 
section on SHIP.
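
A rough sketch of what the streaming setup could look like (the file names are made up; note that SHIP only copies the listed files into the task's working directory, so unpacking the tarball is up to the wrapper script):

DEFINE nltk_stream `run_nltk.sh` SHIP('run_nltk.sh', 'nltk.tar.gz');
processed = STREAM raw THROUGH nltk_stream AS (token:chararray, tag:chararray);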

Alan.

On Dec 20, 2012, at 10:01 AM, Danfeng Li wrote:

> I have read a lot about how Pig can ship a tar file and untar it before execution. 
> However, I couldn't find any example. Can someone provide an example?
> 
> What I would like to do is to ship a python module, such as nltk, for my 
> streaming.
> 
> Thanks.
> 
> Dan
> 
> 



Re: Do we have any plan for "Cost based optimizer"?

2012-12-06 Thread Alan Gates
I am not aware of any work going on for this or plans in this area at the 
moment.

Alan.

On Dec 4, 2012, at 6:32 PM, lulynn_2008 wrote:

> Hi All,
> 
> I just noticed that Pig Committer DaiJianYong has mentioned "Cost based 
> optimizer" for pig performanceoptimization.
> My question are:
> Do we have any plan for this new feature? Like which branch will include this 
> feature?
> Is there any task that tracks this feature now?
> 
> Thanks



Re: Physical Plan

2012-11-26 Thread Alan Gates
No, it need not be binary.  A split can have multiple children.

Alan.

On Nov 17, 2012, at 4:32 PM, Sarah Mohamed wrote:

> Is the Physical Plan a binary tree? (i.e., could any node have more than two
> Physical Operator children?)
> --
> Regards,
> Sarah M. Hassan



Re: computing avg in pig

2012-11-06 Thread Alan Gates
A = load 'input_file';
B = group A all;
C = foreach B generate AVG(A.$1);

This groups all of your records into one bag and then takes the average of the 
second column.

Alan.

On Nov 6, 2012, at 11:19 AM, jamal sasha wrote:

>> I have data in format
> 
>> 
>> 
>>1,1.2
>> 
>>2,1.3
>> 
>>and so on..
>> 
>> 
>> 
>> So basically this is id, val combination where id is unique...
>> 
>> 
>> 
>> I want to calculate the average of all the values..
>> 
>> 
>> 
>>So here.. avg(1.2,1.3)
>> 
>> 
>> 
>> I was going thru the documentation but most of the aggregation function
> involves grouping by some id.. and then using AVG... but since the id is
> unique.. how do I group them???
>> 
>> So basically the outcome of this endeavor would be one float..
>> 
>> Any suggestions will be greatly appreciated.
>> 
>> Thanks



Re: CONCAT(null, "something") == NULL ?

2012-11-05 Thread Alan Gates
Better in terms of semantics or in terms of documentation?  We can't change the 
semantics of null in Pig; it's been that way the whole time.  Plus this concept 
of unknown data is important in data processing.  If we had it to do over again 
we could name it 'unknown' instead of null, but it seems late for that now.

Alan.

On Nov 2, 2012, at 3:40 PM, Cheolsoo Park wrote:

> Hi Alan,
> 
> Recently, I have seen several similar confusions about nulls in Pig. For
> example, here is another discussion:
> https://issues.apache.org/jira/browse/PIG-3021.
> 
> We are documenting them, but apparently, many users find it confusing. I am
> wondering if there is anything that we can do better.
> 
> Thanks,
> Cheolsoo
> 
> On Fri, Nov 2, 2012 at 3:33 PM, Alan Gates  wrote:
> 
>> To give some context, the null semantics in Pig follow SQL's.  In SQL,
>> null is viral, so any operation with null results in null.  The idea is
>> that null means unknown, not empty.  So concat('x', unknown) = unknown.
>> 
>> Alan.
>> 
>> On Nov 2, 2012, at 3:09 PM, Yang wrote:
>> 
>>> looks a more intuitive result should be "something" , right?
>>> 
>>> but on my system it gave null
>> 
>> 



Re: CONCAT(null, "something") == NULL ?

2012-11-02 Thread Alan Gates
To give some context, the null semantics in Pig follow SQL's.  In SQL, null is 
viral, so any operation with null results in null.  The idea is that null means 
unknown, not empty.  So concat('x', unknown) = unknown.
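
If the intent is to treat a missing value as an empty string, that has to be done explicitly, for example (assuming f1 and f2 are chararrays):

b = foreach a generate CONCAT((f1 is null ? '' : f1), f2);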

Alan.

On Nov 2, 2012, at 3:09 PM, Yang wrote:

> looks a more intuitive result should be "something" , right?
> 
> but on my system it gave null



Re: Is that possible to use Pig to do an optimized secondary sort.

2012-10-31 Thread Alan Gates
Seeing your Pig Latin script will help us determine whether this will work in 
your case.  But in general Pig uses secondary sort when you do an order by in a 
nested foreach.  So if you are grouping you could order within that group and 
then pass it to your UDF.
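
The general shape would be something like this (the aliases and the ComputeAUC UDF are invented for illustration):

grp = group scores by model;
result = foreach grp {
    ordered = order scores by ctr desc;   -- uses Pig's secondary sort
    generate group, ComputeAUC(ordered);
};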

Alan.

On Oct 31, 2012, at 1:20 AM, Stanley Xu wrote:

> Dear buddies,
> 
> We are trying to write some of the UDF to do some machine learning work. We
> did a simple experiment to calculate the AUC through a UDF like the
> following code in gist
> 
> https://gist.github.com/3985764
> 
> The map-reduce job will only take a few minutes, but will wait
> there for hours to do the cleanup.
> 
> I guess the reason is that the sort inside the foreach will generate lots
> of data spill to local fs and takes a long time to do cleanup there.
> 
> In a Java map-reduce program, we could make it like a secondary sort. We
> make the model + ctr as the key so the same model's ctr will be sorted, and
> group by only the model name part, then the sort is done after shuffling.
> 
> I  am wondering if we could do that kind of optimization in pig as well?



Re: Reading fixed width files in pig

2012-10-26 Thread Alan Gates
I am not aware of any.

Alan.

On Oct 23, 2012, at 6:03 AM, ranjith raghunath wrote:

> Team,
> 
> Are there any out-of-the-box load functions for fixed-width files?



Re: Welcome our newest committer Cheolsoo Park

2012-10-26 Thread Alan Gates
Welcome Cheolsoo, and well deserved.

Alan.

On Oct 26, 2012, at 2:54 PM, Julien Le Dem wrote:

> All,
> 
> Please join me in welcoming Cheolsoo Park as our newest Pig committer.
> He's been contributing to Pig for a while now, helping fixing the
> build and improve Pig. We look forward to him being a part of the
> project.
> 
> Julien



Re: FOREACH GENERATE Conditional?

2012-10-24 Thread Alan Gates
Are you sure Pig is spawning extra map jobs for this?  The multi-query 
optimizer should be pushing these back together into one job.

If it isn't, you should be able to accomplish the same thing with trinary logic 
and a single filter:

tagged = foreach main_set generate date, ((blah == 'a' and meh == 'b') ? 'likes' :
    ((blah == 'b' and meh == 'c') ? 'dislikes' :
    ((blah == 'c' and meh == 'd') ? 'newuserregs' : ''))) as type;
all_time = filter tagged by type != '';

(Not sure about all the parenthesis placement, as I didn't run it.)

Alan.

On Oct 24, 2012, at 2:51 AM, Eli Finkelshteyn wrote:

> Hi folks,
> I have a pig script that right now looks like this:
> 
> …
> likes = FILTER main_set BY blah == 'a' AND meh == 'b';
> likes_time = FOREACH likes GENERATE date, 'likes' AS type;
> 
> dislikes = FILTER main_set BY blah == 'b' AND meh == 'c';
> dislikes_time = FOREACH dislikes GENERATE date, 'dislikes' AS type;
> 
> newuserregs = FILTER main_set BY blah == 'c' AND meh == 'd';
> newuserregs_time = FOREACH dislikes GENERATE date, 'newuserregs' as type;
> ...
> 
> all_time = UNION likes_time, dislikes_time, newuserregs_time;
> …
> 
> As you can see, what I'm doing is filtering the main_set repeatedly and 
> generating based on that, and then unioning everything back together. This 
> means a lot of extra map jobs, which is a lot of extra work. Really, thinking 
> about it in terms of mapping, I should be able to do things in one run. Any 
> idea what the pig syntax would be for that? Is there something like a 
> GENERATE conditional, where I could do something like:
> 
> all_time = FOREACH main_set GENERATE date, 'likes' IF (blah == 'a' AND meh == 
> 'b')
>   
>   'dislikes' IF (blah == 'b' AND meh == 'c')
>   
>   'dislikes' IF (blah == 'c' AND meh == 'd') AS type;
> 
> Running this in just one map job would be very awesome and would speed this 
> script up a ton, I'm thinking. Ideas? Advice?
> 
> Eli



Re: About full pipeline between pig jobs

2012-10-22 Thread Alan Gates
At this point, no.  In the current MapReduce infrastructure it would take a lot 
of hackery that breaks the MR abstraction to make this work[1].  This is one 
thing we'd like to do as we move Pig to work on Hadoop 2.0 (aka YARN) where it 
is easier for applications to build these types of features.

[1]  Details on why this is so:  Assume you want to pipeline two jobs.  When 
job 1 gets to its reducers, it has to pause until job 2 starts, because it 
can't know where job 2's map tasks will run a priori.  Job 1's reducer has to 
be able to handle the case where job 2's map task fails and it needs to restart 
the streaming, which means it has to spool to HDFS anyway.  In the same way job 
2's map tasks need to be able to handle failure and restart of job 1's reducer 
(which is easier, they could just die).  Plus you need to handle the 
possibility of deadlocks (i.e., so much of your cluster or your user's quota may 
be taken up by job 1 that job 2 will never start or get enough map tasks until 
job 1 ends).  Current MapReduce strongly discourages intertask communication 
for exactly these reasons.

Alan.

On Oct 22, 2012, at 3:34 AM, W W wrote:

> Hello,
> 
> I wonder if M/R jobs compiled from pig script support pipeline between jobs.
> 
> For example, let's assume there  are 5 independent consecutive M/R jobs
> doing some joining and aggregating task.
> My question is: can one job be started before its previous job has finished, so
> that the previous job doesn't need to write all the output data from reduce
> to HDFS? I just can't find any material talking about this.
> 
> I think  Abinitio is a good example for the full pipeline architecture.
> 
> Thanks & Regards
> Stephen



Re: There's nothing like an "#include" statement for splicing common text into a pig script, right?

2012-10-11 Thread Alan Gates
See http://pig.apache.org/docs/r0.10.0/cont.html#import-macros
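
The usual pattern is to put the shared load statement into a macro file and import it everywhere, roughly like this (the file, path, and field names are hypothetical):

-- common_loaders.macro
DEFINE load_events() RETURNS events {
    $events = LOAD '/data/events' AS (user:chararray, ts:long, url:chararray);
};

-- in each script
IMPORT 'common_loaders.macro';
events = load_events();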

Alan.

On Oct 11, 2012, at 7:36 AM, Trager Corey wrote:

> Several scripts start by loading the same file.  I'd like to have the text 
> for the field names and types in one place.  Doable?
> 
> 



Re: Decide if function is algebraic at planning phase

2012-10-09 Thread Alan Gates
There is one way you could shoe-horn this in.  EvalFuncs can implement 
getArgToFuncMapping(), which is built to allow functions to pick a different 
instance of themselves for different types (e.g. SUM(long) vs SUM(double)).  
You could implement your logic in this function and then return an EvalFunc 
with or without Algebraic implemented based on your choice.

Alan.

On Oct 8, 2012, at 12:01 PM, Ugljesa Stojanovic wrote:

> I would like to be able to decide if I want to use the Algebraic or regular
> implementation of an EvalFunc on the front end (planning phase), preferably
> in the function constructor. Is there any way to do this? If I implement
> the interface the planner will automatically attempt to use it. Returning
> null when implementing it in those cases also doesn't work.
> 
> Thanks,
> Ugljesa



Re: Question about UDFs and tuple ordering

2012-10-05 Thread Alan Gates
Many operators, such as join and group by, are not implemented by a single 
physical operation.  Also, they are spread through the code as they have 
logical components and physical components.  The logical components of join are 
in org.apache.pig.newplan.logical.relational.LOJoin.java.  That gets translated 
to three physical operators, POLocalRearrange, POPackage, and POForeach.  All 
of the physical operators are in 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators

Alan.

On Oct 5, 2012, at 11:01 AM, Brian Stempin wrote:

> Thanks Russell -- That's really useful.
> 
> Just for kicks and giggles:  Where would I look in the code base to see how 
> the JOIN keyword is implemented?  I've found the built in functions, but not 
> the keywords (JOIN, GROUP, etc).  Perhaps that would give me some hints.  
> Perhaps it'll show me that a UDF might not be the best option for my set of 
> problems.
> 
> Thanks once again,
> Brian
> 
> 



Re: Loading text file

2012-10-03 Thread Alan Gates
There is not a pre-built load function to do that.  In fact I am not aware of a 
Hadoop InputFormat that does that.  So you would first need to subclass 
Hadoop's FileInputFormat and then write a LoadFunc.  Both should be fairly 
straightforward since all you need to do is remove the record and field 
parsing from existing code.

Alan.

On Oct 3, 2012, at 9:49 AM, JAGANADH G wrote:

> Hi All
> 
> Is there any way to load a text file as single record (text:chararray) in
> Pig.
> 
> I am trying to load a bunch of text files from a directory . But it keeps
> each line as single record.
> 
> 
> -- 
> **
> JAGANADH G
> http://jaganadhg.in
> *ILUGCBE*
> http://ilugcbe.org.in



Re: Using matches in generate clause?

2012-09-27 Thread Alan Gates
In Pig 0.9 boolean was not yet a first class data type, so boolean types were 
not allowed in foreach statements.  In Pig 0.10 boolean became a first class 
type, so expressions that return booleans (such as matches) should work.
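
That is, on 0.10 something close to the original statement should work (untested):

b = FOREACH html_pages GENERATE portal_id, (html matches 'some pattern') AS wp_match;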

Alan.


On Sep 27, 2012, at 10:34 AM, pablomar wrote:

> no idea why, but matches works with FILTER but it doesn't with FOREACH
> I've tried with pig 0.9.2
> 
> example (this works):
> b = filter html_pages by html matches 'some pattern';
> 
> 
> if you still want to do it with foreach, you can write your UDF, something
> like:
> 
> public class MyMatch extends EvalFunc<Boolean>
> {
>  public Boolean exec(Tuple input) throws IOException
>  {
>try
>{
>  String pattern = (String)input.get(0);
>  String value = (String)input.get(1);
> 
>  return value.matches(pattern);
>}
>catch(Exception e)
>{
>  throw WrappedIOException.wrap("ouch!", e);
>}
>  }
> }
> 
> 
> and use it just like this:
> 
> b = foreach html_pages generate portal_id, MyMatch('some pattern', html) as
> wp_match;
> 
> 
> 
> 
> On Thu, Sep 27, 2012 at 12:38 PM, Alan Gates  wrote:
> 
>> What version of Pig are you using?
>> 
>> Alan.
>> 
>> On Sep 27, 2012, at 8:54 AM, James Kebinger wrote:
>> 
>>> Hello, I'm having some trouble doing something I thought would be easy:
>> I'd
>>> like to use matches to generate a boolean flag but this seems to not
>>> compile:
>>> 
>>> FOREACH html_pages GENERATE portal_id, html matches 'some pattern' as
>>> wp_match:boolean;
>>> 
>>> I've tried wrapping it in parens too, with no luck.
>>> 
>>> Is this possible, or am I out of luck?
>>> 
>>> thanks
>> 
>> 



Re: Using matches in generate clause?

2012-09-27 Thread Alan Gates
What version of Pig are you using?

Alan.

On Sep 27, 2012, at 8:54 AM, James Kebinger wrote:

> Hello, I'm having some trouble doing something I thought would be easy: I'd
> like to use matches to generate a boolean flag but this seems to not
> compile:
> 
> FOREACH html_pages GENERATE portal_id, html matches 'some pattern' as
> wp_match:boolean;
> 
> I've tried wrapping it in parens too, with no luck.
> 
> Is this possible, or am I out of luck?
> 
> thanks



Re: How can I access secure HBase in UDF

2012-09-25 Thread Alan Gates
You can use the UDFContext to pass information to the UDF through the JobConf 
without writing files.  
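
A rough sketch of the pattern (the class and property names are made up, and error handling is omitted):

import java.io.IOException;
import java.util.Properties;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.UDFContext;

public class SecureHBaseLookup extends EvalFunc<String> {
    private static final String TOKEN_KEY = "my.hbase.token";

    public SecureHBaseLookup(String serializedToken) {
        // Front end: stash the credential; Pig serializes these properties
        // into the job configuration when the job is launched.
        Properties props = UDFContext.getUDFContext()
                                     .getUDFProperties(SecureHBaseLookup.class);
        props.setProperty(TOKEN_KEY, serializedToken);
    }

    public String exec(Tuple input) throws IOException {
        // Back end: the same properties are available inside the tasks.
        String token = UDFContext.getUDFContext()
                                 .getUDFProperties(SecureHBaseLookup.class)
                                 .getProperty(TOKEN_KEY);
        // ... use the token when setting up the HBase connection ...
        return token == null ? null : "token-present";
    }
}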

Alan.

On Sep 25, 2012, at 10:48 AM, Rohini Palaniswamy wrote:

> Ray,
>   Looking at the EvalFunc interface, I can not see a way or loophole to do
> it.  EvalFunc does not have a reference to Job or JobConf object to add
> credentials to it. It has getCacheFiles() to add files to DistributedCache,
> but no method to add credentials. We should probably add one. The not so
> nice workaround I can think of is to add the token as a file
> to DistributedCache using getCacheFiles() and read the file yourself in
> EvalFunc and use it in communication with HBase.
> 
> Regards,
> Rohini
> 
> On Tue, Sep 25, 2012 at 1:51 AM, Ray  wrote:
> 
>> Hi,
>> 
>> I have a requirement to access HBase in UDF. But the HBase is configured to
>> be secure, which needs a credential when being connected in a mapreduce
>> job.
>> I see you have added support of secure HBase in HBaseStorage
>> https://issues.apache.org/jira/browse/PIG-2821?attachmentSortBy=dateTime
>> But my UDF is an EvalFunc not Loader.
>> 
>> Could you tell me how I can achieve this? Or is there any way I can add the
>> credential in Job when the job is setup in backend?
>> 
>> Thanks,
>> Ray
>> 



Re: Removing unnecessary disambiguation marks

2012-09-18 Thread Alan Gates
The added foreach will not generate another MR job.  

Alan.

On Sep 18, 2012, at 8:54 AM, Ruslan Al-Fakikh wrote:

> Hey,
> 
> You can try cleaning in a separate FOREACH. I don't think it'll
> trigger another MR job, but you better check it.
> Example:
> resultCleaned = FOREACH result GENERATE
>   name::group::fieldName AS fieldName;
> 
> Ruslan
> 
> On Tue, Sep 18, 2012 at 3:01 AM, Robert Yerex
>  wrote:
>> Probably an easy one but...
>> 
>> After processing a file through a series of groupings, aggreagtions and
>> projections using flatten I end up with long concatenated names for each
>> field shown in this snippre t from the JsonStorage generated schema
>> 
>>{
>> 
>> "name"
>> :"enrollments_instructor_1::enrollments_student_3::enrollments_student_2::enrollments_student_1::enrollments_section::enrollments::term::term_id"
>> ,
>> 
>>"type":55,
>> 
>>"description":"autogenerated from Pig Field Schema",
>> 
>>"schema":null
>> 
>>},
>> How do I get rid of all the concatenated naming?
>> 
>> --
>> Robert Yerex
>> Data Scientist
>> Civitas Learning
>> www.civitaslearning.com
>> 
>> 
>> 
>> 
>> --
>> Robert Yerex
>> Data Scientist
>> Civitas Learning
>> www.civitaslearning.com



Re: How to force the script finish the job and continue the follow script?

2012-09-16 Thread Alan Gates
'exec' will force your job to start.  However, I strongly doubt this will solve 
your OOME problem, as some part of your job is running out of memory.  
Whichever part that is will still fail I suspect.  Pig jobs don't generally 
accrue memory as they go since most memory intensive operations are done in 
their own task.  If you can isolate the part of your script that is causing an 
OOME (which exec should help with) and send that portion to the list we may be 
able to help figure out what's causing the issue.

Alan.

On Sep 15, 2012, at 10:52 PM, Haitao Yao wrote:

> Hi, all
>   I forgot the keyword which force Pig to finish the job and then 
> continue the following script.
>   My job failed because of OOME, so I want to split the jobs into smaller 
> ones but still written in a single pig script(because the script is 
> generated) .
>   Is there any keywords that can achieve this?
>   thanks.
> 
> 
> 
> Haitao Yao
> yao.e...@gmail.com
> weibo: @haitao_yao
> Skype:  haitao.yao.final
> 



Re: access schema defined in LOAD statement in custom LoadFunc?

2012-09-15 Thread Alan Gates
Unfortunately, no.  I agree we should add that to the LoadFunc interface.

Alan.

On Sep 15, 2012, at 1:13 AM, Jim Donofrio wrote:

> Is there anyway within a LoadFunc to access the schema that a user defines 
> after AS in a LOAD statement? Is there some property I can access in the 
> UDFContext or ? pushProjection provides the schema from a FOREACH GENERATE 
> and getSchema seems to be only meant to read part of the actual file or a 
> JSON schema file off disk.



Re: Json and split into multiple files

2012-09-12 Thread Alan Gates
I don't understand your use case or why you need to use exec or outputSchema.  
Would it be possible to send a more complete example that makes clear why you 
need these?

Alan.

A tuple can contain a tuple, so it's certainly possible with outputSchema() to 
generate a schema that declares both your tuples.  But I don't think this 
answers your questions.

On Sep 7, 2012, at 10:21 AM, Mohit Anchlia wrote:

> It looks like I can use outputSchema(Schema input) call to do this. But
> examples I see are only for one tuple. In my case if I am reading it right
> I need tuple for each dimension and hence schema for each. For instance
> there'll be one user tuple and then product tuple for instance. So I need
> schema for each.
> 
> How can I do this using outputSchema such that result is like below where I
> can access each tuple and field that is a named field? Thanks for your help
> 
> A = load 'inputfile' using JsonLoader() as (user: tuple(id: int, name:
> chararray), product: tuple(id: int, name:chararray))
> 
> On Tue, Sep 4, 2012 at 8:37 PM, Mohit Anchlia wrote:
> 
>> I have a Json something like:
>> 
>> {
>> user{
>> id : 1
>> name: user1
>> }
>> product {
>> id: 1
>> name: product1
>> }
>> }
>> 
>> I want to be able to read this file and create 2 files as follows:
>> 
>> user file:
>> key,1,user1
>> 
>> product file:
>> key,1,product1
>> 
>> I know I need to call exec but the method will return Bags for each of
>> these dimensions.  But since it's all unordered how do I split it further
>> to write them to separate files?
>> 



Re: Storing field in a bag

2012-09-10 Thread Alan Gates
You can achieve equivalent functionality by saying:

page = foreach b generate page;
store page into '/flume_vol/flume/input/page.dat';
network = foreach b generate network;
store network into '/flume_vol/flume/input/network.dat';

Alan.
On Sep 10, 2012, at 4:05 PM, Ruslan Al-Fakikh wrote:

> Hi, Mohit,
> 
> http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#STORE
> I guess you can only STORE relations, not fields, etc
> 
> Ruslan
> 
> On Mon, Sep 10, 2012 at 9:53 PM, Mohit Anchlia  wrote:
>> I am trying to store field in a bag command but it fails with
>> 
>> store b.page into '/flume_vol/flume/input/page.dat';
>> store b.network into '/flume_vol/flume/input/network.dat';
>> 
>> B: {b: {(page: chararray,network: chararray,sysinfo:
>> chararray,trafficsource: chararray,search: chararray)}}
>> 2012-09-10 10:45:54,293 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>> ERROR 1200:   mismatched input
>> '.' expecting INTO
>> Details at logfile: /root/.mohit/pigtest/pig_1347299107910.log
>> 
>> --
>> 
>> Can I do this without using foreach?



Re: Json and split into multiple files

2012-09-06 Thread Alan Gates
Loading the JSON below should give you a Pig record like:
(user: tuple(id: int, name: chararray), product: tuple(id: int, name:chararray))

In that case your Pig Latin would look like:

A = load 'inputfile' using JsonLoader() as (user: tuple(id: int, name: 
chararray), product: tuple(id: int, name:chararray))
B = foreach A generate user.id, user.name;
store B into 'userfile';
C = foreach A generate product.id, product.name;
store C into 'productfile';

I'm not sure what key is, so I'm not sure whether the above is what you're thinking 
of or not.

Alan.

On Sep 5, 2012, at 12:04 PM, Mohit Anchlia wrote:

> Any pointers would be appreciated
> 
> On Tue, Sep 4, 2012 at 8:37 PM, Mohit Anchlia wrote:
> 
>> I have a Json something like:
>> 
>> {
>> user{
>> id : 1
>> name: user1
>> }
>> product {
>> id: 1
>> name: product1
>> }
>> }
>> 
>> I want to be able to read this file and create 2 files as follows:
>> 
>> user file:
>> key,1,user1
>> 
>> product file:
>> key,1,product1
>> 
>> I know I need to call exec but the method will return Bags for each of
>> these dimensions.  But since it's all unordered how do I split it further
>> to write them to separate files?
>> 



Re: Count of all the rows

2012-09-04 Thread Alan Gates
Expressions in Pig can only operate on tuples and bags.  A tuple is a single record.  
It has a defined number of fields.  Those fields are in a defined order.  They 
can be given names and types may be assigned to them.  Thus it is reasonable to 
speak of the 3rd field or the field named "user".  That field may be of any 
data type supported by Pig.  Constant tuples are denoted by ().

A bag is an unordered collection of tuples.   So you can never say the "3rd 
tuple of a bag" as it has no meaning.  We do sometimes get sloppy and discuss a 
schema of a bag, but what we really mean is a schema that applies to all tuples 
inside the bag.  Constant bags are denoted by {}.

Given these definitions it seems like what we assign to on the left hand side 
of Pig Latin scripts could be thought of as bags, since they are (usually 
unordered) collections of tuples.  However, there is a distinction.  You cannot 
(usually) use these in expressions such as COUNT().  And bags cannot be 
assigned to nor used in places where you would expect a relation name.  Thus we 
distinguish these by calling them relations.  So in the script:
A = load 'foo';
B = group A by name;
C = foreach B generate name, COUNT(A);

A is playing two roles.  In the first and second lines it is a relation.  In 
the third line it is a bag named after the relation it came from.

All of this gets a little fuzzier when you consider nested foreach operators, 
but I've ignored that for now.  Hope this helps.

Alan.

On Aug 30, 2012, at 9:57 AM, Mohit Anchlia wrote:

> I looked at definition of Relation which says:
> 
> 
> A relation is a bag (more specifically, an outer bag).
> If relation is a bag then what's the difference between a Bag and Relation.
> I am getting a bit confused by the definitions. In the example below, what would
> be a Relation, Tuple, or Bag?
> 
> (1,2,3,4)
> 
> Is 1,2,3,4 without the "(" a tuple? Then what is a Relation or a Bag?
> 
> On Wed, Aug 29, 2012 at 4:51 PM, Jonathan Coveney wrote:
> 
>> COUNT is a UDF that takes in a Bag and outputs a Double.
>> 
>> Relations are not Bags, so that's one way of thinking about it. But of
>> course, we could have coerced the syntax to make it work.
>> 
>> I like to think of it as such:
>> 
>> A foreach is a transformation on the rows of a relation. Thus, applying
>> COUNT directly to a relation doesn't make any sense, since you're doing an
>> aggregate transformation. This is why grouping is necessary. you're putting
>> all of the rows of the relation into one row (with the catch-all key
>> "all"), so that you can run a function on them.
>> 
>> Don't know if that helps.
>> 
>> 2012/8/29 Mohit Anchlia 
>> 
>>> Thanks! Why is grouping necessary? Is it to send it to the reducer?
>>> 
>>> On Wed, Aug 29, 2012 at 4:03 PM, Alan Gates 
>> wrote:
>>> 
>>>> A = load 'foo';
>>>> B = group A all;
>>>> C = foreach B generate COUNT(A);
>>>> 
>>>> Alan.
>>>> On Aug 29, 2012, at 3:51 PM, Mohit Anchlia wrote:
>>>> 
>>>>> How do I get count of all the rows? All the examples of COUNT use
>> group
>>>> by.
>>>> 
>>>> 
>>> 
>> 



Re: Count of all the rows

2012-08-29 Thread Alan Gates
Even in SQL when you do select count(*) you are actually grouping, the language 
just hides it from you.  

Each map/combiner counts the number of records it sees and sends that count to 
the reducer which sums the counts.  

Alan.

On Aug 29, 2012, at 4:41 PM, Mohit Anchlia wrote:

> Thanks! Why is grouping necessary? Is it to send it to the reducer?
> 
> On Wed, Aug 29, 2012 at 4:03 PM, Alan Gates  wrote:
> 
>> A = load 'foo';
>> B = group A all;
>> C = foreach B generate COUNT(A);
>> 
>> Alan.
>> On Aug 29, 2012, at 3:51 PM, Mohit Anchlia wrote:
>> 
>>> How do I get count of all the rows? All the examples of COUNT use group
>> by.
>> 
>> 



Re: Count of all the rows

2012-08-29 Thread Alan Gates
A = load 'foo';
B = group A all;
C = foreach B generate COUNT(A);

Alan.
On Aug 29, 2012, at 3:51 PM, Mohit Anchlia wrote:

> How do I get count of all the rows? All the examples of COUNT use group by.



Re: Help with Log Processing

2012-08-24 Thread Alan Gates
The issue you're going to run into is that Pig's default load function uses 
FileInputFormat, which always divides records on line end.  You could clone 
FileInputFormat and twiddle your version to break on paragraph ends instead of 
line ends.  You could then make a version of PigStorage that uses your new 
InputFormat instead of FileInputFormat.

Alan.

On Aug 20, 2012, at 12:42 AM, Siddharth Tiwari wrote:

> 
> Hi Friends.
> I have a set of logs in the following format
> 
> 2012-07-22-22.44.46.649189   Instance:pvdd143   Node:000
> PID:23068894(db2agent (PVSS143D) 0)   TID:9884   
> Appid:*LOCAL.pvdd143.120723053935
> relation data serv  sqlrreorg_index_obj Probe:555   Database:PVSS143D
> ADM9520I  Reorganizing partitioned index IID "2" (OBJECTID "13") in table 
> space
> "SITIN003" (ID "5") for data partition "8" of table "TITIN00 .ITNRY_XFER_STA"
> (ID "-32767") in table space "SITIN003" (ID "-6").
> ^^
> 2012-07-22-22.44.46.649615   Instance:pvdd143   Node:000
> PID:23068894(db2agent (PVSS143D) 0)   TID:9884   
> Appid:*LOCAL.pvdd143.120723053935
> relation data serv  sqlrreorg_index_obj Probe:555   Database:PVSS143D
> ADM9520I  Reorganizing partitioned index IID "3" (OBJECTID "13") in table 
> space
> "SITIN003" (ID "5") for data partition "8" of table "TITIN00 .ITNRY_XFER_STA"
> (ID "-32767") in table space "SITIN003" (ID "-6").
> 
> 
> I need to read each paragraph at once rather than one line so that I can 
> establish a relationship between each logged para.
> Please help, how to achieve it in PIG.
> 
> 



Re: add a field, ordered

2012-08-23 Thread Alan Gates
Take a look at https://issues.apache.org/jira/browse/PIG-2353.  I believe that's 
the JIRA where they're doing the work.

Alan.

On Aug 14, 2012, at 3:38 AM, Lauren Blau wrote:

> Is the source for it available in the development area? I'd be happy to
> help if I can.
> Lauren
> 
> On Tue, Aug 14, 2012 at 6:05 AM, Gianmarco De Francisci Morales <
> g...@apache.org> wrote:
> 
>> Hi,
>> 
>> We are finalizing a feature that would solve your problems, something like
>> ROW_NUMBER in some SQL dialect, we call it RANK.
>> This operator will add a unique consecutive row number to each tuple in the
>> relationship.
>> Then you will be able to join the two relationships on the rank field.
>> 
>> For the moment being, however, I think there is no easy way to achieve what
>> you want to do.
>> 
>> Cheers,
>> --
>> Gianmarco
>> 
>> 
>> 
>> On Tue, Aug 14, 2012 at 11:55 AM, Lauren Blau <
>> lauren.b...@digitalreasoning.com> wrote:
>> 
>>> I  want to match up tuples from 2 relations. For each key, the 2
>> relations
>>> will always have the same number of tuples and match by position (the
>> first
>>> tuple in each are a match, the second tuple in each, etc).
>>> 
>>> so if I have
>>> relation1 = 5,9,7
>>> relation2 = z,a,d
>>> 
>>> I want to end up with
>>> 
>>> relation3 = (5,z),(9,a),(7,d)
>>> 
>>> I figure I need a way to generate a matching key on the ordered tuples of
>>> the relations and then do a cogroup. But I'm stuck on generating the key.
>>> Since adding a field is a project, I assume this has to be done as part
>> of
>>> a foreach loop. But I'm not sure how I can maintain the order while
>> adding
>>> a field to each tuple.
>>> 
>>> ideas?
>>> Thanks,
>>> lauren
>>> 
>> 



Re: Fallback for output data storage

2012-08-23 Thread Alan Gates
You can simply store the data twice at the end of your script.  Pig will split 
it and send it to both.  It shouldn't fail the HDFS storage if the dbstorage 
fails (but test this first to make sure I'm correct.)

So your script would look like:

A = load ...
store Z into 'db' using DBStorage();
store Z into '/data/fallback';

Alan.

On Aug 23, 2012, at 4:38 AM, Markus Resch wrote:

> Hi everyone,
> 
> we are planing to put our aggregations result into an external data
> base. To handle a connection failure to that external resource properly
> we currently store the result onto the hdfs and sync it to the db after
> that by a second pig script using the db's manufacturers pig data
> storage. We do that because we can hardly afford to redo all the
> aggregations in case of an error at the very end of the aggregation. 
> 
> If we could define a fallback data storage (e.g. to
> the hdfs) that will be used in case of a connection issue, we could drop
> that entire second step and save a lot of effort. 
> Is there anything like this?
> 
> Kind Regards 
> 
> Markus
> 



Re: Issues with Bincond

2012-08-22 Thread Alan Gates
Use "is null" instead of "== null".  Equality, inequality, boolean, and 
arithmetic operators that encounter a null returning null is standard trinary 
logic.  The only possible answer to "is this equal to an unknown" is "unknown".
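
With that change the original statement behaves as expected:

b = FOREACH a GENERATE col1, (col1 is null ? 'null' : 'not-null') as col2;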

Alan.

On Aug 22, 2012, at 11:43 AM, Alex Rovner wrote:

> Thanks Cheolsoo,
> 
> Not very intuitive but makes sense.
> 
> 
> On Wed, Aug 22, 2012 at 2:37 PM, Cheolsoo Park wrote:
> 
>> Hi Alex,
>> 
>> I think that that's expected. The Pig manual says the following
>> regarding comparison
>> operators (e.g. ==):
>> 
>> If either sub-expression is null, the result is null.
>> 
>> 
>> So "col1 == null" is null.
>> 
>> Now it also says the following regarding arithmetic operators (e.g. ?):
>> 
>> If either sub-expression is null, the resulting expression is null.
>> 
>> 
>> So "col1 == null ? 'null' : 'not-null'" is null as "col1 == null" is null.
>> 
>> Here is the link:
>> http://pig.apache.org/docs/r0.10.0/basic.html#nulls
>> 
>> Thanks,
>> Cheolsoo
>> 
>> On Wed, Aug 22, 2012 at 11:28 AM, Alex Rovner 
>> wrote:
>> 
>>> I am having trouble with bincond in pig 11.
>>> 
>>> Sample input:
>>> 1234
>>> 0
>>> 1234
>>> 
>>> Sample pig script:
>>> a = LOAD 'input.txt' as (col1:int);
>>> 
>>> b = FOREACH a GENERATE col1, (col1 == null ? 'null' : 'not-null') as
>> col2;
>>> 
>>> dump b;
>>> 
>>> 
>>> Output:
>>> (1234,)
>>> (0,)
>>> (1234,)
>>> 
>>> 
>>> Certainly not what you expect to see... I expected to see 'not-null'
>> string
>>> in the second column.
>>> If I change the bincond to look for a particular value then everything
>>> works as expected:
>>> 
>>> b = FOREACH a GENERATE col1, (col1 == 1234 ? 'null' : 'not-null') as
>> col2;
>>> 
>>> Output:
>>> (1234,null)
>>> (0,not-null)
>>> (1234,null)
>>> 
>>> 
>>> Any ideas? I did not get a chance to test this with prior versions.
>>> 
>>> Thanks
>>> Alex
>>> 
>> 



Re: Pig as Connector with MongoDB and Node.js

2012-08-22 Thread Alan Gates

On Aug 21, 2012, at 11:48 PM, Santhosh M S wrote:

> 
> 
> Could we repost the entire blog and indicate that this blog originally 
> appeared here with the here being a hyperlink to the corporate blog without 
> mentioning the name of the corporation.
> 
> More thoughts?

I'm +1 on this.

Alan.


> 
> Santhosh
> 
> 
> 
> From: Russell Jurney 
> To: "user@pig.apache.org"  
> Sent: Tuesday, August 21, 2012 7:17 PM
> Subject: Re: Pig as Connector with MongoDB and Node.js
> 
> I like the idea of re-blogging the entire thing with a link back to
> the company. Blogs take time, and time is money, so posting to the Pig
> blog first isn't likely. Even personal posts about Pig on my blog
> datasyndrome.com, I'd rather post them on my blog and reblog/link back
> on the Pig blog. This is consistent with common practice.
> 
> The real point here is to get a common place to recognize, index and
> distribute blog post HOWTOs as documentation. If there's value in the
> post, we should reblog it with a link back.
> 
> Russell Jurney http://datasyndrome.com
> 
> On Aug 21, 2012, at 7:03 PM, Alan Gates  wrote:
> 
>> Are you saying we should only post things to the Pig blog that isn't already 
>> on a corporate blog?  I'm not sure that's going to fly, since companies pay 
>> people to write blogs for them.  They aren't going to be excited to publish 
>> on Apache first.
>> 
>> If we don't feel comfortable posting things on the Pig blog that have 
>> already been posted on a corporate blog we could instead post very short 
>> blogs entries that say something like "A blog on X has been posted over on 
>> http://Y, go take a look".
>> 
>> Alan.
>> 
>> On Aug 17, 2012, at 3:28 PM, Santhosh M S wrote:
>> 
>>> Thanks Alan! I went to that site before my previous email and I did not 
>>> find anything and hence the post.
>>> 
>>> If the content has no association with any corporation, we should first 
>>> post it on the Apache Pig Blog and then cross post it on the corporate 
>>> blog. This way, we can decouple the community interests from the corporate 
>>> interests.
>>> 
>>> Thoughts?
>>> 
>>> Santhosh
>>> 
>>> 
>>> 
>>> From: Alan Gates 
>>> To: user@pig.apache.org
>>> Sent: Friday, August 17, 2012 3:20 PM
>>> Subject: Re: Pig as Connector with MongoDB and Node.js
>>> 
>>> http://blogs.apache.org/pig/
>>> 
>>> We don't have any posts there yet.
>>> 
>>> Alan.
>>> 
>>> On Aug 17, 2012, at 3:15 PM, Santhosh M S wrote:
>>> 
>>>> Before we post the blog, can someone post the URL for the Apache Pig Blog. 
>>>> Search engine queries are not returning anything useful.
>>>> 
>>>> Thanks,
>>>> Santhosh
>>>> 
>>>> 
>>>> 
>>>> From: Jonathan Coveney 
>>>> To: user@pig.apache.org
>>>> Sent: Friday, August 17, 2012 12:09 PM
>>>> Subject: Re: Pig as Connector with MongoDB and Node.js
>>>> 
>>>> I'm ok with that as long as it is clear it came from a corporate blog, and
>>>> of course, if people feel uncomfortable they should voice that opinion.
>>>> 
>>>> I think it is good to show that a variety of people use Pig, and I mean,
>>>> it's not really a surprise that Pig is developed, used, and promoted by
>>>> corporations :)
>>>> 
>>>> 2012/8/17 Alan Gates 
>>>> 
>>>>> I'm happy to repost these kinds of blog entries on the Pig blog.  But one
>>>>> thing we as a community need to decide is how we want to handle references
>>>>> to corporate blogs.  My proposal would be that any entries supporting and
>>>>> promoting Apache Pig should be allowed.  But I have an obvious conflict of
>>>>> interest here, so I'd like to get other people's inputs.
>>>>> 
>>>>> Alan.
>>>>> 
>>>>> On Aug 16, 2012, at 3:07 PM, Russell Jurney wrote:
>>>>> 
>>>>>> I wrote a Pig tutorial to publish data with Mongo and Node.js.
>>>>>> 
>>>>>> 
>>>>> http://hortonworks.com/blog/pig-as-connector-part-one-pig-mongodb-and-node-js
>>>>>> 
>>>>>> Is it possible to reblog on the Pig blog?
>>>>>> 
>>>>>> Russell Jurney
>>>>>> twitter.com/rjurney
>>>>>> russell.jur...@gmail.com
>>>>>> datasyndrome.com



Re: Pig as Connector with MongoDB and Node.js

2012-08-21 Thread Alan Gates
Are you saying we should only post things to the Pig blog that aren't already on 
a corporate blog?  I'm not sure that's going to fly, since companies pay people 
to write blogs for them.  They aren't going to be excited to publish on Apache 
first.  

If we don't feel comfortable posting things on the Pig blog that have already 
been posted on a corporate blog, we could instead post very short blog entries 
that say something like "A blog on X has been posted over on http://Y, go take 
a look".

Alan.

On Aug 17, 2012, at 3:28 PM, Santhosh M S wrote:

> Thanks Alan! I went to that site before my previous email and I did not find 
> anything and hence the post.
> 
> If the content has no association with any corporation, we should first post 
> it on the Apache Pig Blog and then cross post it on the corporate blog. This 
> way, we can decouple the community interests from the corporate interests.
> 
> Thoughts?
> 
> Santhosh
> 
> 
> 
> From: Alan Gates 
> To: user@pig.apache.org 
> Sent: Friday, August 17, 2012 3:20 PM
> Subject: Re: Pig as Connector with MongoDB and Node.js
> 
> http://blogs.apache.org/pig/
> 
> We don't have any posts there yet.
> 
> Alan.
> 
> On Aug 17, 2012, at 3:15 PM, Santhosh M S wrote:
> 
>> Before we post the blog, can someone post the URL for the Apache Pig Blog. 
>> Search engine queries are not returning anything useful.
>> 
>> Thanks,
>> Santhosh
>> 
>> 
>> 
>> From: Jonathan Coveney 
>> To: user@pig.apache.org 
>> Sent: Friday, August 17, 2012 12:09 PM
>> Subject: Re: Pig as Connector with MongoDB and Node.js
>> 
>> I'm ok with that as long as it is clear it came from a corporate blog, and
>> of course, if people feel uncomfortable they should voice that opinion.
>> 
>> I think it is good to show that a variety of people use Pig, and I mean,
>> it's not really a surprise that Pig is developed, used, and promoted by
>> corporations :)
>> 
>> 2012/8/17 Alan Gates 
>> 
>>> I'm happy to repost these kinds of blog entries on the Pig blog.  But one
>>> thing we as a community need to decide is how we want to handle references
>>> to corporate blogs.  My proposal would be that any entries supporting and
>>> promoting Apache Pig should be allowed.  But I have an obvious conflict of
>>> interest here, so I'd like to get other people's inputs.
>>> 
>>> Alan.
>>> 
>>> On Aug 16, 2012, at 3:07 PM, Russell Jurney wrote:
>>> 
>>>> I wrote a Pig tutorial to publish data with Mongo and Node.js.
>>>> 
>>>> 
>>> http://hortonworks.com/blog/pig-as-connector-part-one-pig-mongodb-and-node-js
>>>> 
>>>> Is it possible to reblog on the Pig blog?
>>>> 
>>>> Russell Jurney
>>>> twitter.com/rjurney
>>>> russell.jur...@gmail.com
>>>> datasyndrome.com



Re: runtime exception when load and store multiple files using avro in pig

2012-08-21 Thread Alan Gates
Moving it into core makes sense to me, as Avro is a format we should be 
supporting.

Alan.

On Aug 21, 2012, at 6:03 PM, Cheolsoo Park wrote:

> Hi Dan,
> 
> Glad to hear that it worked. I totally agree that AvroStorage can be
> improved. In fact, it was written for Pig 0.7, so it can be written much
> nicer now.
> 
> Only concern that I have is backward compatibility. That is, if I change
> syntax (I wanted so badly while working on AvroStorage recently), it will
> break backward compatibility. What I have been thinking is to
> rewrite AvroStorage in core Pig like HBaseStorage. For
> backward compatibility, we may keep the old version in Piggybank for a
> while and eventually retire it.
> 
> I am wondering what other people think. Please let me know if it is not a
> good idea to move AvroStorage to core Pig from Piggybank.
> 
> Thanks,
> Cheolsoo
> 
> On Tue, Aug 21, 2012 at 5:47 PM, Danfeng Li  wrote:
> 
>> Thanks, Cheolsoo. That solve my problems.
>> 
>> It will be nice if pig can do this automatically when there are multiple
>> avrostorage in the code. Otherwise, we have to manually track the numbers.
>> 
>> Dan
>> 
>> -Original Message-
>> From: Cheolsoo Park [mailto:cheol...@cloudera.com]
>> Sent: Tuesday, August 21, 2012 5:06 PM
>> To: user@pig.apache.org
>> Subject: Re: runtime exception when load and store multiple files using
>> avro in pig
>> 
>> Hi Danfeng,
>> 
>> The "long" is from the 1st AvroStorage store in your script. The
>> AvroStorage has very funny syntax regarding multiple stores. To apply
>> different avro schemas to multiple stores, you have to specify their
>> "index" as follows:
>> 
>> set1 = load 'input1.txt' using PigStorage('|') as ( ... ); *store set1
>> into 'set1' using
>> org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');*
>> 
>> set2 = load 'input2.txt' using PigStorage('|') as ( .. ); *store set2 into
>> 'set2' using org.apache.pig.piggybank.storage.avro.AvroStorage('index',
>> '2');*
>> 
>> As can be seen, I added the 'index' parameters.
>> 
>> What AvroStorage does is to construct the following string in the frontend:
>> 
>> "1#<1st avro schema>,2#<2nd avro schema>"
>> 
>> and pass it to backend via UdfContext. Now in backend, tasks parse this
>> string to get output schema for each store.
>> 
>> Thanks,
>> Cheolsoo
>> 
>> On Tue, Aug 21, 2012 at 4:38 PM, Danfeng Li 
>> wrote:
>> 
>>> I run into this strange problem when try to load multiple text
>>> formatted files and convert them into avro format using pig. However,
>>> if I read and convert one file at a time in separated runs, everything
>>> is fine. The error message is following
>>> 
>>> 2012-08-21 19:15:32,964 [main] ERROR
>>> org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to
>>> recreate exception from backed error:
>>> org.apache.avro.file.DataFileWriter$AppendWriteException:
>>> java.lang.RuntimeException: Datum 1980-01-01 00:00:00.000 is not in
>>> union ["null","long"]
>>>at
>>> org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
>>>at
>>> 
>> org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvroRecordWriter.java:49)
>>>at
>>> 
>> org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.java:612)
>>>at
>>> 
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
>>>at
>>> 
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
>>>at
>>> 
>> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
>>>at
>>> 
>> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>>>at
>>> 
>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
>>>at
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGeneri
>>> cMapB
>>> 
>>> my code is
>>> set1 = load '$input_dir/set1.txt' using PigStorage('|') as (
>>>   id:long,
>>>   f1:long,
>>>   f2:chararray,
>>>   f3:float,
>>>   f4:float,
>>>   f5:float,
>>>   f6:float,
>>>   f7:float,
>>>   f8:float,
>>>   f9:float,
>>>   f10:float,
>>>   f11:float,
>>>   f12:float);
>>> store set1 into '$output_dir/set1.avro'
>>> using org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> 
>>> set2 = load '$input_dir/set2.txt' using PigStorage('|') as (
>>>   id : int,
>>>   date : chararray);
>>> store set2 into '$output_dir/set2.avro'
>>> using org.apache.pig.piggybank.storage.avro.AvroStorage();
>>> 
>>> The first file is converted fine, but the 2nd one is failed. The error
>>> is coming from the 2nd field in the 2nd file, but the strange thing is
>>> that I don't even have "long" in my schema while the error message is
>>> showing ["null","long"].
>>> 
>>> I use pig 0.10.0 and avro-1.7.1.jar.
>>> 
>

Re: Pig as Connector with MongoDB and Node.js

2012-08-17 Thread Alan Gates
http://blogs.apache.org/pig/

We don't have any posts there yet.

Alan.

On Aug 17, 2012, at 3:15 PM, Santhosh M S wrote:

> Before we post the blog, can someone post the URL for the Apache Pig Blog. 
> Search engine queries are not returning anything useful.
> 
> Thanks,
> Santhosh
> 
> 
> 
> From: Jonathan Coveney 
> To: user@pig.apache.org 
> Sent: Friday, August 17, 2012 12:09 PM
> Subject: Re: Pig as Connector with MongoDB and Node.js
> 
> I'm ok with that as long as it is clear it came from a corporate blog, and
> of course, if people feel uncomfortable they should voice that opinion.
> 
> I think it is good to show that a variety of people use Pig, and I mean,
> it's not really a surprise that Pig is developed, used, and promoted by
> corporations :)
> 
> 2012/8/17 Alan Gates 
> 
>> I'm happy to repost these kinds of blog entries on the Pig blog.  But one
>> thing we as a community need to decide is how we want to handle references
>> to corporate blogs.  My proposal would be that any entries supporting and
>> promoting Apache Pig should be allowed.  But I have an obvious conflict of
>> interest here, so I'd like to get other people's inputs.
>> 
>> Alan.
>> 
>> On Aug 16, 2012, at 3:07 PM, Russell Jurney wrote:
>> 
>>> I wrote a Pig tutorial to publish data with Mongo and Node.js.
>>> 
>>> 
>> http://hortonworks.com/blog/pig-as-connector-part-one-pig-mongodb-and-node-js
>>> 
>>> Is it possible to reblog on the Pig blog?
>>> 
>>> Russell Jurney
>>> twitter.com/rjurney
>>> russell.jur...@gmail.com
>>> datasyndrome.com
>> 



Re: Pig as Connector with MongoDB and Node.js

2012-08-17 Thread Alan Gates
I'm happy to repost these kinds of blog entries on the Pig blog.  But one thing 
we as a community need to decide is how we want to handle references to 
corporate blogs.  My proposal would be that any entries supporting and 
promoting Apache Pig should be allowed.  But I have an obvious conflict of 
interest here, so I'd like to get other people's inputs.

Alan.

On Aug 16, 2012, at 3:07 PM, Russell Jurney wrote:

> I wrote a Pig tutorial to publish data with Mongo and Node.js.
> 
> http://hortonworks.com/blog/pig-as-connector-part-one-pig-mongodb-and-node-js
> 
> Is it possible to reblog on the Pig blog?
> 
> Russell Jurney
> twitter.com/rjurney
> russell.jur...@gmail.com
> datasyndrome.com



Re: Distributed accumulator functions

2012-08-13 Thread Alan Gates

On Aug 13, 2012, at 9:05 AM, Benjamin Smedberg wrote:

> I'm a new-ish pig user querying data on an hbase cluster. I have a question 
> about accumulator-style functions.
> 
> When writing an accumulator-style UDF, is all of the data shipped to a single 
> machine before it is reduced/accumulated? For example, if I were doing to 
> write re-implement SUM as a UDF, it seems to me that it would be more 
> efficient to run SUM on each map node, and then do a sum-of-sums when 
> reducing. Is there a way to write a UDF which supports this style of 
> accumulation/aggregation?

How many reducers are involved in an operation is independent of the type of 
UDF you use.  The number of reducers is determined by the parallelism you 
declare in your script (via the parallel clause in your group statement or via 
a set default parallelism statement in your script) or by the default Pig 
chooses.  

As to whether it is more efficient to do a sum of sums, it certainly is. For 
those types of operations you should use an algebraic UDF rather than an 
accumulative.  Algebraic UDFs have initial (map), intermediate (combiner), 
and final (reducer) steps.  Accumulative UDFs are for operations that cannot be 
distributed but that only need to see the data stream once.  An example would 
be cumulative sums, where you want to return not just a final sum but a list of 
the sums as you went along.  This is order dependent and thus can't be done 
until you've collected all the values for a given key.
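
For instance, a SUM-style algebraic UDF is usually structured along these lines
(a minimal sketch, not the actual builtin SUM; the class name and the assumption
that the values are longs are mine):

import java.io.IOException;
import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MySum extends EvalFunc<Long> implements Algebraic {
    // Used when Pig cannot run the UDF algebraically.
    public Long exec(Tuple input) throws IOException { return sum(input); }

    // Tell Pig which classes to run in the map, combine, and reduce phases.
    public String getInitial()  { return Initial.class.getName(); }
    public String getIntermed() { return Intermed.class.getName(); }
    public String getFinal()    { return Final.class.getName(); }

    public static class Initial extends EvalFunc<Tuple> {
        // Map side: the input bag typically holds a single tuple; emit its
        // value as a partial sum.
        public Tuple exec(Tuple input) throws IOException {
            return TupleFactory.getInstance().newTuple(sum(input));
        }
    }

    public static class Intermed extends EvalFunc<Tuple> {
        // Combiner: sum the partial sums.
        public Tuple exec(Tuple input) throws IOException {
            return TupleFactory.getInstance().newTuple(sum(input));
        }
    }

    public static class Final extends EvalFunc<Long> {
        // Reducer: produce the final sum.
        public Long exec(Tuple input) throws IOException { return sum(input); }
    }

    // The first field of input is a bag of tuples whose first field is a
    // value (or a partial sum from an earlier phase).
    static Long sum(Tuple input) throws IOException {
        DataBag values = (DataBag) input.get(0);
        long total = 0L;
        for (Tuple t : values) {
            total += ((Number) t.get(0)).longValue();
        }
        return total;
    }
}

Pig will only take the Initial/Intermed/Final path when the statement qualifies
for the combiner, e.g. a group followed directly by a foreach that only calls
algebraic UDFs on the grouped bag and projects the group key.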

> 
> Also, is PigStorage compatible with the quoting expected by excel 
> tab-delimited files? AIUI that would require quoting the values with 
> "value\tvalue" and escaping double quotes. If this isn't the native 
> PigStorage format, is there a storage module already written which supports 
> excel-tab output?

PigStorage doesn't support escaping.  I am not aware of a storage function 
focussed on excel CSV format, but others may be.

Alan.

> 
> --BDS
> 



Re: FileAlreadyExistsException while running pig

2012-08-10 Thread Alan Gates
Usually that means the directory you are trying to store to already exists. 
 Pig won't overwrite existing data.  You should either move or remove the 
directory or change the directory name in your store function.
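
For example, something like this at the top of the script (the path and the
relation name are placeholders; rmf, unlike rm, does not complain if the path
is absent):

rmf /user/me/myoutput
...
store results into '/user/me/myoutput';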

Alan.

On Aug 9, 2012, at 7:42 PM, Haitao Yao wrote:

> hi, all
>   I got this while running pig script: 
> 
> 997: Unable to recreate exception from backend error:
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory 
> hdfs://DC-hadoop01:9000/tmp/pig-temp/temp548500412/tmp-1456742965 already 
> exists
>at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
>at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecsHelper(PigOutputFormat.java:207)
>at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.checkOutputSpecs(PigOutputFormat.java:188)
>at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:893)
>at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:415)
>at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>at 
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:856)
>at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:830)
>at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
>at 
> org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
>at 
> org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
>at java.lang.Thread.run(Thread.java:722)
> 
> 
> But I checked the script , the directory:  
> hdfs://DC-hadoop01:9000/tmp/pig-temp/temp548500412/tmp-1456742965 is not used 
> by the script explicitly, so I think it is used by the pig to store tmp 
> results.
> But why it exists? Isn't it unique?
> 
> 
> 
> 
> 
> 
> 
> 
> Haitao Yao
> yao.e...@gmail.com
> weibo: @haitao_yao
> Skype:  haitao.yao.final
> 



Re: User Defined Comparator

2012-08-09 Thread Alan Gates
There isn't a replacement for ComparisonFunc.  That was written before Pig had 
types so that users could do type specific comparison functions.   With the 
addition of types it was felt that ComparisonFunc was no longer necessary.  
That said, it's never been removed.  The testing is limited at this point, so I 
don't know how well it works.  If you find it works for you though you could 
still use it.

Alan.

On Jul 30, 2012, at 6:05 PM, Calvin Cheung wrote:

> Since ComparisonFunc is now depreciated, what is its replacement? I can't
> any information in the Javadoc. Is it safe to continue to extend
> ComparisonFunc for custom ordering?
> 
> Thanks.
> Calvin



Next Pig Hackathon

2012-07-30 Thread Alan Gates
Hortonworks will be hosting the next Pig Hackathon on August 24th.  
http://www.meetup.com/PigUser/events/75286212/

The agenda:

- Help newcomers get started on their first UDF or patch and walk through the 
Apache submission process

- Get the committers to look at patches that are ready but haven't been 
reviewed yet.

- Hammer out a proposal for a next generation UDF API.

- Work/discuss plans for any new features, bug fixes, etc. that attendees are 
interested in.

Hortonworks will provide lunch.

Please RSVP at the above URL and we look forward to seeing you there.

Alan.

Re: Trunk version does not like my macros

2012-07-26 Thread Alan Gates
Apache mail servers strip attachments.  Could you post your script somewhere or 
send it inline?

Alan.

On Jul 26, 2012, at 7:41 AM, Alex Rovner wrote:

> Gentlemen,
> 
> We have recently attempted to compile and use the latest trunk code and have 
> encountered a rather strange issue. Our job which is attached, has been 
> working fine on V11 of pig that we have compiled of trunk a while back:
> 
> Apache Pig version 0.11.0-SNAPSHOT (r1227411) 
> compiled Jan 04 2012, 19:34:06
> 
> When we attempted to switch to the latest trunk version yesterday, we have 
> encountered the following exception:
> 
> 
> Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: 
> Error during parsing. Can not create a Path from a null string
>   at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1595)
>   at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1534)
>   at org.apache.pig.PigServer.registerQuery(PigServer.java:516)
>   at 
> org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:987)
>   at 
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:412)
>   at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:193)
>   at org.apache.pig.PigServer.registerScript(PigServer.java:590)
>   at org.apache.pig.PigServer.registerScript(PigServer.java:692)
>   at org.apache.pig.PigServer.registerScript(PigServer.java:665)
>   at com.proclivitysystems.etl.job.PIGJobRunner.run(PIGJobRunner.java:244)
>   ... 2 more
> Caused by: java.lang.IllegalArgumentException: Can not create a Path from a 
> null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:78)
>   at org.apache.hadoop.fs.Path.(Path.java:90)
>   at 
> org.apache.pig.impl.io.FileLocalizer.fetchFilesInternal(FileLocalizer.java:766)
>   at 
> org.apache.pig.impl.io.FileLocalizer.fetchFile(FileLocalizer.java:733)
>   at 
> org.apache.pig.parser.QueryParserDriver.getMacroFile(QueryParserDriver.java:350)
>   at 
> org.apache.pig.parser.QueryParserDriver.makeMacroDef(QueryParserDriver.java:406)
>   at 
> org.apache.pig.parser.QueryParserDriver.expandMacro(QueryParserDriver.java:268)
>   at 
> org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:169)
>   at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1587)
>   ... 11 more
> 
> 
> I have tried to step through to figure out whats going on, and it seems like 
> the parser is trying to load our macro named "roas" from a "null" file thus 
> causing this issue. As you can see in the script we are not referencing any 
> external macros. All macros are defined within the file.
> 
> Any help would be appreciated.
> 
> Thanks
> Alex



Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix

2012-07-26 Thread Alan Gates
How will you handle ambiguities when there is an A::b and B::b?

Alan.

On Jul 26, 2012, at 6:54 AM, Alex Rovner wrote:

> I am proposing to patch avrostorage to have an option of storing field names 
> without their relation name. A::b will be saved as "b". 
> 
> Thoughts?
> 
> Sent from my iPhone
> 
> On Jul 25, 2012, at 5:48 AM, "Florian Zumkeller-Quast" 
>  wrote:
> 
>> Hello,
>> I got the following code:
>> 
>> A = LOAD '§file1' USING AvroStorage();
>> B = LOAD '$file2' USING AvroStorage();
>> C = JOIN A BY id LEFT OUTER, B BY id;
>> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE;
>> 
>> DESCRIBE shows the following data structure
>> 
>> D: {A::id: long,A::time: int,B::id: long,B::time: int}
>> E: {A::id: long,A::time: int,B::id: long,B::time: int}
>> 
>> But i can't store D and E using AvroStorage because the filed names contain 
>> "::" which is not an allowed character.
>> 
>> I need  structure like
>> F: {id: long,time: int}
>> where id = E::A::id and time = E::A::time.
>> 
>> The problem is: The number, name and type of fields may vary.
>> 
>> So E might looks like
>> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 
>> int}
>> 
>> Thus I can't use
>> 
>> F = FOREACH … GENERATE …;
>> 
>> because i don't want to write code for each filetype as long as I don't 
>> really 
>> need to.
>> 
>> Can someone give me an advice how to get the result I need?
>> 
>> Thanks!
>> 
>> With kind regards
>> Florian Zumkeller-Quast
>> -- 
>> Developer
>> 
>> 
>> ADITION technologies AG
>> Schwarzwaldstraße 78b
>> 79117 Freiburg
>> 
>> http://www.adition.com
>> 
>> T +49 / (0)761 / 88147 - 30
>> F +49 / (0)761 / 88147 - 77
>> SUPPORT +49  / (0)1805 - ADITION
>> 
>> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
>> 
>> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
>> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
>> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
>> UStIDNr.: DE 218 858 434



Re: when Algebraic UDF are used ?

2012-07-25 Thread Alan Gates
It can't use the algebraic interface in this case because the data has to be 
sorted (which means it has to see all the data) before passing it to your UDF.  
If you remove the ORDER statement then the algebraic portion of your UDF will 
be invoked.
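
That is, with the nested sort dropped the foreach reduces to roughly

C = FOREACH B GENERATE group, MyUDF(A.(k2,value));

and then the Initial/Intermed/Final methods of your UDF will be invoked,
assuming of course that your UDF can cope with unsorted input.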

Alan.

On Jul 25, 2012, at 9:32 AM, Benoit Mathieu wrote:

> Hi pig users,
> 
> I have coded my own algebraic UDF in Java, and it seems that pig do not use
> the algebraic interface at all. (I put some log messages in my
> Initial,Intermed and Final functions, and they re never logged).
> Pig uses only the main "exec" function.
> 
> My UDF needs to get the bag sorted.
> Here is my pig script:
> 
> A = LOAD '...' USING PigStorage() AS (k1:int,k2:int,value:int);
> B = GROUP A BY k1;
> C = FOREACH B {
>  tmp = ORDER A.(k2,value) BY k2;
>  GENERATE group, MyUDF(tmp);
> }
> ...
> 
> 
> Does anyone know why pig does not use the algebraic interface ?
> 
> thanks,
> 
> Benoit



Re: Access only data from LEFT OUTER JOIN side of joined data without projection prefix

2012-07-25 Thread Alan Gates
Basically you need to transform the schema, not the data.  The easiest way I 
can think of to do that is to use a UDF that has an outputSchema function that 
renames columns.  The exec call can then be a simple pass through.  

If you wanted to you could have it consolidate the join keys.  You imply you 
would like to consolidate other columns as well (A::E::time in your example), 
but that is not valid.  Since time is not a join key it will not necessarily be 
the same in A and E.
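
A rough sketch of what I mean (the class name is made up, and nested schemas
and error handling are glossed over):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class StripPrefixes extends EvalFunc<Tuple> {
    // Pass the data through untouched; only the declared schema changes.
    public Tuple exec(Tuple input) throws IOException {
        return input;
    }

    // Rename A::id to id, B::time to time, and so on.
    public Schema outputSchema(Schema input) {
        try {
            Schema renamed = new Schema();
            for (Schema.FieldSchema fs : input.getFields()) {
                String alias = fs.alias;
                if (alias != null && alias.contains("::")) {
                    alias = alias.substring(alias.lastIndexOf("::") + 2);
                }
                renamed.add(new Schema.FieldSchema(alias, fs.schema, fs.type));
            }
            return new Schema(new Schema.FieldSchema(null, renamed, DataType.TUPLE));
        } catch (FrontendException e) {
            throw new RuntimeException(e);
        }
    }
}

You would call it as something like
F = FOREACH E GENERATE FLATTEN(StripPrefixes(*));
Note it does nothing about the collisions mentioned above (A::id and B::id both
become id), so you would still have to project one of them away first.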

Alan.

On Jul 25, 2012, at 2:48 AM, Florian Zumkeller-Quast wrote:

> Hello,
> I got the following code:
> 
> A = LOAD '§file1' USING AvroStorage();
> B = LOAD '$file2' USING AvroStorage();
> C = JOIN A BY id LEFT OUTER, B BY id;
> SPLIT C INTO D IF B::id IS NULL, E OTHERWISE;
> 
> DESCRIBE shows the following data structure
> 
> D: {A::id: long,A::time: int,B::id: long,B::time: int}
> E: {A::id: long,A::time: int,B::id: long,B::time: int}
> 
> But i can't store D and E using AvroStorage because the filed names contain 
> "::" which is not an allowed character.
> 
> I need  structure like
> F: {id: long,time: int}
> where id = E::A::id and time = E::A::time.
> 
> The problem is: The number, name and type of fields may vary.
> 
> So E might looks like
> E: {A::id: long,A::time: int,A::fieldN1,B::id: long,B::time: int,B::fieldN1 
> int}
> 
> Thus I can't use
> 
> F = FOREACH … GENERATE …;
> 
> because i don't want to write code for each filetype as long as I don't 
> really 
> need to.
> 
> Can someone give me an advice how to get the result I need?
> 
> Thanks!
> 
> With kind regards
> Florian Zumkeller-Quast
> -- 
> Developer
> 
> 
> ADITION technologies AG
> Schwarzwaldstraße 78b
> 79117 Freiburg
> 
> http://www.adition.com
> 
> T +49 / (0)761 / 88147 - 30
> F +49 / (0)761 / 88147 - 77
> SUPPORT +49  / (0)1805 - ADITION
> 
> (Festnetzpreis 14 ct/min; Mobilfunkpreise maximal 42 ct/min)
> 
> Eingetragen beim Amtsgericht Düsseldorf unter HRB 54076
> Vorstände: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
> Aufsichtsratsvorsitzender: Rechtsanwalt Daniel Raimer
> UStIDNr.: DE 218 858 434



Re: None. wtf is None?

2012-07-24 Thread Alan Gates
Can you attach a sample of the input data?  I'm guessing None came from the 
input data.  

Alan.

On Jul 23, 2012, at 10:49 PM, Russell Jurney wrote:

> Can someone explain this script to me? It is freaking me out. When did Pig
> start spitting out 'None' in place of null?
> 
> register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
> register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
> register /me/pig/contrib/piggybank/java/piggybank.jar
> 
> define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
> 
> rmf /tmp/sent_mails
> rmf /tmp/replies
> 
> /* Get rid of emails with reply_to, as they confuse everything in mailing
> lists. */
> avro_emails = load '/me/tmp/thu_emails' using AvroStorage();
> clean_emails = filter avro_emails by froms is not null and reply_tos is
> null;
> 
> /* Treat emails without in_reply_to as sent emails */
> combined_emails = foreach clean_emails generate froms, tos, message_id;
> *sent_mails = foreach combined_emails generate flatten(froms.address) as
> from, *
> *  flatten(tos.address) as to, *
> *  message_id;*
> store sent_mails into '/tmp/sent_mails';
> 
> /* Treat in_reply_tos separately, as our FLATTEN() will filter otu the
> nulls */
> *replies = filter clean_emails by in_reply_to is not null;*
> *replies = foreach replies generate flatten(froms.address) as from,*
> *   flatten(tos.address) as to,*
> *   in_reply_to;*
> store replies into '/tmp/replies';
> 
> 
> Despite filtering replies to emails that only have the 'in_reply_to'
> field... I get the same number of records in both relations I store:
> 
> russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/sent_mails/p*|wc -l
>   17431
> russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/replies/p*|wc -l
>   17431
> 
> 
> Investigating shows me:
> 
> cat /tmp/replies/part-1
> 
> c...@hotmail.com russell.jur...@gmail.com None
> c...@hotmail.com russell.jur...@gmail.com
>  voice-nore...@google.com russell.jur...@gmail.com None
> 
> 
> Where did *None* come from? I thought FLATTEN would prune records with
> empty columns, and I'm ok with it not but... what operators does None
> respond to? It is not null. How do I prune these?
> -- 
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com



Re: Can't JOIN self?

2012-07-20 Thread Alan Gates
It isn't a bug that you need to declare the join twice in your script.  That is 
necessary for clarity and semantic correctness.  That is, if we allowed:

A = load 'bla';
B = join A by user, A by user;

then you'd have two user fields in B with no way to disambiguate.  What's a 
bug (or missed optimization opportunity) is that we actually double read and 
shuffle the data.  We could optimize here and only read and shuffle one copy and 
then do the join in the reduce.
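
So for now you have to spell it out, something like this (the input path is a
placeholder):

pairs  = load 'emails' as (from:chararray, to:chararray,
                           message_id:chararray, in_reply_to:chararray);
pairs2 = load 'emails' as (from:chararray, to:chararray,
                           message_id:chararray, in_reply_to:chararray);
with_reply = join pairs by in_reply_to, pairs2 by message_id;
-- fields are then referenced as pairs::message_id, pairs2::message_id, etc.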

Alan.

On Jul 20, 2012, at 12:53 AM, Dmitriy Ryaboy wrote:

> It's kind if a waste of io and mappers. If not a bug, it's an optimization 
> opportunity. 
> 
> On Jul 19, 2012, at 10:34 PM, Bill Graham  wrote:
> 
>> No, it isn't a bug as I see it. You need to load the two relations
>> separately because a join is across two separate data sources.
>> 
>> 
>> On Thu, Jul 19, 2012 at 10:10 PM, Russell Jurney
>> wrote:
>> 
>>> So it is a bug? Because Pig will not let me self JOIN. I have to LOAD the
>>> data twice.
>>> 
>>> On Thu, Jul 19, 2012 at 9:49 PM, Bill Graham  wrote:
>>> 
 No, to Pig a self join is just like a regular join across two different
 relations. It just happens to be to the same input data.
 
 On Thu, Jul 19, 2012 at 8:39 PM, Russell Jurney  wrote:
 
> Is this a bug?
> 
> On Thu, Jul 19, 2012 at 8:00 PM, Robert Yerex <
> robert.ye...@civitaslearning.com> wrote:
> 
>> The only way to get it to work is to load a second copy.
>> 
>> On Thu, Jul 19, 2012 at 7:46 PM, Russell Jurney <
> russell.jur...@gmail.com
>>> wrote:
>> 
>>> Note: this works if I LOAD a new, 2nd relation and do the join.
>>> 
>>> On Thu, Jul 19, 2012 at 7:34 PM, Russell Jurney <
>> russell.jur...@gmail.com
 wrote:
>>> 
 I have a problem where I can't join a relation to itself on a
> different
 field.
 
 describe pairs
 pairs: {from: chararray,to: chararray,message_id:
>> chararray,in_reply_to:
 chararray}
 
 pairs2 = pairs;
 
 with_reply = join pairs by in_reply_to, pairs2 by message_id;
 
 
 I get this error:
 
 2012-07-19 19:31:16,927 [main] ERROR
> org.apache.pig.tools.grunt.Grunt -
 ERROR 1200: Pig script failed to parse:
  pig script failed to validate:
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225:
>>> Projection
 with nothing to reference!
 2012-07-19 19:31:16,928 [main] ERROR
> org.apache.pig.tools.grunt.Grunt -
 Failed to parse: Pig script failed to parse:
  pig script failed to validate:
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225:
>>> Projection
 with nothing to reference!
 at
 
>> 
 org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:182)
 at
 org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1565)
 at
 org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1538)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:540)
 at
>>> 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:970)
 at
 
>>> 
>> 
> 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
 at
 
>>> 
>> 
> 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
 at
 
>>> 
>> 
> 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
 at org.apache.pig.Main.run(Main.java:490)
 at org.apache.pig.Main.main(Main.java:111)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 
>>> 
>> 
> 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 
>>> 
>> 
> 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 Caused by:
  pig script failed to validate:
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2225:
>>> Projection
 with nothing to reference!
 at
 
>>> 
>> 
> 
 org.apache.pig.parser.LogicalPlanBuilder.buildJoinOp(LogicalPlanBuilder.java:363)
 at
 
>>> 
>> 
> 
 org.apache.pig.parser.LogicalPlanGenerator.join_clause(LogicalPlanGenerator.java:11354)
 at
 
>>> 
>> 
> 
 org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1489)
 at
 
>>> 
>> 
> 
 org.apache.pig.parser.LogicalPlanGenerator.general_statement(

Re: apache tar releases don't contain piggybank as a jar

2012-07-16 Thread Alan Gates
The big reason is we'd like to split off piggybank into a separate source 
control system (like github) rather than keeping it in Pig proper.  Given this, 
it doesn't make sense to be releasing piggybank with Pig.

Alan.

On Jul 12, 2012, at 9:37 AM, David Capwell wrote:

> Is there a reason that piggybank isn't compiled and put into the apache tar
> releases as a jar?



Re: Join with greater/less then condition

2012-07-05 Thread Alan Gates
Pig can only do equi-joins.  Theta joins are hard in MapReduce.  So the way to 
do this is to do the equi-join and then filter afterwards.  This will not create 
significant additional cost since the join results will be filtered before 
being materialized to disk.

C = Join table_a by (user_id, title_id), table_b by (user_id, title_id);
D = filter C by table_a::timestamp > table_b::timestamp;

Alan.

On Jul 5, 2012, at 12:21 PM, sonia gehlot wrote:

> Hi Guys,
> 
> I want to join 2 tables in hive on couple of columns and out them one
> condition is timestamp of one column is greater then the other one. In SQL
> I could have written in this way:
> 
> table_a a Join table_b b
> on a.user_id = b.user_id
> and a.title_id = b.title_id
> and a.timestamp > b.timestamp
> 
> How to write last condition in Pig? *a.timestamp > b.timestamp*
> 
> Thanks,
> Sonia



Re: One file with sorted results.

2012-07-03 Thread Alan Gates
You can set different parallel levels at different parts of your script by 
attaching parallel to the different operations.  For example:

Y = join W by a, X by b parallel 100;
Z = order Y by a parallel 1;
store Z into 'onefile';

If your output is big I would suggest trying out ordering in parallel as well 
and then using HDFS's cat command in a separate pass to see if it is faster.  
It will write twice but it won't flood one reducer with all of the data.
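
Something along these lines (paths are placeholders); since order by produces
globally sorted part files, concatenating them in file name order preserves
the total order:

Z = order Y by a parallel 20;
store Z into '/user/sonia/feed';

and then, outside Pig:

hadoop fs -cat /user/sonia/feed/part-* > feed.txt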

Alan.

On Jul 2, 2012, at 4:59 PM, sonia gehlot wrote:

> Hi Guys,
> 
> I have use case, where I need to generate data feed using Pig script. Data
> feed in total is of about 12 GB.
> 
> I want Pig script to generate 1 file and data in that data should be sorted
> as well. I know I can run it with one reducer as dataset is big with lot of
> Joins it takes forever to finish.
> 
> What are the other options to get one sorted file with better performance.
> 
> Thanks in advance,
> 
> Sonia



Re: Best Practice: store depending on data content

2012-07-02 Thread Alan Gates

On Jul 2, 2012, at 5:57 AM, Ruslan Al-Fakikh wrote:

> Hey Alan,
> 
> I am not familiar with Apache processes, so I could be wrong in my
> point 1, I am sorry.
I wasn't trying to say you were right or wrong, just trying to understand your 
perspective.

> Basically my impressions was that Cloudera is pushing Avro format for
> intercommunications between hadoop tools like pig, hive and mapreduce.
> https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
> http://www.cloudera.com/blog/2011/07/avro-data-interop/
> And if I decide to use Avro then HCatalog becomes a little redundant.
> It would give me the list of datasets in one place accessible from all
> tools, but all the columns (names and types) would be stored in Avro
> schemas and Hive metastore becomes just a stub for those Avro schemas:
> https://github.com/jghoman/haivvreo#creating-avro-backed-hive-tables
> And having those avro schemas I could access data from pig and
> mapreduce without HCatalog. Though I haven't figured out how to deal
> without hive partitions yet.

It's true Avro can store schema data.  HCatalog does much more than this and 
aspires to add to that set of features in the future.  It will soon provide a 
REST API for external systems to interact with the metadata.  It allows you to 
store data in HBase or other non-HDFS systems.  In the future it will provide 
interfaces to data life cycle management tools like cleaning tools, replication 
tools, etc.  And it does not bind you to one storage format.  That said, if you 
don't need any of these things Avro may be a good solution for your situation.  
Definitely choose the tool that best fits your need.

Alan.

> 
> Best Regards,
> Ruslan
> 
> On Fri, Jun 29, 2012 at 9:13 PM, Alan Gates  wrote:
>> On a different topic, I'm interested in why you refuse to use a project in 
>> the incubator.  Incubation is the Apache process by which a community is built 
>> around the code.  It says nothing about the maturity of the code.
>> 
>> Alan.
>> 
>> On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:
>> 
>>> Hi Markus,
>>> 
>>> Currently I am doing almost the same task. But in Hive.
>>> In Hive you can use the native Avro+Hive integration:
>>> https://issues.apache.org/jira/browse/HIVE-895
>>> Or haivvreo project if you are not using the latest version of Hive.
>>> Also there is a Dynamic Partition feature in Hive that can separate
>>> your data by a column value.
>>> 
>>> As for HCatalog - I refused to use it after some investigation, because:
>>> 1) It is still incubating
>>> 2) It is not supported by Cloudera (the distribution provider we are
>>> currently using)
>>> 
>>> I think it would be perfect if MultiStorage would be generic in the
>>> way you described, but I am not familiar with it.
>>> 
>>> Ruslan
>>> 
>>> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair  wrote:
>>>> I am not aware of any work on adding those features to MultiStorage.
>>>> 
>>>> I think the best way to do this is to use Hcatalog. (It makes the hive
>>>> metastore available for all of hadoop, so you get metadata for your data as
>>>> well).
>>>> You can associate a outputformat+serde for a table (instead of file name
>>>> ending), and HCatStorage will automatically pick the right format.
>>>> 
>>>> Thanks,
>>>> Thejas
>>>> 
>>>> 
>>>> 
>>>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>>>> 
>>>>> Thanks Thejas,
>>>>> 
>>>>> This _really_ helped a lot :)
>>>>> Some additional question on this:
>>>>> As far as I see, the MultiStorage is currently just capable to write CSV
>>>>> output, right? Is there any attempt ongoing currently to make this
>>>>> storage more generic regarding the format of the output data? For our
>>>>> needs we would require AVRO output as well as some special proprietary
>>>>> binary encoding for which we already created our own storage. I'm
>>>>> thinking about a storage that will select a certain writer method
>>>>> depending to the file names ending.
>>>>> 
>>>>> Do you know of such efforts?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Markus
>>>>> 
>>>>> 
>>>>> Am Freitag, den 22.06.2012, 11:23 -0700 schrieb Thejas Nair:
>>>>>> 
>>>>>> You can use MultiStorage store func -
>>

Re: Best Practice: store depending on data content

2012-06-29 Thread Alan Gates
On a different topic, I'm interested in why you refuse to use a project in the 
incubator.  Incubation is the Apache process by which a community is built around 
the code.  It says nothing about the maturity of the code.  

Alan.

On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:

> Hi Markus,
> 
> Currently I am doing almost the same task. But in Hive.
> In Hive you can use the native Avro+Hive integration:
> https://issues.apache.org/jira/browse/HIVE-895
> Or haivvreo project if you are not using the latest version of Hive.
> Also there is a Dynamic Partition feature in Hive that can separate
> your data by a column value.
> 
> As for HCatalog - I refused to use it after some investigation, because:
> 1) It is still incubating
> 2) It is not supported by Cloudera (the distribution provider we are
> currently using)
> 
> I think it would be perfect if MultiStorage would be generic in the
> way you described, but I am not familiar with it.
> 
> Ruslan
> 
> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair  wrote:
>> I am not aware of any work on adding those features to MultiStorage.
>> 
>> I think the best way to do this is to use Hcatalog. (It makes the hive
>> metastore available for all of hadoop, so you get metadata for your data as
>> well).
>> You can associate a outputformat+serde for a table (instead of file name
>> ending), and HCatStorage will automatically pick the right format.
>> 
>> Thanks,
>> Thejas
>> 
>> 
>> 
>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>> 
>>> Thanks Thejas,
>>> 
>>> This _really_ helped a lot :)
>>> Some additional question on this:
>>> As far as I see, the MultiStorage is currently just capable to write CSV
>>> output, right? Is there any attempt ongoing currently to make this
>>> storage more generic regarding the format of the output data? For our
>>> needs we would require AVRO output as well as some special proprietary
>>> binary encoding for which we already created our own storage. I'm
>>> thinking about a storage that will select a certain writer method
>>> depending to the file names ending.
>>> 
>>> Do you know of such efforts?
>>> 
>>> Thanks
>>> 
>>> Markus
>>> 
>>> 
>>> Am Freitag, den 22.06.2012, 11:23 -0700 schrieb Thejas Nair:
 
 You can use MultiStorage store func -
 
 http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
 
 Or if you want something more flexible, and have metadata as well, use
 hcatalog . Specify the keys on which you want to partition as your
 partition keys in the table. Then use HcatStorer() to store the data.
 See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
 
 Thanks,
 Thejas
 
 
 
 On 6/22/12 4:54 AM, Markus Resch wrote:
> 
> Hey everyone,
> 
> We're doing some aggregation. The result contains a key where we want to
> have a single output file for each key. Is it possible to store files
> like this? Especially adjusting the path by the key's value.
> 
> Example:
> Input = LOAD 'my/data.avro' USING AvroStorage;
> [ doing stuff]
> Output = GROUP AggregatesValues BY Key;
> FOREACH Output Store * into
> '/my/output/path/by/$Output.Key/Result.avro'
> 
> I know this example does not work. But is there anything similar
> possible? And, as I assume, not: is there some framework in the hadoop
> world that can do such stuff?
> 
> 
> Thanks
> 
> Markus
> 
> 
>>> 
>>> 
>> 



Re: modulize pig scripts via 'run'; pass param containing special chars

2012-06-29 Thread Alan Gates
Does putting the parameters in a file using -param_file help?
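
For example (the file name and the timestamp value are placeholders, and the
exact quoting of the long expression may still take some trial and error):

In params.txt:

timestamp = 2012-06-27
time_in_customers_timezone = "(int)SUBSTRING(DATE_TIME(UnixToISO(((long)Our.TimeStamp)*1000), Timezone),11,13) AS Hour"

and then:

run -param_file params.txt our_script.pig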

Alan.

On Jun 27, 2012, at 9:02 AM, Markus Resch wrote:

> Hey everyone, 
> 
> we're still using CDH3u3 pig (0.8.1). 
> As out pig scripts are growing we like to split them to modules and call
> them via run. the parameter substitution allows us to write very generic
> scripts and modify them while calling. This worked very well until we
> came to the point where we tried to pass a kind of complex stack of UDF
> calls:
> 
> run -param timestamp=$timestamp -param
> time_in_customers_timezone="(int)SUBSTRING(DATE_TIME(UnixToISO(((long)Our.TimeStamp)*1000),
>  Timezone),11,13) AS Hour" our_script.pig
> 
> This line created the following error message:
> 
> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during
> parsing. Lexical error at line 32, column 63.  Encountered: "(" (40),
> after : ""
> 
> We've desperately tried a lot of things (escaping, single/double quotes,
> storing it in a local string ...)
> 
> Does anyone have a suggestion for us?
> 
> Thanks
> 
> Markus
> 
> 
> 
> 


