RE: NoClassDefFoundError after upgrading to pig 0.10.0 from 0.9.0

2012-07-02 Thread Matthew Hayes
Sure thanks for confirming, will do :) -Matt From: Gianmarco De Francisci Morales [g...@apache.org] Sent: Monday, July 02, 2012 6:36 AM To: user@pig.apache.org Subject: Re: NoClassDefFoundError after upgrading to pig 0.10.0 from 0.9.0 We can simply generat

Replace function in pig 0.8

2012-07-02 Thread Ranjith
Does the replace function replace adjacent occurrences of the string or does one need to specify it using regex? Thanks, Ranjith
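
A minimal sketch of the semantics being asked about, assuming Pig's built-in REPLACE passes its search argument to Java's `String.replaceAll` as a regular expression. That delegation is an assumption here and should be checked against the Pig version in use. Under that assumption every match is replaced, adjacent ones included, and regex metacharacters in the search string must be quoted to be taken literally. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceSketch {
    // Replaces every match of the regex, adjacent matches included.
    public static String replaceRegex(String source, String regex, String replacement) {
        return source.replaceAll(regex, replacement);
    }

    // Treats both the search string and the replacement as literal text.
    public static String replaceLiteral(String source, String literal, String replacement) {
        return source.replaceAll(Pattern.quote(literal), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceRegex("aaab", "a", "x"));     // xxxb
        System.out.println(replaceRegex("a.b.c", ".", "-"));    // -----
        System.out.println(replaceLiteral("a.b.c", ".", "-"));  // a-b-c
    }
}
```

The second call shows the usual surprise: an unquoted `.` matches every character, so a literal dot needs escaping.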

RE: One file with sorted results.

2012-07-02 Thread Duckworth, Will
Have you tried breaking it into 2 jobs? The first does the pre-sort work, then a final job does the sort with a single reducer? Will Duckworth Senior Vice President, Software Engineering | comScore, Inc. (NASDAQ:SCOR) o +1 (703) 438-2108 | m +1 (301) 606-2977 | mailto:wduckwo...@comscore.com

One file with sorted results.

2012-07-02 Thread sonia gehlot
Hi Guys, I have a use case where I need to generate a data feed using a Pig script. The data feed is about 12 GB in total. I want the Pig script to generate 1 file, and the data in that file should be sorted as well. I know I can run it with one reducer, but as the dataset is big with a lot of joins it takes forever to fi

Re: Read avro record from HDFS

2012-07-02 Thread Fabian Alenius
Thanks! Unfortunately I'm stuck with pig 0.9.2 so I'm looking for an example in Java. On Tue, Jul 3, 2012 at 1:03 AM, Russell Jurney wrote: > Check out chapter 3 of my book for a tutorial. Be sure to use pig 0.10. > > http://ofps.oreilly.com/titles/9781449326265/chapter_3.html > > Russell Jurney

Re: Read avro record from HDFS

2012-07-02 Thread Russell Jurney
Check out chapter 3 of my book for a tutorial. Be sure to use pig 0.10. http://ofps.oreilly.com/titles/9781449326265/chapter_3.html Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com On Jul 2, 2012, at 3:50 PM, Fabian Alenius wrote: > Hi, > > does anyone happen to have

Read avro record from HDFS

2012-07-02 Thread Fabian Alenius
Hi, does anyone happen to have a sample of how to load an Avro record from HDFS given a location? In my case the schema is just "binary". I'm working on a custom loader and I've been playing around with the Avro API, but so far no luck. Thanks, Fabian

Re: What is the best way to do counting in pig?

2012-07-02 Thread Subir S
Right!! Since it is mentioned that job is hanging, wild guess is it must be 'group all'. How can that be confirmed? On 7/3/12, Jonathan Coveney wrote: > group all uses a single reducer, but COUNT is algebraic, and as such, will > use combiners, so it is generally quite fast. > > 2012/7/2 Subir S

Re: Generating multiple tuples from single tuple

2012-07-02 Thread naresh
@Jonathan Coveney: Thanks a lot for the detailed explanation. I got the point now. Thanks for your time, Naresh. On Mon, Jul 2, 2012 at 1:19 PM, Jonathan Coveney wrote: > IMHO, if you want this to be more generic, I would have it just take the > full line, and then parse it out. Why? Because what

Re: What is the best way to do counting in pig?

2012-07-02 Thread Sheng Guo
I guess that's the reason: using a single reducer may cause problems when the data is huge, and the counting is very time-consuming or even dies at the end. What do you mean by counting star to null fields? Can you explain a little more on this? What is the difference between this one and the standar

Re: What is the best way to do counting in pig?

2012-07-02 Thread Jonathan Coveney
group all uses a single reducer, but COUNT is algebraic, and as such, will use combiners, so it is generally quite fast. 2012/7/2 Subir S > Group all - uses single reducer AFAIU. You can try to count per group > and sum may be. > > You may also try with COUNT_STAR to include NULL fields. > > On
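
A rough sketch of why an algebraic COUNT stays cheap even behind a single reducer: each map-side chunk is reduced to one partial count (the combiner's job), and the reducer only adds those few numbers together. This is plain Java that models the idea, not Pig's actual implementation; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AlgebraicCountSketch {
    // Combiner-like step: collapse one chunk of records into a single partial count.
    public static long partialCount(List<String> chunk) {
        return chunk.size();
    }

    // Reducer-like step: the single reducer sums partial counts, never raw records.
    public static long finalCount(List<Long> partials) {
        long total = 0;
        for (long partial : partials) {
            total += partial;
        }
        return total;
    }

    // Runs both steps over a list of chunks, one chunk per simulated map task.
    public static long count(List<List<String>> chunks) {
        List<Long> partials = new ArrayList<>();
        for (List<String> chunk : chunks) {
            partials.add(partialCount(chunk));
        }
        return finalCount(partials);
    }

    public static void main(String[] args) {
        List<List<String>> chunks = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d"),
                Arrays.asList("e", "f"));
        System.out.println(count(chunks)); // 6
    }
}
```

The reducer's input size depends on the number of chunks rather than the number of records, which is why a hang is more likely to come from another part of the script.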

Re: What is the best way to do counting in pig?

2012-07-02 Thread Subir S
Group all - uses single reducer AFAIU. You can try to count per group and sum, maybe. You may also try with COUNT_STAR to include NULL fields. On 7/3/12, Sheng Guo wrote: > Hi all, > > I used to use the following pig script to do the counting of the records. > > m_skill_group = group m_skills_fi

Re: What is the best way to do counting in pig?

2012-07-02 Thread Jonathan Coveney
The code you posted should be performant. a group all -> count is quite fast, so my guess is that there is something else going on. can you paste your whole script? 2012/7/2 Sheng Guo > No. I try to figure out how many records (rows) in 'm_skill_group' table. > (That limit statement actually is

Re: What is the best way to do counting in pig?

2012-07-02 Thread Sheng Guo
No. I try to figure out how many records (rows) in 'm_skill_group' table. (That limit statement actually is not necessary) Thanks! On Mon, Jul 2, 2012 at 1:20 PM, Jonathan Coveney wrote: > Is your goal to have the 10 largest rows by member_id? > > 2012/7/2 Sheng Guo > > > Hi all, > > > > I us

Re: What is the best way to do counting in pig?

2012-07-02 Thread Jonathan Coveney
Is your goal to have the 10 largest rows by member_id? 2012/7/2 Sheng Guo > Hi all, > > I used to use the following pig script to do the counting of the records. > > m_skill_group = group m_skills_filter by member_id; > grpd = group m_skill_group all; > cnt = foreach grpd generate COUNT(m_skill_

Re: Generating multiple tuples from single tuple

2012-07-02 Thread Jonathan Coveney
IMHO, if you want this to be more generic, I would have it just take the full line, and then parse it out. Why? Because what happens when you have an indeterminate number of columns? That's my own personal opinion though. As far as implementation, I would return a DataBag (because what you want are
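
A minimal sketch of the splitting logic discussed in this thread, written in plain Java so it runs without Pig on the classpath. A list of lists stands in for a DataBag of tuples, and the sketch assumes the id is repeated in every output tuple; both are assumptions, as are the class and method names. In a real UDF the same loop would sit inside an `EvalFunc` that returns a DataBag, with FLATTEN applied in the script.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitRowSketch {
    // Takes one row (id first, then any number of columns) and returns
    // one (id, column) pair per column. The outer list models a bag.
    public static List<List<String>> split(List<String> row) {
        List<List<String>> bag = new ArrayList<>();
        if (row == null || row.size() < 2) {
            return bag; // nothing to split: return an empty bag
        }
        String id = row.get(0);
        for (int i = 1; i < row.size(); i++) {
            bag.add(Arrays.asList(id, row.get(i)));
        }
        return bag;
    }

    public static void main(String[] args) {
        System.out.println(split(Arrays.asList("id1", "column1", "column2")));
        // [[id1, column1], [id1, column2]]
    }
}
```

Because the loop runs over however many columns arrive, the same code handles rows with an indeterminate number of columns.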

Re: Best Practice: store depending on data content

2012-07-02 Thread Alan Gates
On Jul 2, 2012, at 5:57 AM, Ruslan Al-Fakikh wrote: > Hey Alan, > > I am not familiar with Apache processes, so I could be wrong in my > point 1, I am sorry. I wasn't trying to say you were right or wrong, just trying to understand your perspective. > Basically my impression was that Cloudera

Re: Generating multiple tuples from single tuple

2012-07-02 Thread naresh
Thanks for the suggestions. @Jonathan Coveney: input tuple : (id1,column1,column2) output : two tuples (id1,column1) and (id2,column2) so it is List or should I return a Bag? public class SPLITTUPPLE extends EvalFunc > { public List exec(Tuple input) throws IOException { if (input

Re: Best Practice: store depending on data content

2012-07-02 Thread Dmitriy Ryaboy
"It would give me the list of datasets in one place accessible from all tools," And that's exactly why you want it. D On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh wrote: > Hey Alan, > > I am not familiar with Apache processes, so I could be wrong in my > point 1, I am sorry. > Basically my

Re: Generating multiple tuples from single tuple

2012-07-02 Thread Jonathan Coveney
You can probably hack together something that will do exactly this without writing a UDF, but I think a UDF will be most useful here...especially if you want to add more columns, etc etc. 2012/7/1 Subir S > Would FLATTEN help? > > B = GROUP A by ID; > > C = FOREACH B GENERATE group, FLATTEN ($1)

Re: NoClassDefFoundError after upgrading to pig 0.10.0 from 0.9.0

2012-07-02 Thread Gianmarco De Francisci Morales
We can simply generate the pom dynamically as we already do with the ivy.xml file. Cheers, -- Gianmarco On Mon, Jul 2, 2012 at 3:58 AM, Dmitriy Ryaboy wrote: > Yep, you are right, the pom is not generated, but checked in > statically; looks like it's out of date. One more reason to mavenize

Re: Best Practice: store depending on data content

2012-07-02 Thread Ruslan Al-Fakikh
Hey Alan, I am not familiar with Apache processes, so I could be wrong in my point 1, I am sorry. Basically my impression was that Cloudera is pushing Avro format for intercommunications between hadoop tools like pig, hive and mapreduce. https://ccp.cloudera.com/display/CDHDOC/Avro+Usage http://w