Sure, thanks for confirming, will do :)
-Matt
From: Gianmarco De Francisci Morales [g...@apache.org]
Sent: Monday, July 02, 2012 6:36 AM
To: user@pig.apache.org
Subject: Re: NoClassDefFoundError after upgrading to pig 0.10.0 from 0.9.0
We can simply generate the pom dynamically as we already do with the ivy.xml file.
Does the replace function replace adjacent occurrences of the string or does
one need to specify it using regex?
Thanks,
Ranjith
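For context: Pig's built-in REPLACE(str, regex, replacement) delegates to Java's String.replaceAll, so the second argument is always treated as a regex, and every non-overlapping occurrence, adjacent ones included, is replaced in a single left-to-right pass. A quick stdlib-only demo of those semantics (the sample strings are made up):

```java
public class ReplaceSemantics {
    public static void main(String[] args) {
        // Adjacent occurrences are each matched and replaced, left to right
        System.out.println("ababab".replaceAll("ab", "x"));   // xxx
        // A run of single characters is replaced one occurrence at a time...
        System.out.println("aaab".replaceAll("a", "-"));      // ---b
        // ...or the whole run can be collapsed with a quantifier
        System.out.println("aaab".replaceAll("a+", "-"));     // -b
    }
}
```

So no special regex is needed just to catch adjacent occurrences; a quantifier is only needed if a whole run should count as one match.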
Have you tried breaking it into 2 jobs? The first does the pre-sort work, then a
final job runs the sort with the single reducer?
Will Duckworth Senior Vice President, Software Engineering | comScore,
Inc.(NASDAQ:SCOR)
o +1 (703) 438-2108 | m +1 (301) 606-2977 | mailto:wduckwo...@comscore.com
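The two-job split suggested above could look like the following Pig Latin sketch; every path, field name, and PARALLEL value here is a placeholder rather than something from this thread:

```pig
-- Job 1: run the expensive joins at full parallelism, store the unsorted result
a = LOAD 'input/a' AS (id:long, v1:chararray);
b = LOAD 'input/b' AS (id:long, v2:chararray);
joined = JOIN a BY id, b BY id PARALLEL 40;
STORE joined INTO 'tmp/joined';

-- Job 2: only the final total-order sort pays the single-reducer cost
j = LOAD 'tmp/joined' AS (id:long, v1:chararray, id2:long, v2:chararray);
sorted = ORDER j BY id PARALLEL 1;  -- one reducer => one sorted output part file
STORE sorted INTO 'output/feed';
```

The point of the split is that ORDER BY with PARALLEL 1 funnels everything through one reducer, so keeping the joins in a separate, fully parallel job means only the sort itself is serialized.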
Hi Guys,
I have a use case where I need to generate a data feed using a Pig script. The
data feed totals about 12 GB.
I want the Pig script to generate one file, and the data in it should be sorted
as well. I know I can run it with one reducer, but the dataset is big and with a
lot of joins it takes forever to finish
Thanks!
Unfortunately I'm stuck with pig 0.9.2 so I'm looking for an example in Java.
On Tue, Jul 3, 2012 at 1:03 AM, Russell Jurney wrote:
> Check out chapter 3 of my book for a tutorial. Be sure to use Pig 0.10.
>
> http://ofps.oreilly.com/titles/9781449326265/chapter_3.html
>
> Russell Jurney
Check out chapter 3 of my book for a tutorial. Be sure to use Pig 0.10.
http://ofps.oreilly.com/titles/9781449326265/chapter_3.html
Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com
On Jul 2, 2012, at 3:50 PM, Fabian Alenius wrote:
> Hi,
>
> does anyone happen to have a sample of how to load an Avro record from HDFS
Hi,
does anyone happen to have a sample of how to load an Avro record from HDFS,
given a location? In my case the schema is just "binary".
I'm working on a custom loader and I've been playing around with the avro
API, but so far no luck.
Thanks,
Fabian
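A minimal sketch of reading an Avro container file off HDFS with the plain Avro API; the path is a placeholder, and this assumes the avro and hadoop-common jars are on the classpath. The nice part is that the reader recovers the writer's schema from the file header, so nothing needs to be known in advance:

```java
import java.io.InputStream;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AvroHdfsReader {
    public static void main(String[] args) throws Exception {
        Path path = new Path("hdfs:///data/records.avro");  // placeholder location
        InputStream in = FileSystem.get(new Configuration()).open(path);
        // GenericDatumReader with no schema argument uses the schema
        // embedded in the file header
        DataFileStream<GenericRecord> stream =
            new DataFileStream<GenericRecord>(in, new GenericDatumReader<GenericRecord>());
        System.out.println("schema: " + stream.getSchema());
        for (GenericRecord rec : stream) {
            System.out.println(rec);
        }
        stream.close();
    }
}
```

Inside a custom LoadFunc the stream would come from the InputFormat's record reader rather than FileSystem.open, but the DataFileStream/GenericDatumReader pairing is the same.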
Right!!
Since it is mentioned that the job is hanging, my wild guess is that it must be
the 'group all'. How can that be confirmed?
On 7/3/12, Jonathan Coveney wrote:
> group all uses a single reducer, but COUNT is algebraic, and as such, will
> use combiners, so it is generally quite fast.
>
> 2012/7/2 Subir S
@Jonathan Coveney:
Thanks a lot for the detailed explanation. I got the point now.
Thanks for your time,
Naresh.
On Mon, Jul 2, 2012 at 1:19 PM, Jonathan Coveney wrote:
> IMHO, if you want this to be more generic, I would have it just take the
> full line, and then parse it out. Why? Because what
I guess that's the reason: using a single reducer may cause problems when the
data is huge; the counting is very time-consuming or may even die at the end.
What do you mean by COUNT_STAR including NULL fields? Can you explain a little
more on this? What is the difference between it and the standard
group all uses a single reducer, but COUNT is algebraic, and as such, will
use combiners, so it is generally quite fast.
2012/7/2 Subir S
> Group all - uses a single reducer AFAIU. You can try to count per group
> and sum, maybe.
>
> You may also try with COUNT_STAR to include NULL fields.
>
> On
Group all - uses a single reducer AFAIU. You can try to count per group
and sum, maybe.
You may also try with COUNT_STAR to include NULL fields.
On 7/3/12, Sheng Guo wrote:
> Hi all,
>
> I used to use the following pig script to do the counting of the records.
>
> m_skill_group = group m_skills_fi
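Subir's count-per-group-then-sum idea could be sketched like this; the relation and field names are guessed from the truncated snippet above, not taken from the actual script:

```pig
-- Per-group counts run at full parallelism (and COUNT is algebraic, so
-- combiners do most of the work map-side); only the tiny sum of the
-- per-group counts goes through the single 'group all' reducer.
m_skill_group = GROUP m_skills_filter BY member_id;
per_group     = FOREACH m_skill_group GENERATE group, COUNT(m_skills_filter) AS cnt;
all_grp       = GROUP per_group ALL;
total         = FOREACH all_grp GENERATE SUM(per_group.cnt);

-- Note: COUNT skips tuples whose first field is null; COUNT_STAR counts
-- every tuple, nulls included:
-- total_star = FOREACH all_grp GENERATE COUNT_STAR(per_group);
```

If the plain group-all count is hanging, this shape also helps narrow down whether the slowness is really in the count or somewhere upstream.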
The code you posted should be performant. A group all -> count is quite
fast, so my guess is that there is something else going on. Can you paste
your whole script?
2012/7/2 Sheng Guo
> No. I try to figure out how many records (rows) in 'm_skill_group' table.
> (That limit statement actually is
No. I try to figure out how many records (rows) in 'm_skill_group' table.
(That limit statement actually is not necessary)
Thanks!
On Mon, Jul 2, 2012 at 1:20 PM, Jonathan Coveney wrote:
> Is your goal to have the 10 largest rows by member_id?
>
> 2012/7/2 Sheng Guo
>
> > Hi all,
> >
> > I us
Is your goal to have the 10 largest rows by member_id?
2012/7/2 Sheng Guo
> Hi all,
>
> I used to use the following pig script to do the counting of the records.
>
> m_skill_group = group m_skills_filter by member_id;
> grpd = group m_skill_group all;
> cnt = foreach grpd generate COUNT(m_skill_
IMHO, if you want this to be more generic, I would have it just take the
full line, and then parse it out. Why? Because what happens when you have
an indeterminate number of columns? That's my own personal opinion though.
As far as implementation, I would return a DataBag (because what you want
are
On Jul 2, 2012, at 5:57 AM, Ruslan Al-Fakikh wrote:
> Hey Alan,
>
> I am not familiar with Apache processes, so I could be wrong in my
> point 1, I am sorry.
I wasn't trying to say you were right or wrong, just trying to understand your
perspective.
> Basically my impressions was that Cloudera
Thanks for the suggestions.
@Jonathan Coveney:
input tuple : (id1,column1,column2)
output : two tuples (id1,column1) and (id2,column2) so it is List<Tuple>
or should I return a Bag?
public class SPLITTUPPLE extends EvalFunc<List<Tuple>>
{
    public List<Tuple> exec(Tuple input) throws IOException {
        if (input
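Following Jonathan's advice to return a DataBag instead of a List, the UDF could be sketched roughly as below; the class name and the "first field is the id, emit one (id, column) pair per remaining field" behavior are assumptions based on the example in this thread:

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SplitTuple extends EvalFunc<DataBag> {
    private static final TupleFactory tf = TupleFactory.getInstance();
    private static final BagFactory bf = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null;
        }
        DataBag out = bf.newDefaultBag();
        Object id = input.get(0);
        // One (id, column) tuple per remaining field, however many there are
        for (int i = 1; i < input.size(); i++) {
            Tuple t = tf.newTuple(2);
            t.set(0, id);
            t.set(1, input.get(i));
            out.add(t);
        }
        return out;
    }
}
```

Returning a bag also makes it natural to FLATTEN the result in the calling script, which turns each bag element back into its own row.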
"It would give me the list of datasets in one place accessible from all
tools,"
And that's exactly why you want it.
D
On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh wrote:
> Hey Alan,
>
> I am not familiar with Apache processes, so I could be wrong in my
> point 1, I am sorry.
> Basically my
You can probably hack together something that will do exactly this without
writing a UDF, but I think a UDF will be most useful here...especially if
you want to add more columns, etc etc.
2012/7/1 Subir S
> Would FLATTEN help?
>
> B = GROUP A by ID;
>
> C = FOREACH B GENERATE group, FLATTEN ($1)
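To make the FLATTEN suggestion above concrete, here is what it does on a tiny invented dataset (the data and schema are illustrative only):

```pig
-- input rows: (1,a), (1,b), (2,c)
A = LOAD 'input' AS (id:int, col:chararray);
B = GROUP A BY id;
-- B: (1, {(1,a),(1,b)})  and  (2, {(2,c)})
C = FOREACH B GENERATE group, FLATTEN($1);
-- C: (1,1,a), (1,1,b), (2,2,c)
-- FLATTEN unnests the bag: one output row per bag element,
-- with the group key prepended
```

So FLATTEN gets you back to row-per-element form after a GROUP, but note the original id field is repeated next to the group key, which is why a UDF can still be the cleaner tool when the output shape matters.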
We can simply generate the pom dynamically as we already do with the
ivy.xml file.
Cheers,
--
Gianmarco
On Mon, Jul 2, 2012 at 3:58 AM, Dmitriy Ryaboy wrote:
> Yep, you are right, the pom is not generated, but checked in
> statically; looks like it's out of date. One more reason to mavenize
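Generating the pom from ivy.xml can be done with Ivy's makepom Ant task; a sketch, where the target name, paths, and configuration mappings are placeholders rather than Pig's actual build.xml:

```xml
<!-- Sketch: generate pom.xml from the existing ivy.xml at build time. -->
<target name="make-pom" depends="ivy-init">
  <ivy:makepom ivyfile="${basedir}/ivy.xml"
               pomfile="${basedir}/pom.xml">
    <!-- map Ivy configurations onto Maven scopes -->
    <mapping conf="default" scope="compile"/>
    <mapping conf="test" scope="test"/>
  </ivy:makepom>
</target>
```

Regenerating the pom this way on every build would keep it from drifting out of date the way the checked-in copy did.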
Hey Alan,
I am not familiar with Apache processes, so I could be wrong in my
point 1, I am sorry.
Basically my impression was that Cloudera is pushing the Avro format for
intercommunication between Hadoop tools like Pig, Hive and MapReduce.
https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
http://w