ark wrote:
> Unfortunately, no. The settings are script-wide. Can you add an order-by
> before storing your output and set its parallel to a smaller number? That
> will force a reduce phase and combine small files. Of course, it will add
> extra MR jobs.
>
>
> On Sat, Nov 30, 2013
Is there a way in Pig to change this configuration
(pig.maxCombinedSplitSize) at different steps inside the *same* Pig script?
For example, when I am LOADing the data I want this value to be low so that
we use the block size effectively & many mappers get triggered. (Otherwise,
the job takes too long.)
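A minimal sketch of the ORDER BY workaround described above (relation and
path names are hypothetical):

    raw    = LOAD '/input' AS (key:chararray, val:chararray);
    -- ... transformations ...
    sorted = ORDER raw BY key PARALLEL 10;  -- forces a reduce phase; 10 reducers => ~10 output files
    STORE sorted INTO '/output';

The SET-style properties such as pig.maxCombinedSplitSize stay script-wide;
only the PARALLEL clause here controls the width of the final reduce.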
Hello,
Is there a way in Pig to go through a parent-child hierarchy? For example,
let's say I have the following data:
Child  Parent  Value
1      10
10     20
20     30
30     40      v30
40     50      v40
Now let's say I look up Child=10. It has no 'Value', so I go to its parent
(20), and keep walking up until I find one (here, v30 at Child=30).
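Pig has no recursive operator, so one hedged approach is a self-join per
level of the hierarchy, repeated to the maximum expected depth (or driven
from an embedded-Pig loop). A one-level sketch, assuming missing values load
as null and hypothetical paths:

    h  = LOAD '/hierarchy' AS (child:chararray, parent:chararray, value:chararray);
    h2 = FOREACH h GENERATE *;               -- Pig needs a second alias to self-join
    j  = JOIN h BY parent LEFT OUTER, h2 BY child;
    r1 = FOREACH j GENERATE h::child AS child,
                            h2::parent AS parent,   -- next hop up the tree
                            (h::value is not null ? h::value : h2::value) AS value;
    -- repeat the JOIN/FOREACH pair on r1 for each additional level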
For what it's worth, I agree, it's nicer to have coalesce than the
> > conditional operator.
> >
> > On Sep 4, 2013, at 8:50 AM, Something Something <
> mailinglist...@gmail.com>
> > wrote:
> >
> > > What if you have 10 fields?
> > >
> >
> Ruslan
>
>
> On Wed, Sep 4, 2013 at 9:50 AM, Something Something <
> mailinglist...@gmail.com> wrote:
>
> > Is there a UDF in Piggybank (or somewhere) that will mimic the functionality
> > of the COALESCE function in MySQL:
> >
> >
> >
> http://www.w3resource.com/mysql/comparision-functions-and-operators/coalesce-function.php
Is there a UDF in Piggybank (or somewhere) that will mimic the functionality of
the COALESCE function in MySQL:
http://www.w3resource.com/mysql/comparision-functions-and-operators/coalesce-function.php
I know it will be very (very) easy to write this, but just don't want to
create one if one already exists.
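For reference, this is roughly what the bincond (conditional-operator)
version looks like for three fields, with hypothetical field names; it is
exactly the form that gets unwieldy past a handful of fields:

    data = LOAD '/input' AS (f1:chararray, f2:chararray, f3:chararray);
    out  = FOREACH data GENERATE
           (f1 is not null ? f1 : (f2 is not null ? f2 : f3)) AS first_non_null;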
Does this calculation look right?
On Wed, Jul 31, 2013 at 10:28 AM, John Meagher wrote:
> It is file size based, not file count based. For fewer files, increase the
> max-file-blocks setting.
>
> On Wed, Jul 31, 2013 at 12:21 PM, Something Something
> wrote:
> > Thanks, John. But
https://github.com/edwardcapriolo/filecrush
>
> On Wed, Jul 31, 2013 at 2:40 AM, Something Something
> wrote:
> > Each bz2 file after merging is about 50Megs. The reducers take about 9
> > minutes.
> >
> > Note: 'getmerge' is not an option. There isn't enough space.
On Jul 30, 2013 at 10:34 PM, Ben Juhn wrote:
> How big are your 50 files? How long are the reducers taking?
>
> On Jul 30, 2013, at 10:26 PM, Something Something <
> mailinglist...@gmail.com> wrote:
>
> > Hello,
> >
> > One of our pig scripts creates over 500 small part files.
Hello,
One of our pig scripts creates over 500 small part files. To save on
namespace, we need to cut down the # of files, so instead of saving 500
small files we need to merge them into 50. We tried the following:
1) When we set the parallel number to 50, the Pig script takes a long time -
for obvious reasons.
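One hedged alternative to a wide reduce is a small follow-up Pig pass that
leans on split combination (on by default via pig.splitCombination in recent
Pig versions; paths and the size below are illustrative):

    SET pig.maxCombinedSplitSize 536870912;  -- ~512 MB per combined split; tune to taste
    parts = LOAD '/output/part-*' AS (line:chararray);
    STORE parts INTO '/output-merged';       -- map-only job, roughly one file per combined split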
Hive since you
> > > have SQL-like syntax. (Haven't used Hive, but it looks like this type of
> > > thing would be far more natural in Hive.)
> > >
> > >
> > > On Thu, Jul 18, 2013 at 12:09 PM, Something Something <
> > > mailinglist...@gmail.com> wrote:
On Thu, Jul 18, 2013 at 8:25 AM, Pradeep Gollakota wrote:
> Looks like this might be macroable. Not entirely sure how that can be done
> yet... but I'd look into that if I were you.
>
>
> On Thu, Jul 18, 2013 at 11:16 AM, Something Something <
> mailinglist...@gmail.com> wrote:
>
> On Thu, Jul 18, 2013 at 8:44 AM, Something Something <
> mailinglist...@gmail.com> wrote:
>
> > There must be a better way to do this in Pig. Here's how my script looks
> > right now: (I omitted some snippets to save space, but you will get
> >
There must be a better way to do this in Pig. Here's how my script looks
right now (I omitted some snippets to save space, but you will get
the idea):
FACT_TABLE = LOAD 'XYZ' AS (col1:chararray, ... col30:chararray);
FACT_TABLE1 = FOREACH FACT_TABLE GENERATE col1, udf1(col2) AS col2, ...
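If your Pig version supports macros (0.9+), a hedged sketch of the macro
approach Pradeep mentions above, reusing the column and UDF names from this
snippet (the macro itself is hypothetical):

    DEFINE clean(rel) RETURNS out {
        $out = FOREACH $rel GENERATE col1, udf1(col2) AS col2, udf1(col3) AS col3;
    };
    FACT_TABLE1 = clean(FACT_TABLE);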
Hello,
I am running a Pig script which internally starts several jobs. For one of
the jobs that uses the maximum no. of mappers & reducers, I need to find out
how many machines it's running on & which machines those are.
I looked around the JobTracker UI, but couldn't find this information. Is it
there somewhere?
Hello,
Our data is skewed, so we are using a 'skewed' join, but the 'Join'
operation is still taking a long time. From the documentation, it appears Pig
samples data & creates a file that is passed using 'pig.keyDistFile'
config. It also appears that for our data this sample is a bit biased.
Our
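For reference, the skewed-join spelling in Pig Latin; a minimal sketch with
hypothetical aliases (a skewed join takes exactly two inputs):

    J = JOIN big BY key, small BY key USING 'skewed' PARALLEL 100;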
> > CC: u...@hadoop.apache.org
> > To: user@pig.apache.org
>
> >
> > What process creates the data in HDFS? You should be able to set the
> block size there and avoid the copy.
> >
> > I would test the dfs.block.size on the copy and see if you get the
>
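If the data is written by a Pig job in the first place, one hedged option is
passing the Hadoop property through Pig's SET, which forwards it to the job
configuration (the value below is illustrative):

    SET dfs.block.size 268435456;  -- ~256 MB blocks for files this script writes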
pig.maxCombinedSplitSize
> to something around the block size
>
> David
>
> On Feb 11, 2013, at 1:24 PM, Something Something
> wrote:
>
> > Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
> > HBase. Adding 'hadoop' user group.
Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
HBase. Adding 'hadoop' user group.
On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
mailinglist...@gmail.com> wrote:
> Hello,
>
> We are running into performance issues with
to turn it off.
 */
public synchronized void setQuietMode(boolean quietmode) {
  this.quietmode = quietmode;
}
Can someone tell me how to force a call to this? Apologies in advance for my
dumbness.
On Wed, Mar 7, 2012 at 10:30 PM, Something Something <
mailinglist...@gmail.com> wrote:
> Thank you,
> Manish
> Sent from my BlackBerry, pls excuse typo
>
> -Original Message-
> From: Something Something
> Date: Wed, 7 Mar 2012 22:30:05
> To: ; ; >
> Reply-To: user@pig.apache.org
> Subject: org.apache.hadoop.conf.Configuration - error parsing conf file
Hello,
I am using: hadoop-0.20.2-cdh3u2, hbase-0.90.4-cdh3u3, pig-0.8.1-cdh3u3
I have successfully loaded data into HBase tables (implying my Hadoop &
HBase setup is good). I can look at the data using HBase shell.
Now I am trying to read data from HBase via a Pig script. My test script
looks like this:
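The script itself is cut off here; a minimal sketch of such a test script,
assuming a table 'mytable' with column family 'cf' (all names hypothetical,
and the option string varies a little across Pig versions):

    raw = LOAD 'hbase://mytable'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:col1 cf:col2', '-loadKey')
          AS (rowkey:bytearray, col1:chararray, col2:chararray);
    DUMP raw;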
machines by
> default). Pre-distributing sounds tedious and error-prone to me. What if
> you have different jobs that require different versions of the same
> dependency?
>
>
> HTH,
> Friso
>
>
>
>
>
> On 16 nov. 2011, at 15:42, Something Something wrote:
Thank you for your patience & help with our questions.
On Wed, Nov 16, 2011 at 6:29 AM, Something Something <
mailinglist...@gmail.com> wrote:
> Hmm... there must be a different way 'cause we don't need to do that to
> run Pig jobs.
>
>
> On Tue, Nov 15, 2011 at 10:58 PM, Daan
the machine once
> the job starts. Is that an option?
>
> Daan.
>
> On 16 Nov 2011, at 07:24, Something Something wrote:
>
> > Until now we were manually copying our Jars to all machines in a Hadoop
> > cluster. This worked while our cluster was small. Now our
> >
Until now we were manually copying our Jars to all machines in a Hadoop
cluster. This worked while our cluster was small. Now our
cluster is getting bigger. What's the best way to start a Hadoop Job that
automatically distributes the Jar to all machines in a cluster?
I read the doc a
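As noted above, Pig jobs don't need the manual copying because REGISTER
ships the jar with the job; a hedged sketch (paths and the UDF name are
hypothetical):

    REGISTER myudfs.jar;
    data = LOAD '/input' AS (line:chararray);
    out  = FOREACH data GENERATE com.example.MyUdf(line);
    STORE out INTO '/output';

For plain MapReduce jobs, the usual equivalents are the job jar itself plus
-libjars or the distributed cache.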
Is there a limit on:
1) How long the $FILES string can be?
2) Total # of input paths to process?
when I do this in my Pig script...
LOAD '$FILES'
AS (xyz:chararray, abc:int);
org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to
process : 4430
Thanks.
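For reference, a parameter-substitution sketch (paths hypothetical). LOAD
accepts globs and comma-separated path lists, so in practice the ceiling
tends to be client-side memory and shell command-line length rather than a
fixed Pig limit:

    -- invoked as, e.g.:  pig -param FILES='/logs/2013/*/part-*' myscript.pig
    raw = LOAD '$FILES' AS (xyz:chararray, abc:int);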
> reduce plans. There are no distinct reduce tasks for each group operation.
>
> -Thejas
>
>
>
> On 10/3/11 9:35 PM, Something Something wrote:
>
>> Let me ask the question differently. Let's say I was not using Pig. I
>> wanted to do this just using Java MapReduce.
you're going to load and scan the data twice.
>
> However, as in your case, if you instead combine the load, then you'd have
>
> a = load 'thing';
> {..stuff using a..}
> {..stuff using a (which previously used b)..}
>
> Now it will just scan a once, and then go
I have 3 Pig scripts that load data from the same log file, but filter &
group this data differently. If I combine these 3 into one & LOAD only
once, performance seems to have improved, but now I am curious: what exactly
does LOAD do?
How does LOAD work internally? Does Pig save the results of the LOAD?
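A hedged illustration of the combined-script shape, with hypothetical names;
because both STOREs trace back to one LOAD, Pig's multi-query execution
reads the input once and splits the pipeline:

    logs   = LOAD '/logs/day1' AS (user:chararray, action:chararray);
    clicks = FILTER logs BY action == 'click';
    views  = FILTER logs BY action == 'view';
    gc = GROUP clicks BY user;
    gv = GROUP views  BY user;
    cc = FOREACH gc GENERATE group, COUNT(clicks);
    cv = FOREACH gv GENERATE group, COUNT(views);
    STORE cc INTO '/out/clicks_per_user';
    STORE cv INTO '/out/views_per_user';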