Re: Changing pig.maxCombinedSplitSize dynamically in single run

2013-12-02 Thread Something Something
ark wrote: > Unfortunately, no. The settings are script-wide. Can you add an order-by > before storing your output and set its parallel to a smaller number? That > will force a reduce phase and combine small files. Of course, it will add > extra MR jobs. > > > On Sat, Nov 30, 2013
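
A minimal Pig Latin sketch of the workaround suggested above, with hypothetical alias and path names — the forced ORDER BY adds a reduce phase whose PARALLEL setting caps the number of output files:

    -- relation and path names are hypothetical
    results = ORDER my_output BY some_key PARALLEL 10;  -- 10 reducers -> 10 part files
    STORE results INTO '/output/combined' USING PigStorage();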

Changing pig.maxCombinedSplitSize dynamically in single run

2013-11-30 Thread Something Something
Is there a way in Pig to change this configuration (pig.maxCombinedSplitSize) at different steps inside the *same* Pig script? For example, when I am LOADing the data I want this value to be low so that we use the block size effectively & many mappers get triggered. (Otherwise, the job takes too l
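
Per the reply above, the setting is script-wide and cannot be changed between steps; a minimal sketch with a hypothetical path and value:

    -- applies to the whole script, not to an individual LOAD
    SET pig.maxCombinedSplitSize 67108864;  -- 64 MB
    data = LOAD '/input/logs' AS (f1:chararray, f2:int);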

Parent Child Relationships in Pig

2013-10-24 Thread Something Something
Hello, Is there a way in Pig to go thru a parent-child hierarchy? For example, let's say I've following data:

    Child  Parent  Value
    1      10
    10     20
    20     30
    30     40      v30
    40     50      v40

Now let's say, I look up Child=10, it has no 'Value', so I go to its
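
Pig Latin has no recursion, so a hierarchy like this is typically walked by unrolling one self-join per level. A hedged sketch of a single hop, assuming the schema above (all names hypothetical):

    -- load the same data under two aliases to allow a self-join
    e1 = LOAD '/hierarchy' AS (child:int, parent:int, value:chararray);
    e2 = LOAD '/hierarchy' AS (child:int, parent:int, value:chararray);
    hop = JOIN e1 BY parent, e2 BY child;     -- follow child -> parent
    one_level = FOREACH hop GENERATE
                e1::child AS child,
                (e1::value is not null ? e1::value : e2::value) AS value,
                e2::parent AS parent;         -- repeat the JOIN for deeper levels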

Re: COALESCE UDF?

2013-09-04 Thread Something Something
ts' worth I agree, it's nicer to have coalesce than the > > conditional operator. > > > > On Sep 4, 2013, at 8:50 AM, Something Something < > mailinglist...@gmail.com> > > wrote: > > > > > What if you've 10 fields? > > > > &

Re: COALESCE UDF?

2013-09-04 Thread Something Something
> Ruslan > > > On Wed, Sep 4, 2013 at 9:50 AM, Something Something < > mailinglist...@gmail.com> wrote: > > > Is there a UDF in Piggybank (or somewhere) that will mimic functionality > of > > the COALESCE function in MySQL: > > > > > > > htt

COALESCE UDF?

2013-09-03 Thread Something Something
Is there a UDF in Piggybank (or somewhere) that will mimic functionality of the COALESCE function in MySQL: http://www.w3resource.com/mysql/comparision-functions-and-operators/coalesce-function.php I know it will be very (very) easy to write this, but just don't want to create one if one already
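
Absent a stock UDF, the nested bincond (ternary) operator discussed in this thread covers the small-field-count case; a sketch with hypothetical field names:

    -- field names are hypothetical
    vals = LOAD '/input' AS (a:chararray, b:chararray, c:chararray);
    out  = FOREACH vals GENERATE
           (a is not null ? a : (b is not null ? b : c)) AS first_non_null;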

Re: Merging files

2013-07-31 Thread Something Something
234 Does this calculation look right? On Wed, Jul 31, 2013 at 10:28 AM, John Meagher wrote: > It is file size based, not file count based. For fewer files up the > max-file-blocks setting. > > On Wed, Jul 31, 2013 at 12:21 PM, Something Something > wrote: > > Thanks, John. But

Re: Merging files

2013-07-31 Thread Something Something
tps://github.com/edwardcapriolo/filecrush > > On Wed, Jul 31, 2013 at 2:40 AM, Something Something > wrote: > > Each bz2 file after merging is about 50Megs. The reducers take about 9 > > minutes. > > > > Note: 'getmerge' is not an option. There isn't enou

Re: Merging files

2013-07-30 Thread Something Something
Jul 30, 2013 at 10:34 PM, Ben Juhn wrote: > How big are your 50 files? How long are the reducers taking? > > On Jul 30, 2013, at 10:26 PM, Something Something < > mailinglist...@gmail.com> wrote: > > > Hello, > > > > One of our pig scripts creates over 500 s

Merging files

2013-07-30 Thread Something Something
Hello, One of our pig scripts creates over 500 small part files. To save on namespace, we need to cut down the # of files, so instead of saving 500 small files we need to merge them into 50. We tried the following: 1) When we set parallel number to 50, the Pig script takes a long time - for ob
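
The slowness of PARALLEL 50 on a final sort comes from the full ORDER BY. One swapped-in trick (not from this thread) is to bucket rows by a random key, which forces a reduce phase without a total order; names are hypothetical:

    data    = LOAD '/path/with/500/parts' AS (line:chararray);
    grouped = GROUP data BY ROUND(RANDOM() * 49) PARALLEL 50;  -- ~50 buckets
    merged  = FOREACH grouped GENERATE FLATTEN(data);          -- drop the key
    STORE merged INTO '/path/with/50/parts';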

Re: Getting dimension values for Facts

2013-07-19 Thread Something Something
Hive since > you > > > have SQL like syntax. (Haven't used Hive, but it looks like this type > of > > > thing would be far more natural in Hive) > > > > > > > > > On Thu, Jul 18, 2013 at 12:09 PM, Something Something < > > > maili

Re: Getting dimension values for Facts

2013-07-18 Thread Something Something
On Thu, Jul 18, 2013 at 8:25 AM, Pradeep Gollakota wrote: > Looks like this might be macroable. Not entirely sure how that can be done > yet... but I'd look into that if I were you. > > > On Thu, Jul 18, 2013 at 11:16 AM, Something Something < > mailinglist...@gm

Re: Getting dimension values for Facts

2013-07-18 Thread Something Something
> > On Thu, Jul 18, 2013 at 8:44 AM, Something Something < > mailinglist...@gmail.com> wrote: > > > There must be a better way to do this in Pig. Here's how my script looks > > like right now: (omitted some snippet for saving space, but you will get > >

Getting dimension values for Facts

2013-07-17 Thread Something Something
There must be a better way to do this in Pig. Here's how my script looks like right now: (omitted some snippet for saving space, but you will get the idea). FACT_TABLE = LOAD 'XYZ' as (col1:chararray, ... col30:chararray); FACT_TABLE1 = FOREACH FACT_TABLE GENERATE col1, udf1(col2) as col2, ...
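
For small dimension tables, a fragment-replicate join keeps each lookup map-side; a hedged sketch with hypothetical schemas (the replicated relation must be listed last and fit in memory):

    FACTS = LOAD '/facts' AS (dim_id:chararray, measure:long);
    DIM   = LOAD '/dim'   AS (id:chararray, name:chararray);
    J     = JOIN FACTS BY dim_id, DIM BY id USING 'replicated';  -- DIM held in memory
    OUT   = FOREACH J GENERATE DIM::name AS name, FACTS::measure AS measure;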

How many machines did my MR job used?

2013-06-20 Thread Something Something
Hello, I am running a Pig script which internally starts several jobs. For one of the jobs that uses maximum no. of mappers & reducers, I need to find out how many machines it's running on & which machines are those. I looked around the JobTracker UI, but couldn't find this information. Is it t

pig.keyDistFile

2013-05-22 Thread Something Something
Hello, Our data is skewed, so we are using a 'skewed' join but still the 'Join' operation is taking a long time. From the documentation, it appears Pig samples data & creates a file that is passed using 'pig.keyDistFile' config. It also appears that for our data this sample is a bit biased. Our
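
For reference, the skewed-join form in question, with hypothetical relation names; pig.skewedjoin.reduce.memusage influences how the sampled key distribution is turned into reducer allocations:

    -- skewed join takes exactly two relations; the sampled (skewed) one goes first
    SET pig.skewedjoin.reduce.memusage 0.3;  -- fraction of reducer heap the join may use
    J = JOIN big_skewed BY key, other BY key USING 'skewed' PARALLEL 40;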

Re: Loader for small files

2013-02-12 Thread Something Something
> > CC: u...@hadoop.apache.org > > To: user@pig.apache.org > > > > > What process creates the data in HDFS? You should be able to set the > block size there and avoid the copy. > > > > I would test the dfs.block.size on the copy and see if you get the >

Re: Loader for small files

2013-02-11 Thread Something Something
plitSize > to something around the block size > > David > > On Feb 11, 2013, at 1:24 PM, Something Something > wrote: > > > Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to > > HBase. Adding 'hadoop' user gr

Re: Loader for small files

2013-02-11 Thread Something Something
Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to HBase. Adding 'hadoop' user group. On Mon, Feb 11, 2013 at 10:22 AM, Something Something < mailinglist...@gmail.com> wrote: > Hello, > > We are running into performance issues with

Re: org.apache.hadoop.conf.Configuration - error parsing conf file

2012-03-08 Thread Something Something
to turn it off. */ public synchronized void setQuietMode(boolean quietmode) { this.quietmode = quietmode; } Can someone tell me how to force call to this? Apologies in advance for my dumbness. On Wed, Mar 7, 2012 at 10:30 PM, Something Something < mailinglist...@gmail.com> wrot

Re: org.apache.hadoop.conf.Configuration - error parsing conf file

2012-03-08 Thread Something Something
you, > Manish > Sent from my BlackBerry, pls excuse typo > > -Original Message- > From: Something Something > Date: Wed, 7 Mar 2012 22:30:05 > To: ; ; > > Reply-To: user@pig.apache.org > Subject: org.apache.hadoop.conf.Configuration - error parsing conf fil

org.apache.hadoop.conf.Configuration - error parsing conf file

2012-03-07 Thread Something Something
Hello, I am using: hadoop-0.20.2-cdh3u2, hbase-0.90.4-cdh3u3, pig-0.8.1-cdh3u3 I have successfully loaded data into HBase tables (implying my Hadoop & HBase setup is good). I can look at the data using HBase shell. Now I am trying to read data from HBase via a Pig Script. My test script looks
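
A minimal HBaseStorage read of the kind being attempted, with hypothetical table and column names (constructor options vary by Pig version; '-loadKey true' is the form in later docs):

    raw = LOAD 'hbase://my_table'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
              'cf:col1 cf:col2', '-loadKey true')
          AS (rowkey:bytearray, col1:chararray, col2:chararray);
    DUMP raw;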

Re: Distributing our jars to all machines in a cluster

2011-11-16 Thread Something Something
machines by > default). Pre-distributing sounds tedious and error prone to me. What if > you have different jobs that require different versions of the same > dependency? > > > HTH, > Friso > > > > > > On 16 nov. 2011, at 15:42, Something Something wro

Re: Distributing our jars to all machines in a cluster

2011-11-16 Thread Something Something
r patience & help with our questions. On Wed, Nov 16, 2011 at 6:29 AM, Something Something < mailinglist...@gmail.com> wrote: > Hmm... there must be a different way 'cause we don't need to do that to > run Pig jobs. > > > On Tue, Nov 15, 2011 at 10:58 PM, Daa

Re: Distributing our jars to all machines in a cluster

2011-11-16 Thread Something Something
the machine once > the job starts. Is that an option? > > Daan. > > On 16 Nov 2011, at 07:24, Something Something wrote: > > > Until now we were manually copying our Jars to all machines in a Hadoop > > cluster. This used to work until our cluster size was small. Now our &

Distributing our jars to all machines in a cluster

2011-11-15 Thread Something Something
Until now we were manually copying our Jars to all machines in a Hadoop cluster. This used to work until our cluster size was small. Now our cluster is getting bigger. What's the best way to start a Hadoop Job that automatically distributes the Jar to all machines in a cluster? I read the doc a
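
For Pig specifically, REGISTER is why no manual copying is needed: Pig ships registered jars to the task nodes with each job. A sketch with hypothetical names:

    REGISTER /local/path/my-udfs.jar;  -- shipped to the cluster with the job
    data = LOAD '/input' AS (s:chararray);
    out  = FOREACH data GENERATE com.example.MyUdf(s);  -- hypothetical UDF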

Is there a limit on # of files in input paths?

2011-10-18 Thread Something Something
Is there a limit on: 1) How long the $FILES string can be? 2) Total # of input paths to process? when I do this in my Pig script... LOAD '$FILES' AS (xyz:chararray, abc:int); org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 4430 Tha
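
If the length of the comma-separated $FILES string itself is the concern, a glob keeps the parameter short and lets Hadoop expand the paths; a sketch with hypothetical paths:

    -- e.g. pig -param FILES='/logs/2011-10-*/part-*' myscript.pig
    data = LOAD '$FILES' AS (xyz:chararray, abc:int);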

Re: LOAD once, use multiple times

2011-10-04 Thread Something Something
> reduce plans. There are no distinct reduce tasks for each group operation. > > -Thejas > > > > On 10/3/11 9:35 PM, Something Something wrote: > >> Let me ask the question differently. Let's say I was not using Pig. I >> wanted to do this just using Java MapR

Re: LOAD once, use multiple times

2011-10-03 Thread Something Something
're going to load and scan the data twice. > > However, as in your case, if you instead combine the load, then you'd have > > a = load 'thing'; > {..stuff using a..} > {..stuff using a (which previously used b)..} > > Now it will just scan a once, and then go

LOAD once, use multiple times

2011-10-03 Thread Something Something
I have 3 Pig scripts that load data from the same log file, but filter & group this data differently. If I combine these 3 into one & LOAD only once, performance seems to have improved, but now I am curious exactly what does LOAD do? How does LOAD work internally? Does Pig save results of the LO
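
As the reply above notes ("it will just scan a once"), this is Pig's multi-query execution: pipelines branched off the same alias share one scan of the input. A sketch with hypothetical names:

    a = LOAD '/logs' AS (url:chararray, hits:int);
    b = FILTER a BY hits > 100;
    c = GROUP a BY url;
    STORE b INTO '/out/busy_urls';  -- both STOREs are satisfied by a
    STORE c INTO '/out/by_url';     -- single pass over '/logs'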