Wasn't your initial requirement different? You mentioned "seconddir" had a
different schema from "firstdir", in which case simply loading both
together and grouping by (A,B,C,D) will produce unexpected results.

If you can make sure both datasets have the same schema, then yes, THAT
would be better.
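
A minimal sketch of that approach, assuming both directories are
tab-delimited and share the exact field order A B C D E:

A = LOAD '/home/hadoop/{test/firstdir,test/seconddir}' USING PigStorage('\t')
        AS (A, B, C, D, E);
B = GROUP A BY (A, B, C, D);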

On Mon, Feb 13, 2012 at 11:54 AM, jagaran das <jagaran_...@yahoo.co.in> wrote:

> Thanks
>
> Best would be then
> A = Load '/home/hadoop/{test/firstdir,test/seconddir}' using
> PigStorage('\t') as (A,B,C,D);
> B = group A by (A,B,C,D);
>
> Ignore E while loading, and make sure both the first and the second
> directory have the fields in the same order A B C D.
>
> Thanks
> Jagaran
>   ------------------------------
> *From:* Prashant Kommireddi <prash1...@gmail.com>
> *To:* jagaran das <jagaran_...@yahoo.co.in>
> *Sent:* Monday, 13 February 2012 11:36 AM
>
> *Subject:* Re: Fw: Hadoop Cluster Question
>
> I can suggest a dirty hack for this (a cleaner variant is sketched after
> the steps below):
>
>    1. A = load 'firstdir' as (a,b,c,d,e);
>    2. B = load 'seconddir'; -- seconddir's physical order is A,C,D,E,B
>    3. C = foreach B generate $0 as a, $4 as b, $1 as c, $2 as d, $3 as e;
>    4. D = UNION A, C;
>    5. E = group D by (a,b,c,d);
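>
> If your Pig version supports it, a slightly cleaner variant (just a sketch,
> assuming both relations can be loaded with full named schemas) is to let
> UNION ONSCHEMA align the columns by name instead of by position:
>
>    A = load 'firstdir' as (a,b,c,d,e);
>    B = load 'seconddir' as (a,c,d,e,b); -- seconddir's physical order
>    C = UNION ONSCHEMA A, B;             -- fields are matched by name
>    D = group C by (a,b,c,d);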
>
> Thanks,
> Prashant
>
>
> On Mon, Feb 13, 2012 at 11:22 AM, jagaran das <jagaran_...@yahoo.co.in> wrote:
>
> Hi,
>
> I have a requirement in Pig where I have to read from two different
> directories, but the ordering of the fields is different.
>
> A = Load '/home/hadoop/{test/firstdir,test/seconddir}' using
> PigStorage('\t') as (A,B,C,D,E);
> B = group A by (A,B,C,D);
>
> Now firstdir has the fields in the order A B C D E, but the second dir has
> the data in the order A, C, D, E, B.
>
> Is there any way to read both, given that my group by clause contains
> (A,B,C,D)?
>
> Thanks
> Jagaran
>
>
>    ------------------------------
> *From:* Prashant Kommireddi <prash1...@gmail.com>
> *To:* user@pig.apache.org; jagaran das <jagaran_...@yahoo.co.in>
> *Sent:* Sunday, 12 February 2012 9:48 PM
> *Subject:* Re: Fw: Hadoop Cluster Question
>
>
>    1. Yes, the already-running job would fail.
>    2. Yes, any new job would fail until local disk space is made available.
>    3. If there are too many task failures on a particular node, that node
>    would be blacklisted.
>
> Is that slave node being more utilized due to a particular job, or is it
> just a general phenomenon?
> Take a look at the rebalancer documentation:
> http://hadoop.apache.org/common/docs/r0.20.2/hdfs_user_guide.html#Rebalancer
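>
> If it helps, the balancer can also be run from the command line; the 10%
> threshold below is just an example value:
>
>    # run as the HDFS superuser; moves blocks between datanodes until each
>    # node's utilization is within 10% of the cluster average
>    bin/hadoop balancer -threshold 10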
> Thanks,
> Prashant
>
> On Sun, Feb 12, 2012 at 9:36 PM, jagaran das <jagaran_...@yahoo.co.in> wrote:
>
>
>
>
> ----- Forwarded Message -----
> From: jagaran das <jagaran_...@yahoo.co.in>
> To: "common-u...@hadoop.apache.org" <common-u...@hadoop.apache.org>
> Sent: Sunday, 12 February 2012 9:33 PM
> Subject: Hadoop Cluster Question
>
>
> Hi,
> A. If one of the slave nodes' local disk space is full in a cluster:
>
> 1. Would an already-running Pig job fail?
> 2. Would any newly started Pig job fail?
> 3. How would the Hadoop cluster behave? Would that become a dead node?
>
> B. In our production cluster we are seeing that one of the slave nodes is
> being utilized more than the others. By utilization I mean the %DFS used is
> always higher on it. How can we balance it?
>
> Thanks,
> Jagaran
>
