check bag joins in http://datafu.incubator.apache.org/docs/datafu/guide/bag-operations.html
you could bag the contents of the 5 or more columns and then join ... ??

*Cheers !!*
Arvind

On Tue, Jul 28, 2015 at 12:14 PM, paul green <[email protected]> wrote:

> Hi
>
> Thank you for your suggestion. I had thought of using the UNION function,
> but if there was a more efficient way to do it, it would be a great
> feature.
>
> Two joins and a union would be okay for two columns but would be less
> efficient if I wanted to check against more columns, for example to see
> if any value from a column in dataset 1 was in columns 2,3,4,5,6 of
> dataset 2.
>
> The only way I could see of doing it would be to do 5 joins and then a
> union. This just feels like a bad way to do a lookup across many columns
> for a single column.
>
> Thanks in advance.
> Paul
> ________________________________
> From: Arvind S<mailto:[email protected]>
> Sent: 28/07/2015 05:04
> To: [email protected]<mailto:[email protected]>
> Subject: Re: eqijoin 1 field in dataset to 2 fields in another datasets
> using OR
>
> Suggestion : you can create a join for each column individually ..and then
> union the result.. ??
>
> http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#UNION
>
> *Cheers !!*
> Arvind
>
> On Tue, Jul 28, 2015 at 1:30 AM, paul green <[email protected]> wrote:
>
> > Hello
> >
> > I use Pig at home (currently version 0.13.0) regularly on data sets
> > that vary between tens of megabytes and tens of gigabytes. I wanted to
> > be able to join two data sets together (ideally filtering).
> > The main problem I am having, and have not found an easy solution
> > for, is: I want to join data set 1 to data set 2 like below.
> >
> > data1.txt
> > id, name, job
> > 0001,john, manager
> > 0002,phil, deputy
> >
> > data2.txt
> > id1, id2, id3, label
> > 0001,0002,0001,useful
> > 0005,0001,0001,useful
> > 0000,0010,0009,not useful
> >
> > Code Proposal
> >
> > datasetA = LOAD 'data1.txt' USING PigStorage(',') AS (fieldA1, fieldA2, fieldA3);
> > datasetB = LOAD 'data2.txt' USING PigStorage(',') AS (fieldB1, fieldB2, fieldB3, fieldB4);
> > joined = JOIN datasetA BY fieldA1, datasetB BY (fieldB1 OR fieldB2 OR fieldB3);
> > DUMP joined;
> >
> > So essentially I want to join 1 column to n columns in the second data
> > set where they are equal. I am not after a partial join but an exact
> > join. Is there a feature already in the language to do this? If not,
> > would it be possible to request such a feature?
> >
> > Thanks.
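
For what it's worth, here is a rough sketch of the "bag the columns and then join" idea applied to the sample data quoted above. It uses Pig's built-in TOBAG and FLATTEN rather than the DataFu bag joins, so treat it as one possible reading of the suggestion, not the DataFu approach itself. The alias names (keyedB, dedupedB, lookup_id) and the chararray types are assumptions made for illustration, and it assumes the files hold only data rows, with no header line.

```pig
-- Load both data sets; the chararray types are an assumption, the
-- original proposal left the fields untyped.
datasetA = LOAD 'data1.txt' USING PigStorage(',')
           AS (fieldA1:chararray, fieldA2:chararray, fieldA3:chararray);
datasetB = LOAD 'data2.txt' USING PigStorage(',')
           AS (fieldB1:chararray, fieldB2:chararray, fieldB3:chararray, fieldB4:chararray);

-- Put the candidate columns into a bag and flatten it, so each row of
-- datasetB is emitted once per candidate column with that value as lookup_id.
keyedB = FOREACH datasetB GENERATE
             FLATTEN(TOBAG(fieldB1, fieldB2, fieldB3)) AS lookup_id,
             fieldB1, fieldB2, fieldB3, fieldB4;

-- A row with the same value in two columns (e.g. 0001,0002,0001) would
-- otherwise match twice. Note that DISTINCT also collapses rows that are
-- genuinely duplicated in data2.txt.
dedupedB = DISTINCT keyedB;

-- A single equi-join now stands in for the "B1 OR B2 OR B3" condition.
joined = JOIN datasetA BY fieldA1, dedupedB BY lookup_id;
DUMP joined;
```

On the sample data this should pair 0001 (john) with the first two rows of data2.txt and 0002 (phil) with the first row only. Checking more columns just means adding them to the TOBAG call, so it stays one join instead of n joins plus a union, at the cost of multiplying the rows of datasetB by the number of columns checked before the join.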
