Re: How is union() implemented? Need to implement column bind

2022-04-21 Thread Sean Owen
Not a max - all values are needed. pivot() if anything is much closer, but see the rest of this thread.
On Thu, Apr 21, 2022 at 1:19 AM Sonal Goyal wrote:
> Seems like an interesting problem to solve!
>
> If I have understood it correctly, you have 10114 files each with the
> structure
>
>

Re: How is union() implemented? Need to implement column bind

2022-04-21 Thread Sonal Goyal
Seems like an interesting problem to solve! If I have understood it correctly, you have 10114 files each with the structure
rowid, colA
r1, a
r2, b
r3, c
...5 million rows
if you union them, you will have
rowid, colA, colB
r1, a, null
r2, b, null
r3, c, null
r1, null, d
r2, null, e
r3,

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
Oh, Spark directly supports upserts (with the right data destination) and yeah you could do this as 1+ updates to a table without any pivoting, etc. It'd still end up being 10K+ single joins along the way but individual steps are simpler. It might actually be pretty efficient I/O wise as

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Davidson
at 5:34 PM
To: Andrew Davidson
Cc: Andrew Melo, Bjørn Jørgensen, "user @spark"
Subject: Re: How is union() implemented? Need to implement column bind
Wait, how is all that related to cbind -- very different from what's needed to insert. BigQuery is unrelated to MR or Spark. It

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
I know bigQuery use map reduce like spark.
>
> Kind regards
>
> Andy
>
> *From: *Sean Owen
> *Date: *Wednesday, April 20, 2022 at 2:31 PM
> *To: *Andrew Melo
> *Cc: *Andrew Davidson, Bjørn Jørgensen <
> bjornjorgen...@gmail.com>,

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Davidson
. Kind regards
Andy
From: Sean Owen
Date: Wednesday, April 20, 2022 at 2:31 PM
To: Andrew Melo
Cc: Andrew Davidson, Bjørn Jørgensen, "user @spark"
Subject: Re: How is union() implemented? Need to implement column bind
I don't think there's fundamental disapproval (it is implemented i

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
xample I wonder if running join something like
>>>>> BigQuery might work better? I do not know much about the implementation.
>>>>>
>>>>> No one tool will solve all problems. Once I get the matrix I think it
>

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Melo
No one tool will solve all problems. Once I get the matrix I think it
>>>> spark will work well for our need
>>>>
>>>> Kind regards
>>>>
>>>> Andy
>>>>
>>

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
r our need
>>>
>>> Kind regards
>>>
>>> Andy
>>>
>>> *From: *Sean Owen
>>> *Date: *Monday, April 18, 2022 at 6:58 PM
>>> *To: *Andrew D

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Bjørn Jørgensen
> >> Andy
>>
>> *From: *Sean Owen
>> *Date: *Monday, April 18, 2022 at 6:58 PM
>> *To: *Andrew Davidson
>> *Cc: *"user @spark"
>> *Subject: *Re: How is union() implemented? Need to implement column bind
>>

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
: *Sean Owen
> *Date: *Monday, April 18, 2022 at 6:58 PM
> *To: *Andrew Davidson
> *Cc: *"user @spark"
> *Subject: *Re: How is union() implemented? Need to implement column bind
>
> A join is the natural answer, but this is a 10114-way join, which probably
>

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Davidson
it spark will work well for our need
Kind regards
Andy
From: Sean Owen
Date: Monday, April 18, 2022 at 6:58 PM
To: Andrew Davidson
Cc: "user @spark"
Subject: Re: How is union() implemented? Need to implement column bind
A join is the natural answer, but this is a 10114-way join, whic

Re: How is union() implemented? Need to implement column bind

2022-04-18 Thread Sean Owen
A join is the natural answer, but this is a 10114-way join, which probably chokes readily just to even plan it, let alone all the shuffling and shuffling of huge data. You could tune your way out of it maybe, but not optimistic. It's just huge. You could go off-road and lower-level to take

How is union() implemented? Need to implement column bind

2022-04-18 Thread Andrew Davidson
Hi, I have a hard problem. I have 10114 column vectors, each in a separate file. Each file has 2 columns: the row id and a numeric value. The row ids are identical and in sort order. All the column vectors have the same number of rows. There are over 5 million rows. I need to combine them into a
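
For illustration only, here is a minimal sketch of the column bind described in the post above, written in plain Python rather than Spark. It relies on the two properties stated in the post (identical row ids, same sort order in every file), so the columns can be walked in lockstep without any join or shuffle. The function name column_bind and the small in-memory lists standing in for the files are assumptions made for this example, not anything from the thread.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel used to detect column vectors of unequal length


def column_bind(columns):
    """Yield [row_id, v1, v2, ...] by reading all column vectors in lockstep.

    `columns` is a sequence of iterables of (row_id, value) pairs, one per
    input file (hypothetical stand-ins for the 10114 files in the post).
    """
    for rows in zip_longest(*columns, fillvalue=_MISSING):
        if any(r is _MISSING for r in rows):
            raise ValueError("column vectors have different lengths")
        row_id = rows[0][0]
        # The approach is only valid if every file has the same id on this row.
        if any(r[0] != row_id for r in rows):
            raise ValueError(f"row ids out of step at {row_id!r}")
        yield [row_id] + [r[1] for r in rows]


# Tiny stand-ins for two of the column-vector files.
col_a = [("r1", 1.0), ("r2", 2.0), ("r3", 3.0)]
col_b = [("r1", 10.0), ("r2", 20.0), ("r3", 30.0)]

matrix = list(column_bind([col_a, col_b]))
```

Because each input is consumed as a stream, memory use per step is one row per file rather than a whole column. Opening all 10114 files at once may exceed the default open-file limit on some systems, so a real run would probably bind the files in batches and then bind the batch outputs the same way.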