Not a max - all values are needed. pivot() if anything is much closer, but
see the rest of this thread.
On Thu, Apr 21, 2022 at 1:19 AM Sonal Goyal wrote:
Seems like an interesting problem to solve!
If I have understood it correctly, you have 10114 files each with the
structure
rowid, colA
r1, a
r2, b
r3, c
... 5 million rows

if you union them (aligning columns by name), you will have

rowid, colA, colB
r1, a, null
r2, b, null
r3, c, null
r1, null, d
r2, null, e
r3, null, f
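The union-then-pivot idea above can be sketched in plain Python with hypothetical data (a stand-in for the Spark version, which would union per-file DataFrames carrying a column-name field and then pivot, e.g. `groupBy("rowid").pivot("col")`):

```python
# Plain-Python sketch of the union-then-pivot reshaping (hypothetical data).
from collections import defaultdict

# Each file contributes (rowid, column_name, value) triples; the union of
# all files is one long-format list.
long_rows = [
    ("r1", "colA", "a"), ("r2", "colA", "b"), ("r3", "colA", "c"),
    ("r1", "colB", "d"), ("r2", "colB", "e"), ("r3", "colB", "f"),
]

# Pivot long -> wide: one dict per rowid, keyed by column name.
wide = defaultdict(dict)
for rowid, col, value in long_rows:
    wide[rowid][col] = value

print(wide["r1"])  # {'colA': 'a', 'colB': 'd'}
```

The long table here is 10114 x 5M rows, but the reshape itself is a single grouping pass rather than a 10114-way join.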
Oh, Spark directly supports upserts (with the right data destination) and
yeah you could do this as 1+ updates to a table without any pivoting,
etc. It'd still end up being 10K+ single joins along the way but individual
steps are simpler. It might actually be pretty efficient I/O wise as
at 5:34 PM
To: Andrew Davidson
Cc: Andrew Melo, Bjørn Jørgensen, "user @spark"
Subject: Re: How is union() implemented? Need to implement column bind
Wait, how is all that related to cbind -- very different from what's needed to
insert.
BigQuery is unrelated to MR or Spark. It
I know BigQuery uses map reduce like Spark.
Kind regards
Andy
From: Sean Owen
Date: Wednesday, April 20, 2022 at 2:31 PM
To: Andrew Melo
Cc: Andrew Davidson, Bjørn Jørgensen, "user @spark"
Subject: Re: How is union() implemented? Need to implement column bind
I don't think there's fundamental disapproval (it is implemented i
For example I wonder if running the join in something like BigQuery might work better? I do not know much about the implementation.
No one tool will solve all problems. Once I get the matrix I think Spark will work well for our need.
Kind regards
Andy
From: Sean Owen
Date: Monday, April 18, 2022 at 6:58 PM
To: Andrew Davidson
Cc: "user @spark"
Subject: Re: How is union() implemented? Need to implement column bind
A join is the natural answer, but this is a 10114-way join, which probably
chokes readily just to even plan it, let alone all the shuffling and
shuffling of huge data. You could tune your way out of it maybe, but not
optimistic. It's just huge.
You could go off-road and lower-level to take
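To see why the 10114-way join is so heavy, it helps to write it as a left-to-right reduce of pairwise joins on rowid. This plain-Python sketch (hypothetical data) mirrors that plan shape: with 10114 inputs it would be a 10113-step chain, each step shuffling the full accumulated table.

```python
# The "natural" join answer as a reduce of pairwise inner joins on rowid.
from functools import reduce

# Hypothetical per-file tables: rowid -> {column: value}
cols = [
    {"r1": {"colA": 1}, "r2": {"colA": 2}},
    {"r1": {"colB": 3}, "r2": {"colB": 4}},
]

def join(left, right):
    # Inner join on rowid: merge the column dicts for each shared key.
    return {k: {**left[k], **right[k]} for k in left.keys() & right.keys()}

matrix = reduce(join, cols)
print(matrix["r1"])  # {'colA': 1, 'colB': 3}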
Hi, I have a hard problem.
I have 10114 column vectors, each in a separate file. Each file has 2 columns: the row id and numeric values. The row ids are identical and in sort order. All the column vectors have the same number of rows. There are over 5 million rows. I need to combine them into a single matrix.