Oh, Spark directly supports upserts (with the right data destination), and
yeah, you could do this as 10,000+ updates to a table without any pivoting,
etc. It'd still end up being 10K+ single joins along the way, but the
individual steps are simpler. It might actually be pretty efficient I/O-wise,
as columnar formats would not rewrite any other data on a write like this.
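
For what it's worth, here is a rough sketch of what one of those per-file
upserts could look like if the destination is something like Delta Lake
(just one example of "the right data destination"; the table layout, paths,
and sample names below are made up). It assumes the target table is keyed by
gene_id and was created up front with one nullable column per sample:

    from pyspark.sql import SparkSession, functions as F
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()  # assumes Delta is configured

    # One input file -> one upsert; "sample_0042" is a hypothetical sample.
    updates = (spark.read.option("header", True)
               .csv("gs://my-bucket/quant/sample_0042.csv")  # gene_id, est_count
               .select("gene_id", F.col("est_count").cast("double")))

    target = DeltaTable.forPath(spark, "gs://my-bucket/wide_counts")

    (target.alias("t")
           .merge(updates.alias("u"), "t.gene_id = u.gene_id")
           .whenMatchedUpdate(set={"sample_0042": "u.est_count"})
           .whenNotMatchedInsert(values={"gene_id": "u.gene_id",
                                         "sample_0042": "u.est_count"})
           .execute())

Each MERGE is still effectively a join on gene_id, which is where the "10K+
single joins" comes from; you'd run one of these per input file.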

On Wed, Apr 20, 2022 at 8:09 PM Andrew Davidson <aedav...@ucsc.edu> wrote:

> Hi Sean
>
>
>
> My “insert” solution is a hack that might work, given we can easily spin up
> a single VM with a crazy amount of memory. I would prefer to see a
> distributed solution. It is just a matter of time before someone wants to
> create an even bigger table using cbind.
>
>
>
> I understand you probably already know a lot about traditional RDBMSs.
> Much of my post is background for others.
>
>
>
> I used to do some classic relational database work before tools like
> Hadoop, Spark, and NoSQL became available.
>
>
>
> The standard operations on a single table in a relational database are:
>
>
>
> Insert “row”. This is similar to Spark union. Typically primary keys in
> RDBMS tables are indexed to enable quick lookup, so insert is probably not
> one-for-one with union; the row may not simply be appended to the end of
> the table. (See the short sketch after this list.)
>
>
>
> Update a “row”
>
> Delete a “row”
>
> Select “rows where”
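> 
> (To make the union analogy concrete, a minimal Spark sketch; the DataFrame
> names are made up:)
> 
>     # In Spark an “insert” is just a union of DataFrames; nothing is
>     # indexed or kept sorted on write, unlike an RDBMS primary key.
>     combined = existing_df.unionByName(new_rows_df)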
>
>
>
> RDBMS servers enable row- and table-level locking. Data must always be in
> a consistent state: you must commit or abort your changes for them to
> persist and to release locks on the data. Locks are required because you
> have a single resource and many users requesting service simultaneously.
> This is very different from Spark.
>
>
>
> Storage and memory used to be really expensive, so people often tried to
> create normalized (“1st normal form”) schemas, i.e. no duplicate data, to
> reduce hardware cost. Normalized designs require you to use joins to get
> the data table you want. Joins are expensive, so designs often duplicated
> some data to improve performance by minimizing the number of joins
> required. Duplicate data makes maintaining consistency harder. There are
> other advantages to normalized data design and, as we are all aware in the
> big-data world, lots of disadvantages. The DBMS ran on a single big
> machine; join was not implemented as distributed map/reduce.
>
>
>
> So my idea is to use a traditional RDBMS server; my final table will have
> 5 million rows and 10,114 columns.
>
>    1. Read the column vector from each of the 10,114 data files
>    2. Insert the column vector as a row in the table
>       1. I read a file that has a single element on each line. All I need
>       to do is replace \n with ,
>    3. Now I have a table with 10,115 rows and 5 million columns
>    4. The row id (primary key) is the original file name
>    5. The columns are the row ids in the original column vectors
>    6. Now all I need to do is pivot this single table to get what I want.
>       This is the only join or map/reduce-like operation (a Spark sketch of
>       the same idea follows this list)
>    7. A table with 5 million rows and 10,114 columns
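> 
> A rough sketch of the distributed analogue of steps 1-6 in Spark (this is
> the distributed solution I said I would prefer, not what the RDBMS hack
> does). The file paths, sample names, and the assumption that line order
> stands in for the row id are all made up for illustration:
> 
>     from pyspark.sql import SparkSession, functions as F
> 
>     spark = SparkSession.builder.getOrCreate()
> 
>     # Hypothetical list of the 10,114 single-column files.
>     paths = ["gs://bucket/quant/sample_0001.txt",
>              "gs://bucket/quant/sample_0002.txt"]  # ... etc.
> 
>     def sample_name(p):
>         return p.rsplit("/", 1)[-1].rsplit(".", 1)[0]
> 
>     long_dfs = []
>     for p in paths:
>         # zipWithIndex keeps the line position, standing in for the row id.
>         rdd = spark.sparkContext.textFile(p).zipWithIndex()
>         df = (rdd.toDF(["value", "row_id"])
>                  .withColumn("sample", F.lit(sample_name(p))))
>         long_dfs.append(df)
> 
>     # One long (row_id, sample, value) table instead of 10,114 columns.
>     long_df = long_dfs[0]
>     for df in long_dfs[1:]:
>         long_df = long_df.unionByName(df)
> 
>     # The single pivot replaces the 10,114 joins:
>     # 5 million rows x 10,114 sample columns.
>     wide = (long_df.groupBy("row_id")
>                    .pivot("sample", [sample_name(p) for p in paths])
>                    .agg(F.first("value")))
> 
> In practice you would probably write the long table out (or union the files
> in batches) before pivoting, so the plan does not carry 10,114 unions.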
>
>
>
>
>
> My final table is about 220 GB. I know at Google I have quota for up to 2
> megamem machines. Each one has something like 1.4 TB of memory.
>
>
>
> Kind regards
>
>
>
> Andy
>
>
>
>
