Oh, Spark directly supports upserts (with the right data destination), and yeah, you could do this as 10,000+ updates to a table without any pivoting, etc. It'd still end up being 10K+ single joins along the way, but the individual steps are simpler. It might actually be pretty efficient I/O-wise, as columnar formats would not rewrite any other data on a write like this.
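Roughly, assuming Delta Lake as the upsert-capable destination (and that each per-sample file can be read with an explicit row id so rows line up), one merge per input file might look something like this. The paths, sample names, and (row_id, value) file layout below are made up for illustration:

    # Sketch of the "10K+ upserts" idea with Delta Lake MERGE; paths/names are hypothetical.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             # let MERGE append each new per-sample column to the target schema
             .config("spark.databricks.delta.schema.autoMerge.enabled", "true")
             .getOrCreate())

    sample_files = [f"gs://bucket/quant/sample_{i}.csv" for i in range(10114)]  # hypothetical

    def read_sample(path, name):
        # one narrow frame per file: (row_id, <sample name>)
        return (spark.read.csv(path, schema="row_id STRING, value DOUBLE")
                .withColumnRenamed("value", name))

    # Seed the table with the first column vector, then upsert the rest one at a time.
    read_sample(sample_files[0], "sample_0").write.format("delta").save("/data/combined")
    target = DeltaTable.forPath(spark, "/data/combined")

    for i, path in enumerate(sample_files[1:], start=1):
        src = read_sample(path, f"sample_{i}")
        (target.alias("t")
         .merge(src.alias("s"), "t.row_id = s.row_id")
         .whenMatchedUpdateAll()      # schema evolution adds the new sample_i column
         .whenNotMatchedInsertAll()
         .execute())

Each iteration is still a join keyed on row_id, which is where the "10K+ single joins" come from, but every step touches only one new column of data.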
On Wed, Apr 20, 2022 at 8:09 PM Andrew Davidson <aedav...@ucsc.edu> wrote:

> Hi Sean
>
> My "insert" solution is a hack that might work, given we can easily spin up a
> single VM with a crazy amount of memory. I would prefer to see a distributed
> solution. It is just a matter of time before someone wants to create an even
> bigger table using cbind.
>
> I understand you probably already know a lot about traditional RDBMSs. Much of
> my post is background for others.
>
> I used to do some classic relational database work before tools like Hadoop,
> Spark, and NoSQL became available.
>
> The standard operations on a single table in a relational database are:
>
> Insert "row". This is similar to Spark union. Typically primary keys in RDBMS
> tables are indexed to enable quick lookup, so an insert is probably not
> 1-for-1 with union. The row may not simply be appended to the end of the
> table.
>
> Update a "row"
>
> Delete a "row"
>
> Select "rows where"
>
> RDBMS servers enable row- and table-level locking. Data must always be in a
> consistent state. You must commit or abort your changes for them to persist
> and to release locks on the data. Locks are required because you have a single
> resource and many users requesting service simultaneously. This is very
> different from Spark.
>
> Storage and memory used to be really expensive, so often people tried to
> create "1st normal form" schemas, i.e. no duplicate data, to reduce hardware
> cost. A 1st-normal-form design requires you to use joins to get the data table
> you want. Joins are expensive. Often designs duplicated some data to improve
> performance by minimizing the number of joins required. Duplicate data makes
> maintaining consistency harder. There are other advantages to normalized data
> design and, as we are all aware in the big data world, lots of disadvantages.
> The DBMS ran on a single big machine. Join was not implemented as distributed
> map/reduce.
>
> So my idea is to use a traditional RDBMS server: my final table will have 5
> million rows and 10,114 columns.
>
> 1. Read the column vector from each of the 10,114 data files
> 2. Insert the column vector as a row in the table
>    1. I read a file that has a single element on each line. All I need to do
>       is replace \n with ,
> 3. Now I have a table with 10,115 rows and 5 million columns
> 4. The row id (primary key) is the original file name
> 5. The columns are the row ids in the original column vectors
> 6. Now all I need to do is pivot this single table to get what I want. This is
>    the only join or map/reduce like operation
> 7. A table with 5 million rows and 10,114 columns
>
> My final table is about 220 GB. I know at Google I have quota for up to 2
> megamem machines. Each one has something like 1.4 TB of memory.
>
> Kind regards
>
> Andy
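(For comparison with the staging-table-and-pivot plan quoted above, a distributed version of the same shape can be sketched in PySpark as a union of narrow frames followed by a pivot. The bucket paths, sample names, and the (row_id, value) file layout are assumptions, not details from the thread.)

    # Union all per-sample files into a long (row_id, sample, value) table, then pivot wide.
    from functools import reduce
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    sample_names = [f"sample_{i}" for i in range(10114)]                    # hypothetical
    paths = {n: f"gs://bucket/quant/{n}.csv" for n in sample_names}         # hypothetical

    def read_long(name):
        # one narrow frame per input file, tagged with its sample name
        return (spark.read.csv(paths[name], schema="row_id STRING, value DOUBLE")
                .withColumn("sample", F.lit(name)))

    long_df = reduce(lambda a, b: a.unionByName(b), (read_long(n) for n in sample_names))

    # Pivot back to one wide row per row_id with one column per sample.
    # Passing the column list explicitly avoids the spark.sql.pivotMaxValues limit (10,000).
    wide = (long_df.groupBy("row_id")
            .pivot("sample", sample_names)
            .agg(F.first("value")))

    wide.write.parquet("gs://bucket/combined_wide")   # ~5M rows x 10,114 columns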