Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
Oh, Spark directly supports upserts (with the right data destination) and yeah you could do this as 1+ updates to a table without any pivoting, etc. It'd still end up being 10K+ single joins along the way but individual steps are simpler. It might actually be pretty efficient I/O wise as
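A minimal sketch of the upsert route Sean describes, assuming a Delta Lake table as the "right data destination"; the table name (counts_wide), key column (id), and per-sample column (sample_0001) are all hypothetical, and the wide table is assumed to be pre-seeded with the full key set:

# Hedged sketch: fold one narrow file at a time into a wide Delta table,
# upserting on a shared key instead of running one giant 10K-way join.
# All names (counts_wide, id, sample_0001) are illustrative only.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumes the target column was added up front, e.g.:
# spark.sql("ALTER TABLE counts_wide ADD COLUMNS (sample_0001 DOUBLE)")
wide = DeltaTable.forName(spark, "counts_wide")      # existing wide table
narrow = spark.read.parquet("sample_0001.parquet")   # columns: id, value

(wide.alias("w")
     .merge(narrow.alias("n"), "w.id = n.id")              # upsert on the key
     .whenMatchedUpdate(set={"sample_0001": "n.value"})    # fill one column
     .execute())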

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Davidson
Hi Sean. My “insert” solution is a hack that might work given we can easily spin up a single VM with a crazy amount of memory. I would prefer to see a distributed solution. It is just a matter of time before someone wants to create an even bigger table using cbind. I understand you probably

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
Wait, how is all that related to cbind -- very different from what's needed to insert. BigQuery is unrelated to MR or Spark. It is, however, a SQL engine -- but can you express this in SQL without joins? I'm just guessing that joining 10K+ tables is hard anywhere.

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Davidson
I was thinking about something like BigQuery a little more. I do not know how it is implemented. However, I believe traditional relational databases are row-oriented and typically run on a single machine. You can lock at the row level. This leads me to speculate that row-level inserts may be more

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
I don't think there's fundamental disapproval (it is implemented in sparklyr), just a question of how you make this work at scale in general. It's not a super natural operation in this context but can be done. If you find a successful solution at extremes then maybe it generalizes.

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Melo
It would certainly be useful for our domain to have some sort of native cbind(). Is there a fundamental disapproval of adding that functionality, or is it just a matter of nobody implementing it?

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
Good lead, pandas on Spark concat() is worth trying. It looks like it uses a join, but I'm not 100% sure from the source. The SQL concat() function is indeed a different thing.

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Bjørn Jørgensen
Sorry for asking, but why doesn't concat work? Pandas on Spark has ps.concat, which takes two dataframes and concatenates them into one dataframe. It seems
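A minimal runnable sketch of Bjørn's suggestion (the frames and column names are made up):

import pyspark.pandas as ps

# Allow combining frames that come from different sources; some versions
# require this option for cross-frame operations.
ps.set_option("compute.ops_on_diff_frames", True)

df1 = ps.DataFrame({"a": [1, 2, 3]})
df2 = ps.DataFrame({"b": [4, 5, 6]})

# axis=1 is the cbind direction: the frames are aligned on their index,
# which pandas-on-Spark resolves with a join under the hood.
wide = ps.concat([df1, df2], axis=1)
print(wide)   # columns a and b side by side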

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Sean Owen
cbind? Yeah, though the answer is typically a join. I don't know if there's a better option in a SQL engine, as SQL doesn't have anything to offer except join and pivot either (right?). Certainly, the dominant data storage paradigm is wide tables, whereas you're starting with effectively a huge
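A sketch of the cbind-as-join idea for two plain Spark DataFrames, assuming equal length and matching row order; the row index is manufactured with zipWithIndex, since monotonically_increasing_id() would not produce matching ids across two frames:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def with_row_index(df):
    # Attach a stable 0-based row index via the RDD API so the two
    # frames can be equi-joined positionally.
    return (df.rdd.zipWithIndex()
              .map(lambda pair: Row(idx=pair[1], **pair[0].asDict()))
              .toDF())

left = with_row_index(spark.createDataFrame([(1,), (2,)], ["a"]))
right = with_row_index(spark.createDataFrame([(10,), (20,)], ["b"]))

cbound = left.join(right, "idx").drop("idx")   # the "column bind"
cbound.show()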

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Davidson
Thanks Sean. I imagine this is a fairly common problem in data science. Any idea how others solve it? For example, I wonder if running the join on something like BigQuery might work better? I do not know much about the implementation. No one tool will solve all problems. Once I get the matrix I think it

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-20 Thread Xavier Gervilla
Thank you for the flatten function; it offers more functionality than I need for my project, but the examples (which were really, really useful) helped me find a solution. Instead of accessing the confidence and entity attributes (metadata.confidence and metadata.entity) I was accessing
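For readers hitting the same problem, a small self-contained sketch of reading metadata.confidence and metadata.entity out of an array-of-annotations column; the schema below is an assumption for illustration, not Xavier's actual one:

from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for the NER output: an array of annotation structs whose
# metadata field is a map holding confidence and entity (assumed schema).
df = spark.createDataFrame([Row(entities=[
    Row(result="Madrid", metadata={"confidence": "0.99", "entity": "LOC"})])])

df.select(
    F.expr("transform(entities, e -> e.metadata['confidence'])").alias("confidence"),
    F.expr("transform(entities, e -> e.metadata['entity'])").alias("entity"),
).show(truncate=False)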

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-20 Thread Bjørn Jørgensen
Glad to hear that it works :) Your dataframe is nested with map, array, and struct types. I'm using this function to flatten a nested dataframe to rows and columns.

from pyspark.sql.types import *
from pyspark.sql.functions import *

def flatten_test(df, sep="_"):
    """Returns a flattened
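The archive cuts the function off; below is a hedged reconstruction of the general pattern rather than Bjørn's exact code: star-expand structs, explode arrays, and split maps into key/value columns until no complex types remain.

from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, MapType, StructType

def flatten_df(df: DataFrame, sep: str = "_") -> DataFrame:
    """Repeatedly expand the first complex column until none remain."""
    while True:
        complex_cols = [(f.name, f.dataType) for f in df.schema.fields
                        if isinstance(f.dataType, (ArrayType, MapType, StructType))]
        if not complex_cols:
            return df
        name, dtype = complex_cols[0]
        if isinstance(dtype, StructType):
            # Star-expand the struct into prefixed top-level columns.
            expanded = [F.col(f"{name}.{c.name}").alias(f"{name}{sep}{c.name}")
                        for c in dtype.fields]
            df = df.select("*", *expanded).drop(name)
        elif isinstance(dtype, ArrayType):
            # One row per array element; inner structs get the next pass.
            df = df.withColumn(name, F.explode_outer(name))
        else:
            # MapType: one row per entry, key and value as new columns.
            df = df.select("*", F.explode_outer(name).alias(
                f"{name}{sep}key", f"{name}{sep}value")).drop(name)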