Re: How is union() implemented? Need to implement column bind

2022-04-18 Thread Sean Owen
A join is the natural answer, but this is a 10114-way join, which probably chokes readily just trying to plan it, let alone all the shuffling of huge data. You could maybe tune your way out of it, but I'm not optimistic. It's just huge. You could go off-road and lower-level to take

How is union() implemented? Need to implement column bind

2022-04-18 Thread Andrew Davidson
Hi, I have a hard problem. I have 10114 column vectors, each in a separate file. Each file has 2 columns: the row id and the numeric values. The row ids are identical and in sorted order. All the column vectors have the same number of rows. There are over 5 million rows. I need to combine them into a
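
Because the row ids are described as identical and already sorted in every file, a column bind does not strictly need a join: the files can be read in lockstep and pasted side by side. The following is only a minimal sketch of that idea in plain Python, not the approach proposed in the thread. The function name `column_bind`, the comma delimiter, and the absence of a header row are all assumptions, since the message does not describe the file format.

```python
import csv
from contextlib import ExitStack


def column_bind(paths, out_path, delimiter=","):
    """Paste the value column of each input file next to the shared row id.

    Assumes (hypothetically) that every file is delimited text with exactly
    two fields per line, row id then value, and no header row.
    """
    with ExitStack() as stack:
        # Open every input file; ExitStack closes them all on exit.
        readers = [
            csv.reader(stack.enter_context(open(p, newline="")), delimiter=delimiter)
            for p in paths
        ]
        out_file = stack.enter_context(open(out_path, "w", newline=""))
        writer = csv.writer(out_file, delimiter=delimiter)

        # zip() advances all readers one line at a time, so only one row
        # per input file is held in memory at any moment.
        for rows in zip(*readers):
            row_id = rows[0][0]
            # Guard against the files drifting out of alignment.
            if any(r[0] != row_id for r in rows):
                raise ValueError(f"row ids disagree at {row_id!r}")
            writer.writerow([row_id] + [r[1] for r in rows])
```

With 10114 inputs this would open more files at once than many systems allow by default, so in practice the inputs would likely have to be bound in batches (for example a few hundred files at a time) and the intermediate results bound again in a second pass.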

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-18 Thread Bjørn Jørgensen
When did SpaCy have support for Spark? Try Spark NLP, it's made for Spark. They have a lot of notebooks at https://github.com/JohnSnowLabs/spark-nlp and they publish user guides at

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-18 Thread Sean Owen
It looks good; are you sure it even starts? The problem I see is that you send a copy of the model from the driver for every task. Try broadcasting the model instead. I'm not sure if that resolves it, but it would be good practice. On Mon, Apr 18, 2022 at 9:10 AM Xavier Gervilla wrote: > Hi Team,
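
The broadcasting suggestion can be sketched roughly as follows. This is an illustrative outline, not the poster's code: the helper names `tag_partition` and `run`, the use of the RDD API rather than a streaming DataFrame, and the assumption that the model is a picklable callable taking a string are all hypothetical. Only `SparkContext.broadcast`, `parallelize`, `mapPartitions`, and `collect` are actual PySpark calls.

```python
def tag_partition(rows, model_broadcast):
    """Apply the broadcast model to every row of one partition."""
    # Read the model from the broadcast variable instead of capturing the
    # model object itself in the task closure.
    model = model_broadcast.value
    for text in rows:
        yield (text, model(text))


def run(spark, texts, model):
    """Hypothetical driver-side wiring; `spark` is an existing SparkSession."""
    sc = spark.sparkContext
    # Broadcast the model once; tasks then reference the broadcast handle.
    model_bc = sc.broadcast(model)
    rdd = sc.parallelize(texts)
    return rdd.mapPartitions(lambda rows: tag_partition(rows, model_bc)).collect()
```

The per-partition function only relies on the `.value` attribute of the broadcast handle, so it can be tried locally with any stand-in object before running it on a cluster.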

[Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-18 Thread Xavier Gervilla
Hi Team, https://stackoverflow.com/questions/71841814/is-there-a-way-to-prevent-excessive-ram-consumption-with-the-spark-configuration I'm developing a project that retrieves tweets on a 'host' app, streams them to Spark and with different operations with DataFrames obtains the Sentiment of

[Spark Web UI] Integrating Keycloak SSO

2022-04-18 Thread Solomon, Brad
As outlined at https://issues.apache.org/jira/browse/SPARK-38693 and https://stackoverflow.com/q/71667296/7954504, we are attempting to integrate Keycloak Single Sign On with the Spark Web UI. However, Spark errors