Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Bjørn Jørgensen
https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet Change spark = sparknlp.start() to spark = sparknlp.start(spark32=True). On Tue, 19 Apr 2022 at 21:10, Bjørn Jørgensen wrote: > Yes, there are some that have that issue. > Please open a new issue at
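A minimal sketch of the suggested change, assuming a Spark NLP release in which start() accepts the spark32 flag for Spark 3.2 compatibility:

    import sparknlp

    # Start Spark NLP with the Spark 3.2-compatible packages instead of the
    # default start(), per the packages cheatsheet linked above.
    spark = sparknlp.start(spark32=True)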

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Bjørn Jørgensen
Yes, there are some that have that issue. Please open a new issue at https://github.com/JohnSnowLabs/spark-nlp/issues and they will help you. On Tue, 19 Apr 2022 at 20:33, Xavier Gervilla < xavier.gervi...@datapta.com> wrote: > Thank you for your advice, I had little knowledge of Spark NLP and

Re: Grouping and counting occurrences of specific column rows

2022-04-19 Thread marc nicole
I don't want to groupBy since I want the rows to stay separate for the subsequent transformations. But I do want to group (I am using partitionBy here) by many attributes while counting the frequency for each different group of records (with respect to the attributes mentioned first). On Tue, Apr 19
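A minimal PySpark sketch of keeping every row intact while attaching a per-group count via a window aggregate (the original code uses the Java Dataset API; qid1/qid2 are placeholder attribute names standing in for groupByQidAttributes):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("window-count-sketch").getOrCreate()

    # Toy data; qid1/qid2 stand in for the grouping attributes.
    df = spark.createDataFrame(
        [("a", 1, 10.0), ("a", 1, 20.0), ("b", 2, 30.0)],
        ["qid1", "qid2", "value"],
    )

    # A window partitioned by the grouping attributes attaches the per-group
    # frequency to every row without collapsing the rows themselves.
    w = Window.partitionBy("qid1", "qid2")
    with_counts = df.withColumn("group_count", F.count("*").over(w))
    with_counts.show()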

Re: Grouping and counting occurrences of specific column rows

2022-04-19 Thread Sean Owen
Just .groupBy(...).count()? On Tue, Apr 19, 2022 at 6:24 AM marc nicole wrote: > Hello guys, I want to group by certain column attributes (e.g., a List groupByQidAttributes) of a dataset (initDataset) and then count the occurrences of the associated grouped rows. How do I achieve that neatly?
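For reference, a self-contained PySpark version of that one-liner (the question itself uses the Java Dataset API; the column names here are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("groupby-count-sketch").getOrCreate()

    df = spark.createDataFrame(
        [("a", 1), ("a", 1), ("b", 2)],
        ["qid1", "qid2"],
    )

    # One row per distinct (qid1, qid2) combination, with its occurrence count.
    counts = df.groupBy("qid1", "qid2").count()
    counts.show()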

Re: RDD memory use question

2022-04-19 Thread Sean Owen
Don't collect() - that pulls all the data into memory. Use count(). On Tue, Apr 19, 2022 at 5:34 AM wilson wrote: > Hello, > Do you know why, for a big dataset, the general RDD job can complete but collect() fails with a memory overflow? > For instance, for a dataset which has xxx million
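A small illustration of the difference (toy-sized data; the original dataset is described only as having many millions of rows):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("count-vs-collect-sketch").getOrCreate()

    rdd = spark.sparkContext.parallelize(range(1_000_000))

    # count() runs on the executors and only the total comes back to the
    # driver, so it works even when the data does not fit in driver memory.
    n = rdd.count()

    # collect() materializes every element on the driver, which is what
    # overflows memory on a sufficiently large dataset.
    # rows = rdd.collect()  # avoid for big data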

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Jungtaek Lim
I have no context on ML, but your "streaming" query exposes the possibility of memory issues. flattenedNER.registerTempTable("df") querySelect = "SELECT col as entity, avg(sentiment) as sentiment, count(col) as count FROM df GROUP BY col" finalDF =
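A reconstruction of the quoted snippet as a self-contained sketch. A toy batch DataFrame stands in for flattenedNER, which in the original is a streaming DataFrame, and createOrReplaceTempView replaces the older registerTempTable alias:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ner-aggregation-sketch").getOrCreate()

    # Stand-in for flattenedNER: one row per extracted entity ("col")
    # with an associated sentiment score.
    flattenedNER = spark.createDataFrame(
        [("ACME", 0.8), ("ACME", 0.4), ("Spark", 0.9)],
        ["col", "sentiment"],
    )
    flattenedNER.createOrReplaceTempView("df")

    # The unbounded GROUP BY aggregation is the part of the streaming query
    # that keeps accumulating state and can lead to memory pressure.
    querySelect = (
        "SELECT col AS entity, avg(sentiment) AS sentiment, "
        "count(col) AS count FROM df GROUP BY col"
    )
    finalDF = spark.sql(querySelect)
    finalDF.show()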

Grouping and counting occurrences of specific column rows

2022-04-19 Thread marc nicole
Hello guys, I want to group by certain column attributes (e.g., a List groupByQidAttributes) of a dataset (initDataset) and then count the occurrences of the associated grouped rows. How do I achieve that neatly? I tried the following code: Dataset groupedRowsDF =
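A PySpark sketch of grouping by a dynamic list of attributes and counting occurrences per group (the original attempt uses the Java Dataset API with a List of attribute names; initDataset, the column names, and group_by_qid_attributes below are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("group-by-attribute-list").getOrCreate()

    init_dataset = spark.createDataFrame(
        [("a", 1), ("a", 1), ("b", 2)],
        ["qid1", "qid2"],
    )

    # The grouping attributes as a list, mirroring groupByQidAttributes.
    group_by_qid_attributes = ["qid1", "qid2"]

    # Unpack the list into groupBy and count occurrences per group.
    grouped_rows_df = init_dataset.groupBy(*group_by_qid_attributes).count()
    grouped_rows_df.show()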