GroupBy issue while running K-Means - Dataframe

2020-06-16 Thread Deepak Sharma
Hi All, I have a custom implementation of K-Means that needs the data to be grouped by a key in a dataframe. There is a large data skew for some of the keys, where it exceeds the BufferHolder: Cannot grow BufferHolder by size 17112 because the size after growing exceeds size limitation

Spark dataframe creation through already distributed in-memory data sets

2020-06-16 Thread Tanveer Ahmad - EWI
Hi all, I am new to the Spark community. Please ignore this question if it doesn't make sense. My PySpark Dataframe takes only a fraction of the time (in ms) in 'Sorting', but moving the data is much more expensive (> 14 sec). Explanation: I have a huge Arrow RecordBatches collection which is equally

Check point storage and its redundancy

2020-06-16 Thread shensonj
Hello All, I am trying to enable checkpoint storage for my Spark Streaming application running in Kubernetes. What would be the best storage location for the checkpoint, and how should its redundancy be handled?