How to merge multiple rows

2018-08-22 Thread msbreuer
A dataframe with the following contents is given:
  ID  PART  DETAILS
  1   1     A1
  1   2     A2
  1   3     A3
  2   1     B1
  3   1     C1
The target format should be as follows:
  ID  DETAILS
  1   A1+A2+A3
  2   B1
  3   C1
Note that the order of A1-3 is important. Currently I am using this alternative: ID DETAIL_1 DETAIL_2
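The merge described in this snippet can be sketched in plain Python to pin down the intended semantics: group by ID, order each group by PART, and join the DETAILS values with "+". This is only an illustration, not the poster's code. The function name `merge_details`, the tuple layout, and the reading of the sample as three separate columns (ID, PART, DETAILS) are assumptions.

```python
from collections import defaultdict


def merge_details(rows, sep="+"):
    """Group (id, part, details) tuples by id and join details in part order.

    Hypothetical helper for illustration; the row layout is an assumption.
    """
    grouped = defaultdict(list)
    for row_id, part, details in rows:
        grouped[row_id].append((part, details))
    # Sort each group by PART so the joined string keeps the required order,
    # regardless of the order in which the rows arrived.
    return {
        row_id: sep.join(d for _, d in sorted(parts, key=lambda p: p[0]))
        for row_id, parts in sorted(grouped.items())
    }


# Sample rows from the snippet, deliberately shuffled.
rows = [(1, 2, "A2"), (1, 1, "A1"), (2, 1, "B1"), (1, 3, "A3"), (3, 1, "C1")]
merged = merge_details(rows)
for row_id, details in merged.items():
    print(row_id, details)
```

In Spark itself, one possible approach (untested here) is to collect (PART, DETAILS) structs per ID with `collect_list`, order them with `sort_array`, and join the DETAILS values with `concat_ws`. The explicit sort matters because the order of collected rows is not guaranteed after a shuffle.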

Spark Memory Requirement

2018-08-01 Thread msbreuer
Many threads discuss memory requirements, and the most common answer is to add more memory to Spark. My understanding is that Spark is a scalable analytics engine that can utilize the assigned resources and calculate the correct answer. So assigning cores and memory may speed up a task. I am

sorting on dataframe causes out of memory (java heap space)

2018-07-30 Thread msbreuer
While working with larger datasets I run into out-of-memory issues. Basically, a Hadoop sequence file is read, its contents are sorted, and a Hadoop map file is written back. The code works fine for workloads greater than 20 GB. Then I changed one column in my dataset to store a large object and size of