Re: groupBy and store in parquet

2016-05-12 Thread Michal Vince
Hi Xinh, sorry for my late reply. It's slow for two reasons (at least to my knowledge): 1. lots of IO - writing as JSON, then reading it back and writing it again as Parquet; 2. because of the nested RDD I can't run the loop and filter by event_type in parallel - this applies to your solution (3rd…
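
(For context, the "loop" above is presumably a driver-side loop over event types, each iteration filtering the full DataFrame and writing one Parquet output. A rough sketch of that pattern, with hypothetical paths and an assumed existing sqlContext, follows; each iteration submits its own Spark job, so the jobs run one after another unless the loop is explicitly parallelized with futures or threads.)

    // `events` stands for the DataFrame of parsed events; here it is read back
    // from the temporary JSON dump mentioned above (path is a placeholder).
    val events = sqlContext.read.json("hdfs:///tmp/events-json")

    val eventTypes: Array[String] =
      events.select("event_type").distinct().collect().map(_.getString(0))

    eventTypes.foreach { et =>
      // Each pass filters the whole DataFrame for one event type and writes it.
      // The filter itself is distributed, but the writes happen sequentially.
      events.filter(events("event_type") === et)
            .write
            .mode("append")
            .parquet(s"hdfs:///data/events/event_type=$et")
    }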

Re: groupBy and store in parquet

2016-05-05 Thread Xinh Huynh
Hi Michal, Why is your solution so slow? Is it the file IO caused by storing to a temp file as JSON and then reading it back in and writing it out as Parquet? How are you getting "events" in the first place? Do you have the original Kafka messages as an RDD[String]? Then how about: 1. Start…
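
(The rest of the suggested steps are cut off in the archive. Reading the raw JSON strings straight into a DataFrame, rather than round-tripping through a temporary JSON file, might look roughly like the sketch below; `sc`, `sqlContext`, the sample records and the path are assumptions, not code from the thread.)

    import org.apache.spark.rdd.RDD

    // Assumed: `messages` is the RDD[String] of raw JSON events pulled from Kafka.
    // Two fake records stand in for the real payload here.
    val messages: RDD[String] = sc.parallelize(Seq(
      """{"event_type":"click","ts":"2016-05-04T10:00:00","user":"u1"}""",
      """{"event_type":"purchase","ts":"2016-05-04T10:00:01","amount":9.99}"""
    ))

    // Parse directly into a DataFrame (schema inferred) and write Parquet,
    // skipping the temporary JSON dump entirely.
    val df = sqlContext.read.json(messages)
    df.write.mode("append").parquet("hdfs:///data/events")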

Re: groupBy and store in parquet

2016-05-05 Thread Michal Vince
Hi Xinh, For (1) the biggest problem is the null columns, e.g. the DF will have ~1000 columns, so every partition of that DF will have ~1000 columns; a single partition can have 996 all-null columns, which is a big waste of space (in my case more than 80% on average). For (2) I can't really…
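
(A side note, not from the thread: one way to see, and possibly work around, that sparsity is to count non-null values per column within a single event_type slice and keep only the populated columns. `df`, the event type and the output path below are hypothetical.)

    import org.apache.spark.sql.functions.{col, count}

    // `df` is the wide (~1000 column) DataFrame; look at one event type only.
    val slice = df.filter(col("event_type") === "click")

    // count(col) ignores nulls, so this reveals which columns are populated.
    val exprs = slice.columns.map(c => count(col(c)).alias(c))
    val row   = slice.agg(exprs.head, exprs.tail: _*).first()

    val populated = slice.columns.zipWithIndex.collect {
      case (c, i) if row.getLong(i) > 0L => c
    }

    // Writing only the populated columns avoids storing the mostly-null ones,
    // at the cost of a different schema per event type.
    slice.select(populated.map(col): _*).write
      .mode("append")
      .parquet("hdfs:///data/events-by-type/click")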

Re: groupBy and store in parquet

2016-05-04 Thread Xinh Huynh
Hi Michal, For (1), would it be possible to partitionBy two columns to reduce the size? Something like partitionBy("event_type", "date"). For (2), is there a way to separate the different event types upstream, like on different Kafka topics, and then process them separately? Xinh
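
(A minimal sketch of suggestion (1), assuming the events are already in a DataFrame `df` with an `event_type` column and a timestamp column `ts` from which a date can be derived; the output path is a placeholder.)

    import org.apache.spark.sql.functions.{col, to_date}

    // Derive a date column from the assumed timestamp column `ts`, then let
    // Spark lay the files out as .../event_type=.../date=.../part-*.parquet
    val withDate = df.withColumn("date", to_date(col("ts")))

    withDate.write
      .mode("append")
      .partitionBy("event_type", "date")
      .parquet("hdfs:///data/events")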

groupBy and store in parquet

2016-05-04 Thread Michal Vince
Hi guys, I'm trying to store a Kafka stream with ~5k events/s as efficiently as possible in Parquet format on HDFS. I can't make any changes to Kafka (it belongs to a 3rd party). Events in Kafka are in JSON format, but the problem is that there are many different event types (from different subsystems…
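
(For readers landing on this thread, a rough sketch of the kind of pipeline being discussed, using the Spark 1.x direct Kafka stream. The broker, topic and output path are made up, error handling and offset management are omitted, and this is only an illustration of the setup, not the poster's actual code.)

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Consume JSON events from Kafka and append each micro-batch to Parquet on HDFS.
    val conf = new SparkConf().setAppName("events-to-parquet")
    val ssc  = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.map(_._2).foreachRDD { rdd =>                // keep only the JSON payload
      if (!rdd.isEmpty()) {
        val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
        val df = sqlContext.read.json(rdd)              // schema inferred per batch
        df.write.mode("append").parquet("hdfs:///data/events")
      }
    }

    ssc.start()
    ssc.awaitTermination()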