Re: groupBy and store in parquet

2016-05-12 Thread Michal Vince
Hi Xinh, sorry for my late reply. It's slow for two reasons (at least to my knowledge): 1. lots of IO - writing as JSON, then reading it back and writing it again as Parquet; 2. because of the nested RDD I can't run the loop and filter by event_type in parallel - this applies to your solution (3rd…
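
(For context, the "loop" above is presumably a driver-side loop over event types, each iteration filtering the full DataFrame and writing one Parquet output. A rough sketch of that pattern, with hypothetical paths and an assumed existing sqlContext, follows; each iteration submits its own Spark job, so the jobs run one after another unless the loop is explicitly parallelized with futures or threads.)

    // `events` stands for the DataFrame of parsed events; here it is read back
    // from the temporary JSON dump mentioned above (path is a placeholder).
    val events = sqlContext.read.json("hdfs:///tmp/events-json")

    val eventTypes: Array[String] =
      events.select("event_type").distinct().collect().map(_.getString(0))

    eventTypes.foreach { et =>
      // Each pass filters the whole DataFrame for one event type and writes it.
      // The filter itself is distributed, but the writes happen sequentially.
      events.filter(events("event_type") === et)
            .write
            .mode("append")
            .parquet(s"hdfs:///data/events/event_type=$et")
    }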

Re: groupBy and store in parquet

2016-05-05 Thread Xinh Huynh
Hi Michal, Why is your solution so slow? Is it the file IO caused by storing to a temp file as JSON and then reading it back in and writing it out as Parquet? How are you getting "events" in the first place? Do you have the original Kafka messages as an RDD[String]? Then how about: 1. Start…
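
(The rest of the suggested steps are cut off in the archive. Reading the raw JSON strings straight into a DataFrame, rather than round-tripping through a temporary JSON file, might look roughly like the sketch below; `sc`, `sqlContext`, the sample records and the path are assumptions, not code from the thread.)

    import org.apache.spark.rdd.RDD

    // Assumed: `messages` is the RDD[String] of raw JSON events pulled from Kafka.
    // Two fake records stand in for the real payload here.
    val messages: RDD[String] = sc.parallelize(Seq(
      """{"event_type":"click","ts":"2016-05-04T10:00:00","user":"u1"}""",
      """{"event_type":"purchase","ts":"2016-05-04T10:00:01","amount":9.99}"""
    ))

    // Parse directly into a DataFrame (schema inferred) and write Parquet,
    // skipping the temporary JSON dump entirely.
    val df = sqlContext.read.json(messages)
    df.write.mode("append").parquet("hdfs:///data/events")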

Re: groupBy and store in parquet

2016-05-05 Thread Michal Vince
Hi Xinh, For (1) the biggest problem is the null columns, e.g. the DF will have ~1000 columns, so every partition of that DF will have ~1000 columns; a single partition can have 996 all-null columns, which is a big waste of space (in my case more than 80% on average). For (2) I can't really…
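
(A side note, not from the thread: one way to see, and possibly work around, that sparsity is to count non-null values per column within a single event_type slice and keep only the populated columns. `df`, the event type and the output path below are hypothetical.)

    import org.apache.spark.sql.functions.{col, count}

    // `df` is the wide (~1000 column) DataFrame; look at one event type only.
    val slice = df.filter(col("event_type") === "click")

    // count(col) ignores nulls, so this reveals which columns are populated.
    val exprs = slice.columns.map(c => count(col(c)).alias(c))
    val row   = slice.agg(exprs.head, exprs.tail: _*).first()

    val populated = slice.columns.zipWithIndex.collect {
      case (c, i) if row.getLong(i) > 0L => c
    }

    // Writing only the populated columns avoids storing the mostly-null ones,
    // at the cost of a different schema per event type.
    slice.select(populated.map(col): _*).write
      .mode("append")
      .parquet("hdfs:///data/events-by-type/click")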

Re: groupBy and store in parquet

2016-05-04 Thread Xinh Huynh
Hi Michal, For (1), would it be possible to partitionBy two columns to reduce the size? Something like partitionBy("event_type", "date"). For (2), is there a way to separate the different event types upstream, like on different Kafka topics, and then process them separately? Xinh
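
(A minimal sketch of suggestion (1), assuming the events are already in a DataFrame `df` with an `event_type` column and a timestamp column `ts` from which a date can be derived; the output path is a placeholder.)

    import org.apache.spark.sql.functions.{col, to_date}

    // Derive a date column from the assumed timestamp column `ts`, then let
    // Spark lay the files out as .../event_type=.../date=.../part-*.parquet
    val withDate = df.withColumn("date", to_date(col("ts")))

    withDate.write
      .mode("append")
      .partitionBy("event_type", "date")
      .parquet("hdfs:///data/events")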

groupBy and store in parquet

2016-05-04 Thread Michal Vince
Hi guys, I'm trying to store a Kafka stream with ~5k events/s as efficiently as possible in Parquet format on HDFS. I can't make any changes to Kafka (it belongs to a 3rd party). Events in Kafka are in JSON format, but the problem is that there are many different event types (from different subsystems…
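
(For readers landing on this thread, a rough sketch of the kind of pipeline being discussed, using the Spark 1.x direct Kafka stream. The broker, topic and output path are made up, error handling and offset management are omitted, and this is only an illustration of the setup, not the poster's actual code.)

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Consume JSON events from Kafka and append each micro-batch to Parquet on HDFS.
    val conf = new SparkConf().setAppName("events-to-parquet")
    val ssc  = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.map(_._2).foreachRDD { rdd =>                // keep only the JSON payload
      if (!rdd.isEmpty()) {
        val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
        val df = sqlContext.read.json(rdd)              // schema inferred per batch
        df.write.mode("append").parquet("hdfs:///data/events")
      }
    }

    ssc.start()
    ssc.awaitTermination()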