Thank you so much guys for helping me, but I have some more questions about it!
Do we have to presort the columns to get the benefit of run-length encoding, or do I have to group the data first and wrap it into a case class?

I tried sorting the data first and writing it out, and I get different file sizes as a result:

  65.191.222 bytes unsorted
  62.576.598 bytes sorted

However, I see no run-length encoding in the debug output:

14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 4.572.354B for [col1] INT64: 683.189 values, 5.465.512B raw, 4.572.211B comp, 6 pages, encodings: [PLAIN, BIT_PACKED]
14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 4.687.432B for [col2] INT64: 683.189 values, 5.465.512B raw, 4.687.289B comp, 6 pages, encodings: [PLAIN, BIT_PACKED]
14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 847.267B for [col3] INT32: 683.189 values, 852.104B raw, 847.198B comp, 3 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 713 entries, 2.852B raw, 713B comp}
14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 796.082B for [col4] INT32: 683.189 values, 907.744B raw, 796.013B comp, 3 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 1.262 entries, 5.048B raw, 1.262B comp}

By the way, why is the schema wrong? I do include repeated values there, so I'm very confused!

Thanks
Matthes

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-use-Parquet-with-Dremel-encoding-tp15186p15344.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
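P.S. To illustrate why presorting matters for run-length encoding in general (independent of Parquet's actual RLE implementation), here is a minimal Scala sketch. The function name `runLengthEncode` and the sample data are hypothetical, not from any Parquet or Spark API; the point is only that equal values must be adjacent for RLE to produce fewer, longer runs, which is what sorting a column achieves.

```scala
// Hypothetical sketch: run-length encoding collapses *adjacent* equal
// values into (value, count) pairs. Unsorted data with interleaved
// values produces many short runs; the same data sorted produces few
// long runs, which is why presorting can shrink an RLE-encoded column.
def runLengthEncode[A](xs: Seq[A]): List[(A, Int)] =
  if (xs.isEmpty) Nil
  else xs.tail.foldLeft(List((xs.head, 1))) {
    // Extend the current run when the next value matches its head...
    case ((v, n) :: rest, x) if x == v => (v, n + 1) :: rest
    // ...otherwise start a new run of length 1.
    case (acc, x) => (x, 1) :: acc
  }.reverse

val unsorted = Seq(1, 2, 1, 2, 1, 2)
val presorted = unsorted.sorted

// Interleaved input: every element starts a new run (6 runs).
println(runLengthEncode(unsorted))
// Sorted input: only one run per distinct value (2 runs).
println(runLengthEncode(presorted))
```

Under that assumption, sorting should be enough to make runs appear; grouping the rows into a case class would not by itself change which values end up adjacent within a column chunk.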