Hi all,
Sorry for posting this again, but I am interested in finding out what different
on-disk data formats people use for storing timeline event and analytics aggregate data.
Currently I am just using newline-delimited JSON in gzipped files. I was wondering
if there were any recommendations.
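For reference, the format described above needs nothing beyond the standard library to produce or consume; a minimal sketch (the file name is illustrative):

```python
import gzip
import json

records = [
    {"event": "play", "ts": 1386543540},
    {"event": "pause", "ts": 1386543600},
]

# Write newline-delimited JSON, gzip-compressed: one JSON object per line.
with gzip.open("events.json.gz", "wt") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read it back by decompressing and parsing each line independently.
with gzip.open("events.json.gz", "rt") as f:
    loaded = [json.loads(line) for line in f]
```

One property worth noting: because each line is a self-contained object, the file can be processed line-by-line in a stream, but gzip is not splittable, so a single file cannot be read in parallel.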
-- Ankur
LZO compression at a minimum, and using Parquet as a second step,
seems like the way to go, though I haven't tried either personally yet.
On Dec 8, 2013, at 16:54, Ankur Chauhan achau...@brightcove.com wrote:
Hi Patrick,
I agree this is a very open-ended question, but I was trying to get a general
answer anyway, and I think you did hint at some nuances.
1. My workload is definitely bottlenecked by disk IO, because even with a
projection on a single column (mostly 2-3 out of 20) there is a lot of
Parquet might be a good fit for you then... it's pretty new and I
don't have a lot of direct experience working with it. But I've seen
examples of people using Spark with Parquet. You might want to
check out Matt Massie's post here:
http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/
Thanks for sharing.
On 2013-12-09 11:50 AM, Patrick Wendell pwend...@gmail.com wrote: