As Jörn says, Parquet and ORC will get you really good compression and can be
much faster. There are also some nice additions around predicate pushdown,
which can be great if you've got wide tables.
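
For example, against a Parquet-backed DataFrame a query like the one below
only reads the columns it touches, and the filter is a candidate for pushdown
into the Parquet reader. This is just a rough sketch in the spark-shell
(where sqlContext is predefined); the path and column names are made up:

    val events = sqlContext.read.parquet("/data/events/parquet")
    events
      .filter(events("userId") === "UserIDA")  // candidate for predicate pushdown
      .select("userId", "eventTime")           // column pruning on a wide table
      .show()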

Parquet is obviously easier to use, since it's bundled into Spark. Using ORC is 
described here 
http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/
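
For the conversion itself, something like this works with the 1.5 DataFrame
API. It's only a sketch: the paths are hypothetical, and partitioning by a
"date" column assumes your events carry one, but it means each day's batch
just appends new files instead of rewriting the whole dataset:

    import org.apache.spark.sql.SaveMode

    // Read the raw JSON events (Spark infers the schema).
    val events = sqlContext.read.json("/data/events/json/2015-10-19/*.json")

    // Append the day's batch to one Parquet dataset, partitioned by date.
    events.write
      .mode(SaveMode.Append)
      .partitionBy("date")
      .parquet("/data/events/parquet")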

Thanks,
Ewan

-----Original Message-----
From: Jörn Franke [mailto:jornfra...@gmail.com] 
Sent: 19 October 2015 06:32
To: Gavin Yue <yue.yuany...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: Should I convert json into parquet?



Good formats are Parquet or ORC. Both work well with compression, such as
Snappy, and they are much faster than JSON. However, the table structure is up
to you and depends on your use case.
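
For what it's worth, turning on Snappy for Parquet output is a one-line conf
in the 1.5-era API (df and the path below are placeholders):

    // Valid codecs include "uncompressed", "snappy", "gzip" and "lzo".
    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
    df.write.parquet("/data/events/parquet")  // df: an existing DataFrame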

> On 17 Oct 2015, at 23:07, Gavin Yue <yue.yuany...@gmail.com> wrote:
> 
> I have JSON files which contain timestamped events. Each event is
> associated with a user id.
> 
> Now I want to group by user id, so I convert from
> 
> Event1 -> UserIDA;
> Event2 -> UserIDA;
> Event3 -> UserIDB;
> 
> to intermediate storage:
> UserIDA -> (Event1, Event2...)
> UserIDB -> (Event3...)
> 
> Then I will label positives and featurize the event vectors in many
> different ways, fitting each of them into logistic regression.
> 
> I want to save the intermediate storage permanently since it will be used
> many times, and there will be new events coming every day, so I need to
> update this intermediate storage daily.
> 
> Right now I store the intermediate data as JSON files. Should I use Parquet
> instead? Or are there better solutions for this use case?
> 
> Thanks a lot!



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
