For general data access to the pre-computed aggregates (group by) you're better 
off with Parquet. I'd only choose JSON if I needed interop with another app 
stack or language that has difficulty reading Parquet (e.g. bulk loading into a 
document DB…).

On a strategic level, JSON and Parquet are similar in that neither gives you 
good random access, so you can't simply "update specific user IDs as new data 
comes in". Your strategy will probably be to re-process all the users by 
loading the new data and the current aggregates, joining them, and writing a 
new version of the aggregates…
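
A rough sketch of that daily re-process with the DataFrame API (paths, column 
names, and the eventCount aggregate are made up for illustration; assumes a 
sqlContext and Spark 1.6+ for the join-on-columns overload):

    import org.apache.spark.sql.functions.{count, coalesce, col, lit}

    // Events that arrived since the last run (hypothetical path).
    val newEvents = sqlContext.read.json("hdfs:///events/2015-10-19/")

    // Per-user aggregates over just the new data; count stands in for
    // whatever aggregate you actually maintain.
    val newAgg = newEvents.groupBy("userId").agg(count("*").as("newCount"))

    // Current aggregates previously written as Parquet (hypothetical path).
    val current = sqlContext.read.parquet("hdfs:///aggregates/current/")

    // Outer join, merge the counts, and write a brand-new version of the
    // aggregates rather than updating files in place.
    val updated = current.join(newAgg, Seq("userId"), "outer")
      .select(col("userId"),
        (coalesce(col("eventCount"), lit(0L)) +
         coalesce(col("newCount"), lit(0L))).as("eventCount"))

    updated.write.parquet("hdfs:///aggregates/2015-10-20/")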

If you're worried about update performance then you probably need to look at a 
DB that offers random write access (Cassandra, HBase…).

-adrian




On 10/19/15, 12:31 PM, "Ewan Leith" <ewan.le...@realitymine.com> wrote:

>As Jörn says, Parquet and ORC will get you really good compression and can be 
>much faster. There are also some nice additions around predicate pushdown, which 
>can be great if you've got wide tables.
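>
>For example, a filter on a Parquet read can be pushed down into the scan so 
>that only the needed columns and row groups are touched; a rough sketch (the 
>path and column names are hypothetical):
>
>    val agg = sqlContext.read.parquet("hdfs:///aggregates/current/")
>    // Only two columns are read, and the filter can be pushed down to
>    // Parquet so non-matching row groups are skipped.
>    val heavy = agg.select("userId", "eventCount").filter("eventCount > 100")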
>
>Parquet is obviously easier to use, since it's bundled into Spark. Using ORC 
>is described here 
>http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/
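>
>The ORC route from that post boils down to roughly this (a sketch; assumes 
>Spark 1.5+ with a HiveContext, and a hypothetical path):
>
>    // df is whatever DataFrame you want to persist.
>    df.write.format("orc").save("hdfs:///warehouse/events_orc/")
>    val orcDf = hiveContext.read.format("orc").load("hdfs:///warehouse/events_orc/")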
>
>Thanks,
>Ewan
>
>-----Original Message-----
>From: Jörn Franke [mailto:jornfra...@gmail.com] 
>Sent: 19 October 2015 06:32
>To: Gavin Yue <yue.yuany...@gmail.com>
>Cc: user <user@spark.apache.org>
>Subject: Re: Should I convert json into parquet?
>
>
>
>Good formats are Parquet or ORC. Both can be used with compression, such as 
>Snappy. They are much faster than JSON. However, the table structure is up to 
>you and depends on your use case.
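>
>As a rough illustration, converting the existing JSON to Snappy-compressed 
>Parquet could look like this (paths are hypothetical; the codec is set via a 
>SQL conf):
>
>    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
>    val events = sqlContext.read.json("hdfs:///raw/events/")
>    events.write.parquet("hdfs:///warehouse/events_parquet/")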
>
>> On 17 Oct 2015, at 23:07, Gavin Yue <yue.yuany...@gmail.com> wrote:
>> 
>> I have JSON files which contain timestamped events. Each event is associated 
>> with a user ID. 
>> 
>> Now I want to group by user ID, converting from
>> 
>> Event1 -> UserIDA;
>> Event2 -> UserIDA;
>> Event3 -> UserIDB;
>> 
>> to intermediate storage:
>> UserIDA -> (Event1, Event2...)
>> UserIDB -> (Event3...)
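>> 
>> A rough sketch of that grouping with the DataFrame API (the userId/event 
>> columns and the path are made up; collect_list needs Spark 1.6+, and a 
>> HiveContext before 2.0):
>> 
>>     import org.apache.spark.sql.functions.collect_list
>>     val events = sqlContext.read.json("hdfs:///raw/events/")
>>     // One row per user, with all of that user's events collected into a list.
>>     val byUser = events.groupBy("userId").agg(collect_list("event").as("events"))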
>> 
>> Then I will label positives and featurize the event vectors in many 
>> different ways, fitting each of them into a logistic regression. 
>> 
>> I want to save this intermediate storage permanently since it will be used 
>> many times. And there will be new events coming in every day, so I need to 
>> update this intermediate storage daily. 
>> 
>> Right now I store the intermediate data as JSON files. Should I use Parquet 
>> instead? Or are there better solutions for this use case?
>> 
>> Thanks a lot !
>> 
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>
