A common use case we have been seeing for Spark SQL/Parquet is to take
semi-structured JSON data and transcode it to Parquet.  Queries can then be
run over the Parquet data with a huge speedup.  The nice thing about using
JSON is that it doesn't require you to create POJOs; Spark SQL will
automatically infer the schema for you and create the equivalent Parquet
metadata.

https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
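
A minimal sketch of that flow, assuming the Spark 1.x SQLContext API from that
era (jsonFile / saveAsParquetFile; newer releases expose the same steps as
spark.read.json and DataFrame.write.parquet). The paths, the column name, and
the aggregation query are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object JsonToParquet {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("json-to-parquet"))
        val sqlContext = new SQLContext(sc)

        // Read semi-structured JSON; Spark SQL infers the schema automatically.
        val events = sqlContext.jsonFile("hdfs:///data/events.json")
        events.printSchema()

        // Transcode to Parquet; the inferred schema becomes the Parquet metadata.
        events.saveAsParquetFile("hdfs:///data/events.parquet")

        // Query the Parquet data directly and benefit from column pruning.
        // (registerTempTable was called registerAsTable in Spark 1.0.)
        val parquetEvents = sqlContext.parquetFile("hdfs:///data/events.parquet")
        parquetEvents.registerTempTable("events")
        sqlContext.sql("SELECT userId, COUNT(*) FROM events GROUP BY userId")
          .collect()
          .foreach(println)
      }
    }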


On Tue, Aug 26, 2014 at 11:38 AM, Jim <[email protected]> wrote:

>
> Thanks for the response.
>
> My intention is to have many unrelated datasets (not, if I understand you
> correctly, a collection of totally different objects). The datasets can be
> extremely wide (1000s of columns) and very deep (billions of rows), and
> very denormalized (single table) and I need to do quick aggregations of
> column data - which is why I thought Parquet/HDFS/Spark was my best current
> choice.
>
> If ALL I had to do were aggregations I'd pick a column-oriented DB like
> Vertica or HANA (or maybe Druid), but I also need to run various machine
> learning routines, so the combination of Spark/HDFS/Parquet looked like one
> solution for both problems.
>
> Of course, I'm open to other suggestions.
>
> The example you sent looks like what I'm looking for. Thanks!
> Jim
>
>
> On 08/26/2014 02:30 PM, Dmitriy Ryaboy wrote:
>
>> 1) You don't have to shell out to a compiler to generate code... but
>> that's complicated :).
>>
>> 2) Avro can be dynamic. I haven't played with that side of the world, but
>> this tutorial might help get you started:
>> https://github.com/AndreSchumacher/avro-parquet-spark-example
>>
>> 3) Do note that you should have one schema per dataset (maybe a schema you
>> didn't know until you started writing the dataset, but a schema
>> nonetheless). If your notion is to have a collection of totally different
>> objects, Parquet is a bad choice.
>>
>> D
>>
>>
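
For the runtime-schema question in point 2 above, a minimal sketch of the
generic-record route through parquet-avro: the schema is parsed from a JSON
string at runtime, so there is no IDL and no code generation. It assumes the
2014-era artifact where AvroParquetWriter lives in the parquet.avro package
(later releases moved it to org.apache.parquet.avro and added a builder); the
schema, field names, and output path are placeholders.

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.hadoop.fs.Path
    import parquet.avro.AvroParquetWriter

    object DynamicParquetWrite {
      def main(args: Array[String]): Unit = {
        // Build the Avro schema at runtime -- no generated POJOs, no classloading.
        val schemaJson =
          """{"type": "record", "name": "Row", "fields": [
            |  {"name": "id", "type": "long"},
            |  {"name": "value", "type": "double"}
            |]}""".stripMargin
        val schema = new Schema.Parser().parse(schemaJson)

        // Write GenericRecords straight to Parquet; the Avro schema becomes
        // the Parquet schema.
        val writer =
          new AvroParquetWriter[GenericRecord](new Path("/tmp/rows.parquet"), schema)
        try {
          val record = new GenericData.Record(schema)
          record.put("id", 1L)
          record.put("value", 42.0)
          writer.write(record)
        } finally {
          writer.close()
        }
      }
    }
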
>> On Tue, Aug 26, 2014 at 11:14 AM, Jim <[email protected]> wrote:
>>
>>> Hello all,
>>>
>>> I couldn't find a user list so my apologies if this falls in the wrong
>>> place. I'm looking for a little guidance. I'm a newbie with respect to
>>> Parquet.
>>>
>>> We have a use case where we don't want concrete POJOs to represent data in
>>> our store. It's dynamic in that each dataset is unique, and we need to
>>> handle incoming datasets at runtime.
>>>
>>> Examples of how to write to Parquet are sparse, and all of the ones I could
>>> find assume a Thrift/Avro/Protobuf IDL with generated schemas and POJOs. I
>>> don't want to dynamically generate an IDL, shell out to a compiler, and
>>> classload the results just to use Parquet. Is there an example that does
>>> what I'm looking for?
>>>
>>> Thanks
>>> Jim
>>>
>>>
>>>
>
