Nice -- using Spark to infer the JSON schema. That's also a good way to do it. Does it handle nesting and everything?
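For anyone following along, a rough sketch of that JSON -> Parquet transcode -- untested, and it assumes Spark 1.1's SQLContext.jsonFile / saveAsParquetFile API; the HDFS paths are just placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("json-to-parquet"))
    val sqlContext = new SQLContext(sc)

    // Infer the schema (including nested structs and arrays) from
    // line-delimited JSON -- no POJOs needed
    val events = sqlContext.jsonFile("hdfs:///data/events.json")
    events.printSchema()

    // Transcode to Parquet; the inferred schema becomes the Parquet metadata
    events.saveAsParquetFile("hdfs:///data/events.parquet")

    // Read it back and run queries against the columnar data
    val parquetEvents = sqlContext.parquetFile("hdfs:///data/events.parquet")
    parquetEvents.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()

jsonFile expects one JSON object per line, and nested structures in the input carry through the inferred schema into the Parquet files.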
On Tue, Aug 26, 2014 at 12:16 PM, Michael Armbrust <[email protected]> wrote:

> A common use case we have been seeing for Spark SQL/Parquet is to take
> semi-structured JSON data and transcode it to parquet. Queries can then be
> run over the parquet data with a huge speed up. The nice thing about using
> JSON is it doesn't require you to create POJOs and Spark SQL will
> automatically infer the schema for you and create the equivalent parquet
> metadata.
>
> https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
>
> On Tue, Aug 26, 2014 at 11:38 AM, Jim <[email protected]> wrote:
>
> > Thanks for the response.
> >
> > My intention is to have many unrelated datasets (not, if I understand you
> > correctly, a collection of totally different objects). The datasets can be
> > extremely wide (1000s of columns) and very deep (billions of rows), and
> > very denormalized (single table), and I need to do quick aggregations of
> > column data - hence why I thought Parquet/HDFS/Spark was my best current
> > choice.
> >
> > If ALL I had to do were aggregations I'd pick a column-oriented DB like
> > Vertica or Hana (or maybe Druid), but I also need to run various Machine
> > Learning routines, so the combination of Spark/HDFS/Parquet looked like
> > one solution for both problems.
> >
> > Of course, I'm open to other suggestions.
> >
> > The example you sent looks like what I'm looking for. Thanks!
> > Jim
> >
> > On 08/26/2014 02:30 PM, Dmitriy Ryaboy wrote:
> >
> >> 1) You don't have to shell out to a compiler to generate code... but
> >> that's complicated :).
> >>
> >> 2) Avro can be dynamic. I haven't played with that side of the world, but
> >> this tutorial might help get you started:
> >> https://github.com/AndreSchumacher/avro-parquet-spark-example
> >>
> >> 3) Do note that you should have 1 schema per dataset (maybe a schema you
> >> didn't know until you started writing the dataset, but a schema
> >> nonetheless). If your notion is to have a collection of totally different
> >> objects, parquet is a bad choice.
> >>
> >> D
> >>
> >> On Tue, Aug 26, 2014 at 11:14 AM, Jim <[email protected]> wrote:
> >>
> >>> Hello all,
> >>>
> >>> I couldn't find a user list so my apologies if this falls in the wrong
> >>> place. I'm looking for a little guidance. I'm a newbie with respect to
> >>> Parquet.
> >>>
> >>> We have a use case where we don't want concrete POJOs to represent data
> >>> in our store. It's dynamic in that each dataset is unique and dynamic
> >>> and we need to handle incoming datasets at runtime.
> >>>
> >>> Examples of how to write to Parquet are sparse and all of the ones I
> >>> could find assume Thrift/Avro/Protobuf IDL and generated schema and
> >>> POJOs. I don't want to dynamically generate an IDL, shell out to a
> >>> compiler, and classload the results in order to use Parquet. Is there an
> >>> example that does what I'm looking for?
> >>>
> >>> Thanks
> >>> Jim
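For completeness, a rough sketch of the "dynamic Avro" route mentioned above -- untested; it builds an Avro Schema at runtime and writes GenericRecords through parquet-avro's AvroParquetWriter (package parquet.avro in the pre-Apache 1.x releases), so no IDL, codegen, or POJOs are involved. The record name, fields, and output path are made up for illustration:

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.hadoop.fs.Path
    import parquet.avro.AvroParquetWriter

    // Schema assembled at runtime from a JSON string -- nothing generated ahead of time
    val schemaJson = """
      {"type": "record", "name": "Event", "fields": [
        {"name": "id",    "type": "long"},
        {"name": "name",  "type": "string"},
        {"name": "score", "type": "double"}
      ]}"""
    val schema = new Schema.Parser().parse(schemaJson)

    // AvroParquetWriter accepts plain GenericRecords built against that schema
    val writer = new AvroParquetWriter[GenericRecord](new Path("/tmp/events.parquet"), schema)
    try {
      val rec = new GenericData.Record(schema)
      rec.put("id", 1L)
      rec.put("name", "example")
      rec.put("score", 0.5)
      writer.write(rec)
    } finally {
      writer.close()
    }

The same GenericRecord approach works on the read side via AvroParquetReader, so the whole round trip can stay schema-at-runtime.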
