Nice -- using Spark to infer the JSON schema. That's also a good way to do it. Does it handle nesting and everything?
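For anyone following along, a rough sketch of that JSON -> Parquet transcode -- untested, and it assumes Spark 1.1's SQLContext.jsonFile / saveAsParquetFile API; the HDFS paths are just placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("json-to-parquet"))
    val sqlContext = new SQLContext(sc)

    // Infer the schema (including nested structs and arrays) from
    // line-delimited JSON -- no POJOs needed
    val events = sqlContext.jsonFile("hdfs:///data/events.json")
    events.printSchema()

    // Transcode to Parquet; the inferred schema becomes the Parquet metadata
    events.saveAsParquetFile("hdfs:///data/events.parquet")

    // Read it back and run queries against the columnar data
    val parquetEvents = sqlContext.parquetFile("hdfs:///data/events.parquet")
    parquetEvents.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()

jsonFile expects one JSON object per line, and nested structures in the input carry through the inferred schema into the Parquet files.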
On Tue, Aug 26, 2014 at 12:16 PM, Michael Armbrust <[email protected]> wrote:

> A common use case we have been seeing for Spark SQL/Parquet is to take
> semi-structured JSON data and transcode it to parquet. Queries can then be
> run over the parquet data with a huge speed up. The nice thing about using
> JSON is it doesn't require you to create POJOs and Spark SQL will
> automatically infer the schema for you and create the equivalent parquet
> metadata.
>
> https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
>
> On Tue, Aug 26, 2014 at 11:38 AM, Jim <[email protected]> wrote:
>
> > Thanks for the response.
> >
> > My intention is to have many unrelated datasets (not, if I understand you
> > correctly, a collection of totally different objects). The datasets can be
> > extremely wide (1000s of columns) and very deep (billions of rows), and
> > very denormalized (single table), and I need to do quick aggregations of
> > column data - hence why I thought Parquet/HDFS/Spark was my best current
> > choice.
> >
> > If ALL I had to do were aggregations I'd pick a column-oriented DB like
> > Vertica or Hana (or maybe Druid), but I also need to run various Machine
> > Learning routines, so the combination of Spark/HDFS/Parquet looked like
> > one solution for both problems.
> >
> > Of course, I'm open to other suggestions.
> >
> > The example you sent looks like what I'm looking for. Thanks!
> > Jim
> >
> > On 08/26/2014 02:30 PM, Dmitriy Ryaboy wrote:
> >
> >> 1) You don't have to shell out to a compiler to generate code... but
> >> that's complicated :).
> >>
> >> 2) Avro can be dynamic. I haven't played with that side of the world, but
> >> this tutorial might help get you started:
> >> https://github.com/AndreSchumacher/avro-parquet-spark-example
> >>
> >> 3) Do note that you should have 1 schema per dataset (maybe a schema you
> >> didn't know until you started writing the dataset, but a schema
> >> nonetheless). If your notion is to have a collection of totally different
> >> objects, parquet is a bad choice.
> >>
> >> D
> >>
> >> On Tue, Aug 26, 2014 at 11:14 AM, Jim <[email protected]> wrote:
> >>
> >>> Hello all,
> >>>
> >>> I couldn't find a user list so my apologies if this falls in the wrong
> >>> place. I'm looking for a little guidance. I'm a newbie with respect to
> >>> Parquet.
> >>>
> >>> We have a use case where we don't want concrete POJOs to represent data
> >>> in our store. It's dynamic in that each dataset is unique and dynamic
> >>> and we need to handle incoming datasets at runtime.
> >>>
> >>> Examples of how to write to Parquet are sparse and all of the ones I
> >>> could find assume Thrift/Avro/Protobuf IDL and generated schema and
> >>> POJOs. I don't want to dynamically generate an IDL, shell out to a
> >>> compiler, and classload the results in order to use Parquet. Is there an
> >>> example that does what I'm looking for?
> >>>
> >>> Thanks
> >>> Jim
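For completeness, a rough sketch of the "dynamic Avro" route mentioned above -- untested; it builds an Avro Schema at runtime and writes GenericRecords through parquet-avro's AvroParquetWriter (package parquet.avro in the pre-Apache 1.x releases), so no IDL, codegen, or POJOs are involved. The record name, fields, and output path are made up for illustration:

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericRecord}
    import org.apache.hadoop.fs.Path
    import parquet.avro.AvroParquetWriter

    // Schema assembled at runtime from a JSON string -- nothing generated ahead of time
    val schemaJson = """
      {"type": "record", "name": "Event", "fields": [
        {"name": "id",    "type": "long"},
        {"name": "name",  "type": "string"},
        {"name": "score", "type": "double"}
      ]}"""
    val schema = new Schema.Parser().parse(schemaJson)

    // AvroParquetWriter accepts plain GenericRecords built against that schema
    val writer = new AvroParquetWriter[GenericRecord](new Path("/tmp/events.parquet"), schema)
    try {
      val rec = new GenericData.Record(schema)
      rec.put("id", 1L)
      rec.put("name", "example")
      rec.put("score", 0.5)
      writer.write(rec)
    } finally {
      writer.close()
    }

The same GenericRecord approach works on the read side via AvroParquetReader, so the whole round trip can stay schema-at-runtime.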
