Hey Andy, thanks for posting your code to Gerrit. It's great to see more
work progressing on the Spark side of things!

Some other stuff you guys might want to consider for the create table
feature, in addition to the partitioning details, is the ability to specify
a replication factor. In fact, one might want to specify any of the fields
in the CreateTableRequestPB
<https://github.com/apache/incubator-kudu/blob/master/src/kudu/master/master.proto#L337>:

message CreateTableRequestPB {
  required string name = 1;
  required SchemaPB schema = 2;
  optional RowOperationsPB split_rows = 6;
  optional PartitionSchemaPB partition_schema = 7;
  optional int32 num_replicas = 4;
}
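
To make that concrete, here is a rough sketch in Scala against the Java
client (package names are from the incubator releases; the createKuduTable
helper and its defaults are made up for illustration, not an existing API)
of what exposing the replication factor and partitioning could look like:

import scala.collection.JavaConverters._
import org.kududb.{ColumnSchema, Schema, Type}
import org.kududb.client.{CreateTableOptions, KuduClient}

// Hypothetical helper (not an existing API): create a table with an
// explicit replication factor and hash partitioning, mirroring the
// num_replicas and partition_schema fields above.
def createKuduTable(client: KuduClient, name: String): Unit = {
  val columns = List(
    new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64).key(true).build(),
    new ColumnSchema.ColumnSchemaBuilder("value", Type.STRING).build()
  ).asJava
  val options = new CreateTableOptions()
    .setNumReplicas(3)                        // num_replicas
    .addHashPartitions(List("id").asJava, 4)  // partition_schema
  client.createTable(name, new Schema(columns), options)
}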

Ideally you could even specify encodings for the columns (dictionary
encoding, etc) as part of the schema.
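
For instance, building on the createTable(name, schema: StructType) idea
from downthread, here's a sketch of how a Spark integration might map
StructType fields to Kudu columns and pick an encoding per column (the
helper and the type mapping are illustrative only, again using the
incubator-era org.kududb package names):

import org.apache.spark.sql.types._
import org.kududb.{ColumnSchema, Type => KuduType}

// Illustrative mapping from a Spark SQL field to a Kudu column,
// dictionary-encoding string columns. Type coverage is deliberately
// partial; a real patch would handle the full set of Spark types.
def toKuduColumn(field: StructField, isKey: Boolean): ColumnSchema = {
  val kuduType = field.dataType match {
    case IntegerType => KuduType.INT32
    case LongType    => KuduType.INT64
    case StringType  => KuduType.STRING
    case other       => sys.error(s"unsupported type: $other")
  }
  val builder = new ColumnSchema.ColumnSchemaBuilder(field.name, kuduType)
    .key(isKey)
  if (field.dataType == StringType)
    builder.encoding(ColumnSchema.Encoding.DICT_ENCODING)
  builder.build()
}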

All of this does seem orthogonal to the DataFrame writing capability, which
is probably more straightforward in terms of features. Consider treating
these as two separate patches: one that assumes the table already exists,
and one that adds the table creation capability. Gerrit will treat two
separate commits as two separate code reviews, even if one depends on the
other (it can track that dependency). Splitting them up will make the work
easier to review and will probably get each piece committed faster.

Mike


--
Mike Percy
Software Engineer, Cloudera


On Thu, May 5, 2016 at 7:52 PM, Andy Grove <[email protected]> wrote:

> Hi Dan,
>
> Thanks for the feedback. I'm thinking that the PK could be specified by
> passing in a tuple of column indices to use for the key (assuming that
> composite keys are supported?).
>
> Maybe it would be useful for KuduContext to have createTable(name: String,
> schema: StructType) and dropTable(name: String) methods that can eventually
> be wrapped in Spark SQL?
>
> I'm just getting up to speed with Kudu by trying to run TPC-H via Spark,
> and the use case driving the current code was being able to take existing
> DataFrames (created by reading CSV files) and save them to Kudu so that we
> can then start testing the TPC-H SQL queries.
>
> We will be experimenting with both Spark SQL and our own SQL parser /
> query planner for Spark, which we are planning on open sourcing too.
>
> Thanks,
>
> Andy.
>
> On Thu, May 5, 2016 at 7:16 PM, Dan Burkert <[email protected]> wrote:
>
> > Hey Andy,
> >
> > Thanks for the patch!  I left you some specific feedback on the Gerrit
> > review, but I want to discuss the high-level approach a bit.  I think the
> > patch as it's written now is going to have limited use, because it doesn't
> > allow for specifying primary keys or partitioning, which are critical for
> > correctness and performance. In the long run we will definitely want to be
> > able to create tables through Spark SQL, but perhaps we should start off
> > with just inserting/updating rows in existing tables.  It would be
> > interesting to see how other databases have solved this problem, since I'm
> > sure we're not the only ones with configuration options on table create.
> > The relational databases in particular must have PK options.
> >
> > - Dan
> >
> > On Thu, May 5, 2016 at 5:51 PM, Andy Grove <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I'm working with some colleagues at AgilData on Spark/Kudu integration
> > > and we expect to be able to contribute a number of features to the code
> > > base.
> > >
> > > To kick things off, here is a gerrit for discussion that adds support
> > > for persisting a DataFrame to a Kudu table. It would be great to hear
> > > feedback and feature requests for this capability.
> > >
> > > http://gerrit.cloudera.org:8080/#/c/2969/
> > >
> > > Thanks,
> > >
> > > Andy.
> > >
> >
>
