Re: Spark on Kudu

2016-10-10 Thread Mark Hamstra
I realize that the Spark on Kudu work to date has been based on Spark 1.6,
where your statement about Spark SQL relying on Hive is true.  In Spark
2.0, however, that dependency no longer exists since Spark SQL essentially
copied over the parts of Hive that were needed into Spark itself, and has
been free to diverge since then.

On Mon, Oct 10, 2016 at 4:11 PM, Dan Burkert  wrote:

> Hi Ben,
>
> SparkSQL relies on Hive for DDL statements, so having support for this
> requires adding support to Hive for manipulating Kudu tables.  This is
> something that we would like to do in the long term, but there are no
> concrete plans (that I know of) to make it happen in the near term.
>
> - Dan
>
> On Thu, Oct 6, 2016 at 4:38 PM, Benjamin Kim  wrote:
>
>> Anyone know if the Spark package will ever allow for creating tables in
>> Spark SQL?
>>
>> Such as:
>>    CREATE EXTERNAL TABLE 
>>    USING org.apache.kudu.spark.kudu
>>    OPTIONS (Map("kudu.master" -> "", "kudu.table" -> "table-name"));
>>
>> In this way, plain SQL could be used to issue DDL and DML statements,
>> whether in Spark SQL code or over JDBC to the Spark SQL Thriftserver.
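>>
>> In the meantime, a minimal sketch of the usual workaround (the master
>> address and table names below are placeholders): read the table through
>> the connector and register it as a temporary table, which makes it
>> queryable from plain SQL and from the Thriftserver.
>>
>>    // Spark 1.6-style; on Spark 2.x use spark.read and createOrReplaceTempView
>>    val df = sqlContext.read
>>      .format("org.apache.kudu.spark.kudu")
>>      .option("kudu.master", "kudu-master:7051")
>>      .option("kudu.table", "table-name")
>>      .load()
>>    df.registerTempTable("table_name")   // now usable in SQL queries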
>>
>> By the way, we are trying to create a DMP in Kudu with a farm of RESTful
>> endpoints to do cookie sync, ad serving, and segmentation data exchange.
>> The Spark compute cluster and the Kudu cluster will reside on the same
>> racks in the same datacenter.
>>
>> Thanks,
>> Ben
>>
>> On Sep 20, 2016, at 3:02 PM, Jordan Birdsell wrote:
>>
>> http://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark
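>>
>> For reference, the basic read pattern from that page looks roughly like
>> the following (a sketch; the master address and table name are
>> placeholders):
>>
>>    import org.apache.kudu.spark.kudu._
>>
>>    // Load a Kudu table as a DataFrame via the kudu-spark connector
>>    val df = sqlContext.read
>>      .options(Map("kudu.master" -> "kudu-master:7051",
>>                   "kudu.table" -> "my_table"))
>>      .kudu
>>    df.show()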
>>
>> On Tue, Sep 20, 2016 at 5:00 PM Benjamin Kim  wrote:
>>
>>> I see that the API has changed a bit so my old code doesn’t work
>>> anymore. Can someone direct me to some code samples?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Sep 20, 2016, at 1:44 PM, Todd Lipcon  wrote:
>>>
>>> On Tue, Sep 20, 2016 at 1:18 PM, Benjamin Kim  wrote:
>>>
 Now that Kudu 1.0.0 is officially out and ready for production use,
 where do we find the spark connector jar for this release?


>>> It's available in the official ASF maven repository:
>>> https://repository.apache.org/#nexus-search;quick~kudu-spark
>>>
>>> <dependency>
>>>   <groupId>org.apache.kudu</groupId>
>>>   <artifactId>kudu-spark_2.10</artifactId>
>>>   <version>1.0.0</version>
>>> </dependency>
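>>>
>>> Equivalently, to try it out from the shell, something like the following
>>> should pull the connector in (coordinates taken from the dependency
>>> above):
>>>
>>>    spark-shell --packages org.apache.kudu:kudu-spark_2.10:1.0.0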
>>>
>>>
>>> -Todd
>>>
>>>
>>>
 On Jun 17, 2016, at 11:08 AM, Dan Burkert  wrote:

 Hi Ben,

 To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL,
 I do not think we support that at this point.  I haven't looked deeply into
 it, but we may hit issues specifying Kudu-specific options (partitioning,
 column encoding, etc.).  Probably issues that can be worked through
 eventually, though.  If you are interested in contributing to Kudu, this is
 an area that could obviously use improvement!  Most or all of our Spark
 features have been completely community driven to date.
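 
 For completeness, until SQL DDL support exists, those Kudu-specific options
 can be set when creating the table directly through the Kudu client. A rough
 sketch in Scala against the Java client (master address, table name, and the
 specific option values are assumptions for illustration, not a tested recipe):
 
    import scala.collection.JavaConverters._
    import org.apache.kudu.{ColumnSchema, Schema, Type}
    import org.apache.kudu.ColumnSchema.ColumnSchemaBuilder
    import org.apache.kudu.client.{CreateTableOptions, KuduClient}
 
    val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
 
    // Kudu-specific column options (key columns, encodings) live on the schema
    val columns = List(
      new ColumnSchemaBuilder("id", Type.STRING).key(true).build(),
      new ColumnSchemaBuilder("segment", Type.STRING)
        .encoding(ColumnSchema.Encoding.DICT_ENCODING).build()
    ).asJava
 
    // Partitioning and replication are specified via CreateTableOptions
    val options = new CreateTableOptions()
      .addHashPartitions(List("id").asJava, 4)
      .setNumReplicas(3)
 
    client.createTable("my_table", new Schema(columns), options)
    client.close()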


> I am assuming that more Spark support along with semantic changes
> below will be incorporated into Kudu 0.9.1.
>

 As a rule we do not release new features in patch releases, but the
 good news is that we are releasing regularly, and our next scheduled
 release is for the August timeframe (see JD's roadmap email about what we
 are aiming to include).  Also, Cloudera does publish snapshot versions of
 the Spark connector, so the jars are available if you don't mind using
 snapshots.


> Anyone know of a better way to make unique primary keys other than
> using UUID to make every row unique if there is no unique column (or
> combination thereof) to use.
>

 Not that I know of.  In general it's pretty rare to have a dataset
 without a natural primary key (even if it's just all of the columns), but
 in those cases UUID is a good solution.
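 
 If it helps, a small sketch of tacking a UUID key onto a DataFrame before
 writing it to Kudu (the DataFrame `df`, the table name, and the use of
 KuduContext.insertRows are assumptions for illustration):
 
    import java.util.UUID
    import org.apache.spark.sql.functions.udf
 
    // Generate a random UUID string per row to serve as the surrogate key
    val makeId = udf(() => UUID.randomUUID().toString)
    val withKey = df.withColumn("id", makeId())
 
    // then write, e.g. kuduContext.insertRows(withKey, "my_table")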


> This is what I am using. I know auto incrementing is coming down the
> line (don’t know when), but is there a way to simulate this in Kudu using
> Spark out of curiosity?
>

 To my knowledge there is no plan to have auto increment in Kudu.
 Distributed, consistent, auto incrementing counters is a difficult problem,
 and I don't think there are any known solutions that would be fast enough
 for Kudu (happy to be proven wrong, though!).
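 
 On the Spark side, monotonically_increasing_id() is sometimes used as a
 rough stand-in, but it only yields IDs that are unique within a single job,
 not a consistent distributed counter:
 
    import org.apache.spark.sql.functions.monotonically_increasing_id
 
    // IDs are unique per job run but neither contiguous nor stable across runs
    val withId = df.withColumn("row_id", monotonically_increasing_id())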

 - Dan


>
> Thanks,
> Ben
>
> On Jun 14, 2016, at 6:08 PM, Dan Burkert  wrote:
>
> I'm not sure exactly what the semantics will be, but at least one of
> them will be upsert.  These modes come from Spark, and they were 

Re: Schema Normalization

2016-10-10 Thread Todd Lipcon
On Mon, Oct 10, 2016 at 4:44 PM, Benjamin Kim  wrote:

> Todd,
>
> We are not going crazy with normalization. Actually, we are only
> normalizing where necessary. For example, we have a table for profiles and
> behaviors. They are joined together by a behavior status table. Each of
> these tables is de-normalized when it comes to basic attributes. That’s
> the extent of it. From the sound of it, it looks like we are good for now.
>

Yea, sounds good.

One thing to keep an eye on is
https://issues.cloudera.org/browse/IMPALA-4252 if you use Impala - this
should help a lot with joins where one side of the join has selective
predicates on a large table.

-Todd


>
> On Oct 10, 2016, at 4:15 PM, Todd Lipcon  wrote:
>
> Hey Ben,
>
> Yea, we currently don't do great with very wide tables. For example, on
> flushes, we'll separately write and fsync each of the underlying columns,
> so if you have hundreds, it can get very expensive. Another factor is that
> currently every 'Write' RPC actually contains the full schema information
> for all columns, regardless of whether you've set them for a particular row.
>
> I'm sure we'll make improvements in these areas in the coming
> months/years, but for now, the recommendation is to stick with a schema
> that looks more like an RDBMS schema than an HBase one.
>
> However, I wouldn't go _crazy_ on normalization. For example, I wouldn't
> bother normalizing out a 'date' column into a 'date_id' and separate
> 'dates' table, as one might have done in a fully normalized RDBMS table in
> days of yore. Kudu's columnar layout, in conjunction with encodings like
> dictionary encoding, make that kind of normalization ineffective or even
> counter-productive as they introduce extra joins and query-time complexity.
>
> One other item to note is that with more normalized schemas, it requires
> more of your query engine's planning capabilities. If you aren't doing
> joins, a very dumb query planner is fine. If you're doing complex joins
> across 10+ tables, then the quality of plans makes an enormous difference
> in query performance. To speak in concrete terms, I would guess that with
> more heavily normalized schemas, Impala's query planner would do a much
> better job than Spark's, given that we don't currently expose information
> on table sizes to Spark and thus it's likely to do a poor job of join
> ordering.
>
> Hope that helps
>
> -Todd
>
>
> On Fri, Oct 7, 2016 at 7:47 PM, Benjamin Kim  wrote:
>
>> I would like to know whether normalization techniques are necessary when
>> modeling table schemas in Kudu. I read that a table with
>> around 50 columns is ideal. This would mean a very wide table should be
>> avoided.
>>
>> Thanks,
>> Ben
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera