Re: Spark on Kudu

2016-06-15 Thread Benjamin Kim
Since I have created permanent tables using org.apache.spark.sql.jdbc and com.databricks.spark.csv with sqlContext, I was wondering if I can do the same with Kudu tables? CREATE TABLE USING org.kududb.spark.kudu OPTIONS ("kudu.master" "kudu_master", "kudu.table" "kudu_tablename") Is this
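For context, a minimal sketch of the DataFrame-API analogue of the syntax quoted above, assuming Spark 1.6 with the kudu-spark artifact on the classpath; the master address and table name are placeholders. Whether a permanent (non-temporary) table can be registered this way is the open question here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("kudu-spark-sql-sketch"))
val sqlContext = new SQLContext(sc)

// Read the Kudu table through the DataSource named in the CREATE TABLE statement.
val df = sqlContext.read
  .format("org.kududb.spark.kudu")
  .option("kudu.master", "kudu_master:7051")  // placeholder master address
  .option("kudu.table", "kudu_tablename")     // placeholder table name
  .load()

// Equivalent of a temporary-table registration, queryable from Spark SQL.
df.registerTempTable("kudu_tablename")
sqlContext.sql("SELECT COUNT(*) FROM kudu_tablename").show()
```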

Re: Kudu QuickStart VM 0.9.0?

2016-06-15 Thread Jean-Daniel Cryans
It's up. It's a different filename since I also upgraded from 5.4.9 to 5.7.1, so you'll need to update your kudu-examples repo first. J-D On Wed, Jun 15, 2016 at 10:19 AM, Tom White wrote: > Thanks J-D. > > Tom > > On Wed, Jun 15, 2016 at 6:06 PM, Jean-Daniel Cryans

Re: Performance Question

2016-06-15 Thread Dan Burkert
Adding partition splits when using range partitioning is done via the CreateTableOptions.addSplitRow method. You can find more about the different partitioning options in the schema design
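A hedged sketch of what Dan describes, assuming the 0.9-era Java client (org.kududb.*) called from Scala; the table name, column names, and split value are illustrative, not from the thread.

```scala
import scala.collection.JavaConverters._
import org.kududb.ColumnSchema.ColumnSchemaBuilder
import org.kududb.{Schema, Type}
import org.kududb.client.{CreateTableOptions, KuduClient}

val client = new KuduClient.KuduClientBuilder("kudu_master:7051").build()

val schema = new Schema(List(
  new ColumnSchemaBuilder("event_time", Type.INT64).key(true).build(),
  new ColumnSchemaBuilder("payload", Type.STRING).build()
).asJava)

// Each split row adds a tablet boundary for the range-partitioned key.
val split = schema.newPartialRow()
split.addLong("event_time", 1466000000000L)

val options = new CreateTableOptions().addSplitRow(split)
client.createTable("events", schema, options)
client.shutdown()
```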

Kudu QuickStart VM 0.9.0?

2016-06-15 Thread Tom White
Hi, I tried downloading the VM for the new release, but it looks like it's still on 0.7.0: https://github.com/cloudera/kudu-examples/commit/9a22e9f6280094f029c049a7776cce3458150e7f Are there plans to update it? I find it very useful for trying out Kudu. Thanks! Tom

Re: Performance Question

2016-06-15 Thread Benjamin Kim
Todd, I think there is no locality within our setup. We have the compute cluster with Spark, YARN, etc. on its own, and we have the storage cluster with HBase, Kudu, etc. on another. We beefed up the hardware specs on the compute cluster and the storage capacity on the storage cluster. We

Re: Performance Question

2016-06-15 Thread Todd Lipcon
Awesome use case. One thing to keep in mind is that Spark parallelism will be limited by the number of tablets. So, you might want to split into 10 or so buckets per node to get the best query throughput. Usually if you run top on some machines while running the query, you can see if it is fully
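A hedged sketch of the bucketing suggestion above, using the 0.9-era Java client (org.kududb.*) from Scala; the column name and cluster size are illustrative assumptions.

```scala
import scala.collection.JavaConverters._
import org.kududb.client.CreateTableOptions

// With ~10 tablet servers and ~10 buckets each, a scan fans out over ~100 tablets,
// which becomes the upper bound on Spark task parallelism for that table.
val bucketsPerNode = 10
val tabletServers = 10
val options = new CreateTableOptions()
  .addHashPartitions(List("event_id").asJava, bucketsPerNode * tabletServers)
```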

Re: Performance Question

2016-06-15 Thread Benjamin Kim
Hi Todd, I did a simple test of our ad events. We stream using Spark Streaming directly into HBase, and the Data Analysts/Scientists do some insight/discovery work plus some report generation. For the reports, we use SQL, and for the deeper stuff, we use Spark. In Spark, our main data

Re: Performance Question

2016-06-15 Thread Todd Lipcon
Hi Benjamin, What workload are you using for benchmarks? Using Spark or something more custom? RDD or DataFrame or SQL, etc.? Maybe you can share the schema and some queries. Todd On Jun 15, 2016 8:10 AM, "Benjamin Kim" wrote: > Hi Todd, > > Now that Kudu 0.9.0 is out.

Re: Performance Question

2016-06-15 Thread Benjamin Kim
Hi Todd, Now that Kudu 0.9.0 is out, I have done some tests. Already, I am impressed. Compared to HBase, read and write performance are better. Write performance has the greatest improvement (> 4x), while read is > 1.5x. Granted, these are only preliminary tests. Do you know of a way to really
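For what it's worth, a rough sketch of how such a read test might be timed from Spark, assuming the kudu-spark DataSource; the timed helper, table name, and master address are hypothetical, not taken from the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Tiny timing helper (hypothetical); prints wall-clock seconds for the body.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
  result
}

val sc = new SparkContext(new SparkConf().setAppName("kudu-read-test"))
val sqlContext = new SQLContext(sc)

val events = sqlContext.read
  .format("org.kududb.spark.kudu")
  .option("kudu.master", "kudu_master:7051")  // placeholder
  .option("kudu.table", "ad_events")          // placeholder
  .load()

timed("full scan count")(events.count())
```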