Since I have created permanent tables using org.apache.spark.sql.jdbc and 
com.databricks.spark.csv with sqlContext, I was wondering if I can do the same 
with Kudu tables?

CREATE TABLE <tablename>
USING org.kududb.spark.kudu
OPTIONS ("kudu.master” "kudu_master","kudu.table” "kudu_tablename”)

Is this possible? By the way, the above didn’t work for me.
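
For reference, the DataFrame read path described in the 0.9.0 docs looks 
roughly like this (a sketch; the master address and table name are 
placeholders):

import org.kududb.spark.kudu._

val df = sqlContext.read
    .options(Map("kudu.master" -> "kudu_master", "kudu.table" -> "kudu_tablename"))
    .kudu

What I’m hoping for is a CREATE TABLE ... USING equivalent of that read path.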

Thanks,
Ben

> On Jun 14, 2016, at 6:08 PM, Dan Burkert <d...@cloudera.com> wrote:
> 
> I'm not sure exactly what the semantics will be, but at least one of them 
> will be upsert.  These modes come from spark, and they were really designed 
> for file-backed storage and not table storage.  We may want to do append = 
> upsert, and overwrite = truncate + insert.  I think that may match the normal 
> spark semantics more closely.
> 
> - Dan
> 
> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Dan,
> 
> Thanks for the information. That would mean both “append” and “overwrite” 
> modes would be combined or not needed in the future.
> 
> Cheers,
> Ben
> 
>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <d...@cloudera.com> wrote:
>> 
>> Right now append uses an update Kudu operation, which requires the row 
>> already be present in the table. Overwrite maps to insert.  Kudu very 
>> recently got upsert support baked in, but it hasn't yet been integrated into 
>> the Spark connector.  So pretty soon these sharp edges will get a lot 
>> better, since upsert is the way to go for most spark workloads.
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>> data. Now, I have to find answers to these questions. What would happen if I 
>> “append” to the data in the Kudu table if the data already exists? What 
>> would happen if I “overwrite” existing data when the DataFrame has data in 
>> it that does not exist in the Kudu table? I need to evaluate the best way to 
>> simulate the UPSERT behavior we have in HBase, because that is our use case.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> 
>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> Now, I’m getting this error when trying to write to the table.
>>> 
>>> import scala.collection.JavaConverters._
>>> import org.kududb.client._        // CreateTableOptions
>>> import org.kududb.spark.kudu._    // provides the .kudu implicit on df.write
>>> 
>>> val key_seq = Seq("my_id")
>>> val key_list = List("my_id").asJava
>>> kuduContext.createTable(tableName, df.schema, key_seq, new 
>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>> 
>>> df.write
>>>     .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
>>>     .mode("overwrite")
>>>     .kudu
>>> 
>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to 
>>> Kudu; sample errors: Not found: key not found (error 0)Not found: key not 
>>> found (error 0)Not found: key not found (error 0)Not found: key not found 
>>> (error 0)Not found: key not found (error 0)
>>> 
>>> Does the key field need to be first in the DataFrame?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>> Dan,
>>>> 
>>>> Thanks! It got further. Now, how do I set the Primary Key to be a 
>>>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>>> 
>>>> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new 
>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>>>> 
>>>> java.lang.IllegalArgumentException: Table partitioning must be specified 
>>>> using setRangePartitionColumns or addHashPartitions
>>>> 
>>>> Yep.  The `Seq("my_id")` part of that call is specifying the set of 
>>>> primary key columns, so in this case you have specified the single PK 
>>>> column "my_id".  The `addHashPartitions` call adds hash partitioning to 
>>>> the table, in this case over the column "my_id" (which is good, it must be 
>>>> over one or more PK columns, so in this case "my_id" is the one and only 
>>>> valid combination).  However, the call to `addHashPartitions` also takes 
>>>> the number of buckets as the second param.  You shouldn't get the 
>>>> IllegalArgumentException as long as you are specifying either 
>>>> `addHashPartitions` or `setRangePartitionColumns`.
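>>>> 
>>>> For example, a call like this should pass that validation (a sketch: the 
>>>> bucket count of 32 is an arbitrary choice, and the column list must be a 
>>>> java.util.List):
>>>> 
>>>> import scala.collection.JavaConverters._
>>>> 
>>>> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new 
>>>> CreateTableOptions().setNumReplicas(1)
>>>>     .addHashPartitions(List("my_id").asJava, 32))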
>>>> 
>>>> - Dan
>>>>  
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>>>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com> wrote:
>>>>> 
>>>>> Looks like we're missing an import statement in that example.  Could you 
>>>>> try:
>>>>> 
>>>>> import org.kududb.client._
>>>>> and try again?
>>>>> 
>>>>> - Dan
>>>>> 
>>>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>> I encountered an error trying to create a table based on the 
>>>>> documentation from a DataFrame.
>>>>> 
>>>>> <console>:49: error: not found: type CreateTableOptions
>>>>>               kuduContext.createTable(tableName, df.schema, Seq("key"), 
>>>>> new CreateTableOptions().setNumReplicas(1))
>>>>> 
>>>>> Is there something I’m missing?
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>> 
>>>>>> It's only in Cloudera's maven repo: 
>>>>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>>>>> 
>>>>>> J-D
>>>>>> 
>>>>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>> Hi J-D,
>>>>>> 
>>>>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for 
>>>>>> spark-shell to use. Can you show me where to find it?
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> 
>>>>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>> 
>>>>>>> What's in this doc is what's gonna get released: 
>>>>>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>>>>> 
>>>>>>> J-D
>>>>>>> 
>>>>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>> Will this be documented with examples once 0.9.0 comes out?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>> 
>>>>>>> 
>>>>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>> 
>>>>>>>> It will be in 0.9.0.
>>>>>>>> 
>>>>>>>> J-D
>>>>>>>> 
>>>>>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>> Hi Chris,
>>>>>>>> 
>>>>>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Ben
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On May 18, 2016, at 9:01 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>>>>>>> 
>>>>>>>>> There is some code in review that needs some more refinement.
>>>>>>>>> It will allow upsert/insert from a dataframe using the datasource 
>>>>>>>>> api. It will also allow the creation and deletion of tables from a 
>>>>>>>>> dataframe
>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/
>>>>>>>>> 
>>>>>>>>> Example usages will look something like:
>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>>>>>>>> 
>>>>>>>>> -Chris George
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Can someone tell me what the state is of this Spark work?
>>>>>>>>> 
>>>>>>>>> Also, does anyone have any sample code on how to update/insert data 
>>>>>>>>> in Kudu using DataFrames?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Ben
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> SparkSQL cannot support these types of statements, but we may be 
>>>>>>>>>> able to implement similar functionality through the API.
>>>>>>>>>> -Chris
>>>>>>>>>> 
>>>>>>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” 
>>>>>>>>>> if it were to be implemented.
>>>>>>>>>> 
>>>>>>>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>>>>>>>  WHEN MATCHED THEN
>>>>>>>>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>>>>>>>  WHEN NOT MATCHED THEN
>>>>>>>>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Ben
>>>>>>>>>> 
>>>>>>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I have a wip kuduRDD that I made a few months ago. I pushed it into 
>>>>>>>>>>> gerrit if you want to take a look. 
>>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/
>>>>>>>>>>> It does predicate pushdown, which the existing InputFormat-based 
>>>>>>>>>>> RDD does not.
>>>>>>>>>>> 
>>>>>>>>>>> Within the next two weeks I’m planning to implement a datasource 
>>>>>>>>>>> for spark that will have pushdown predicates and insertion/update 
>>>>>>>>>>> functionality (need to look more at cassandra and the hbase 
>>>>>>>>>>> datasource for the best way to do this). I agree that server side 
>>>>>>>>>>> upsert would be helpful.
>>>>>>>>>>> Having a datasource would give us useful data frames and also make 
>>>>>>>>>>> spark sql usable for kudu.
>>>>>>>>>>> 
>>>>>>>>>>> My reasoning for having a spark datasource and not using Impala is: 
>>>>>>>>>>> 1. We have had trouble getting impala to run fast with high 
>>>>>>>>>>> concurrency when compared to spark. 2. We interact with datasources 
>>>>>>>>>>> which do not integrate with impala. 3. We have custom sql query 
>>>>>>>>>>> planners for extended sql functionality.
>>>>>>>>>>> 
>>>>>>>>>>> -Chris George
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> You guys make a convincing point, although on the upsert side we'll 
>>>>>>>>>>> need more support from the servers. Right now all you can do is an 
>>>>>>>>>>> INSERT then, if you get a dup key, do an UPDATE. I guess we could 
>>>>>>>>>>> at least add an API on the client side that would manage it, but it 
>>>>>>>>>>> wouldn't be atomic.
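>>>>>>>>>>> 
>>>>>>>>>>> Roughly, such a client-side helper might look like this (a 
>>>>>>>>>>> hypothetical sketch against the Java API; "my_table", "my_id", and 
>>>>>>>>>>> "val" are placeholder names, error inspection is simplified, and 
>>>>>>>>>>> as noted the insert/update pair is not atomic):
>>>>>>>>>>> 
>>>>>>>>>>> import org.kududb.client._
>>>>>>>>>>> 
>>>>>>>>>>> val client = new KuduClient.KuduClientBuilder("kudu_master").build()
>>>>>>>>>>> val table = client.openTable("my_table")
>>>>>>>>>>> val session = client.newSession()
>>>>>>>>>>> 
>>>>>>>>>>> // Try the INSERT first.
>>>>>>>>>>> val insert = table.newInsert()
>>>>>>>>>>> insert.getRow.addLong("my_id", 42L)
>>>>>>>>>>> insert.getRow.addString("val", "new value")
>>>>>>>>>>> val resp = session.apply(insert)
>>>>>>>>>>> 
>>>>>>>>>>> // On a row error (assumed here to mean the key already exists),
>>>>>>>>>>> // fall back to an UPDATE of the same row.
>>>>>>>>>>> if (resp != null && resp.hasRowError) {
>>>>>>>>>>>   val update = table.newUpdate()
>>>>>>>>>>>   update.getRow.addLong("my_id", 42L)
>>>>>>>>>>>   update.getRow.addString("val", "new value")
>>>>>>>>>>>   session.apply(update)
>>>>>>>>>>> }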
>>>>>>>>>>> 
>>>>>>>>>>> J-D
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>>>>>>> It's pretty simple, actually.  I need to support versioned datasets 
>>>>>>>>>>> in a Spark SQL environment.  Instead of a hack on top of a Parquet 
>>>>>>>>>>> data store, I'm hoping (among other reasons) to be able to use 
>>>>>>>>>>> Kudu's write and timestamp-based read operations to support not 
>>>>>>>>>>> only appending data, but also updating existing data, and even some 
>>>>>>>>>>> schema migration.  The most typical use case is a dataset that is 
>>>>>>>>>>> updated periodically (e.g., weekly or monthly) in which the 
>>>>>>>>>>> preliminary data in the previous window (week or month) is updated 
>>>>>>>>>>> with values that are expected to remain unchanged from then on, and 
>>>>>>>>>>> a new set of preliminary values for the current window need to be 
>>>>>>>>>>> added/appended.
>>>>>>>>>>> 
>>>>>>>>>>> Using Kudu's Java API and developing additional functionality on 
>>>>>>>>>>> top of what Kudu has to offer isn't too much to ask, but the ease 
>>>>>>>>>>> of integration with Spark SQL will gate how quickly we would move 
>>>>>>>>>>> to using Kudu and how seriously we'd look at alternatives before 
>>>>>>>>>>> making that decision. 
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>> Mark,
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for taking some time to reply in this thread, glad it caught 
>>>>>>>>>>> the attention of other folks!
>>>>>>>>>>> 
>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>>>>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>>>>>>>> 
>>>>>>>>>>> I care about insert into Kudu with Spark SQL.  I'm currently 
>>>>>>>>>>> delaying a refactoring of some Spark SQL-oriented insert 
>>>>>>>>>>> functionality while trying to evaluate what to expect from Kudu.  
>>>>>>>>>>> Whether Kudu does a good job supporting inserts with Spark SQL will 
>>>>>>>>>>> be a key consideration as to whether we adopt Kudu.
>>>>>>>>>>> 
>>>>>>>>>>> I'd like to know more about why SparkSQL inserts are necessary for 
>>>>>>>>>>> you. Is it just that you currently do it that way into some 
>>>>>>>>>>> database or parquet, so with minimal refactoring you'd be able to 
>>>>>>>>>>> use Kudu? Would re-writing those SQL lines in Scala and directly 
>>>>>>>>>>> using the Java API's KuduSession be too much work?
>>>>>>>>>>> 
>>>>>>>>>>> Additionally, what do you expect to gain from using Kudu VS your 
>>>>>>>>>>> current solution? If it's not completely clear, I'd love to help 
>>>>>>>>>>> you think through it.
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>> Yup, starting to get a good idea.
>>>>>>>>>>> 
>>>>>>>>>>> What are your DS folks looking for in terms of functionality 
>>>>>>>>>>> related to Spark? A SparkSQL integration that's as fully featured 
>>>>>>>>>>> as Impala's? Do they care about being able to insert into Kudu with 
>>>>>>>>>>> SparkSQL or just being able to query real fast? Anything more 
>>>>>>>>>>> specific to Spark that I'm missing?
>>>>>>>>>>> 
>>>>>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At 
>>>>>>>>>>> Cloudera all our resources are committed to making things happen in 
>>>>>>>>>>> time, and a more fully featured Spark integration isn't in our 
>>>>>>>>>>> plans during that period. I'm really hoping someone in the 
>>>>>>>>>>> community will help with Spark, the same way we got a big 
>>>>>>>>>>> contribution for the Flume sink. 
>>>>>>>>>>> 
>>>>>>>>>>> J-D
>>>>>>>>>>> 
>>>>>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But, 
>>>>>>>>>>> since it’s not “production-ready”, upper management doesn’t want to 
>>>>>>>>>>> fully deploy it yet. They just want to keep an eye on it though. 
>>>>>>>>>>> Kudu was so much simpler and easier to use in every aspect compared 
>>>>>>>>>>> to HBase. Impala was great for the report writers and analysts to 
>>>>>>>>>>> experiment with for the short time it was up. But, once again, the 
>>>>>>>>>>> only blocker was the lack of Spark support for our Data 
>>>>>>>>>>> Developers/Scientists. So, production-level data population won’t 
>>>>>>>>>>> happen until then.
>>>>>>>>>>> 
>>>>>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Ben
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>> J-D,
>>>>>>>>>>>> 
>>>>>>>>>>>> The main thing I hear is that Cassandra is being used as an 
>>>>>>>>>>>> updatable hot data store to ensure that duplicates are taken care 
>>>>>>>>>>>> of and idempotency is maintained. Whether data was directly 
>>>>>>>>>>>> retrieved from Cassandra for analytics, reports, or searches, it 
>>>>>>>>>>>> was not clear what its main use was. Some also just used it for a 
>>>>>>>>>>>> staging area to populate downstream tables in parquet format. The 
>>>>>>>>>>>> last thing I heard was that CQL was terrible, so that rules out 
>>>>>>>>>>>> much use of direct queries against it.
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real 
>>>>>>>>>>>> analytics, just ease of use instead of plainly using the APIs. 
>>>>>>>>>>>> Even then, Kudu should beat it easily on big scans. Same for 
>>>>>>>>>>>> HBase. We've done benchmarks against the latter, not the former.
>>>>>>>>>>>>  
>>>>>>>>>>>> 
>>>>>>>>>>>> As for our company, we have been looking for an updatable data 
>>>>>>>>>>>> store for a long time that can be quickly queried directly either 
>>>>>>>>>>>> using Spark SQL or Impala or some other SQL engine and still 
>>>>>>>>>>>> handle TB or PB of data without performance degradation and many 
>>>>>>>>>>>> configuration headaches. For now, we are using HBase to take on 
>>>>>>>>>>>> this role with Phoenix as a fast way to directly query the data. I 
>>>>>>>>>>>> can see Kudu as the best way to fill this gap easily, especially 
>>>>>>>>>>>> being the closest thing to other relational databases out there in 
>>>>>>>>>>>> familiarity for the many SQL analytics people in our company. The 
>>>>>>>>>>>> other alternative would be to go with AWS Redshift for the same 
>>>>>>>>>>>> reasons, but it would come at a cost, of course. If we went with 
>>>>>>>>>>>> either solution, Kudu or Redshift, it would get rid of the need 
>>>>>>>>>>>> to extract from HBase to parquet tables or export to PostgreSQL to 
>>>>>>>>>>>> support more of the SQL language used by analysts or the 
>>>>>>>>>>>> reporting software we use.
>>>>>>>>>>>> 
>>>>>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off with 
>>>>>>>>>>>> Kudu. Have you folks tried Kudu with Impala yet with those use 
>>>>>>>>>>>> cases?
>>>>>>>>>>>>  
>>>>>>>>>>>> 
>>>>>>>>>>>> I hope this helps.
>>>>>>>>>>>> 
>>>>>>>>>>>> It does, thanks for nice reply.
>>>>>>>>>>>>  
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Ben 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Ha first time I'm hearing about SMACK. Inside Cloudera we like to 
>>>>>>>>>>>>> refer to "Impala + Kudu" as Kimpala, but yeah it's not as sexy. 
>>>>>>>>>>>>> My colleagues who were also there did say that the hype around 
>>>>>>>>>>>>> Spark isn't dying down.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> There's definitely an overlap in the use cases that Cassandra, 
>>>>>>>>>>>>> HBase, and Kudu cater to. I wouldn't go as far as saying that C* 
>>>>>>>>>>>>> is just an interim solution for the use case you describe.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Nothing significant happened in Kudu over the past month, it's a 
>>>>>>>>>>>>> storage engine so things move slowly *smile*. I'd love to see 
>>>>>>>>>>>>> more contributions on the Spark front. I know there's code out 
>>>>>>>>>>>>> there that could be integrated in kudu-spark, it just needs to 
>>>>>>>>>>>>> land in gerrit. I'm sure folks will happily review it.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Do you have relevant experiences you can share? I'd love to learn 
>>>>>>>>>>>>> more about the use cases for which you envision using Kudu as a 
>>>>>>>>>>>>> C* replacement.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> J-D
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> My colleagues recently came back from Strata in San Jose. They 
>>>>>>>>>>>>> told me that everything was about Spark and there is a big buzz 
>>>>>>>>>>>>> about the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I 
>>>>>>>>>>>>> still think that Cassandra is just an interim solution as a 
>>>>>>>>>>>>> low-latency, easily queried data store. I was wondering if 
>>>>>>>>>>>>> anything significant happened in regards to Kudu, especially on 
>>>>>>>>>>>>> the Spark front. Plus, can you come up with your own proposed 
>>>>>>>>>>>>> stack acronym to promote?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Ben,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> AFAIK no one in the dev community committed to any timeline. I 
>>>>>>>>>>>>>> know of one person on the Kudu Slack who's working on a better 
>>>>>>>>>>>>>> RDD, but that's about it.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com> wrote:
>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target 
>>>>>>>>>>>>>> a version of Kudu to begin real testing of Spark against it for 
>>>>>>>>>>>>>> our devs. At least, I can tell them what timeframe to anticipate.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Just curious,
>>>>>>>>>>>>>> Benjamin Kim
>>>>>>>>>>>>>> Data Solutions Architect
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Mobile: +1 818 635 2900
>>>>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |  www.amobee.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's 
>>>>>>>>>>>>>>> needed either.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The kuduRDD is just leveraging the MR input format, ideally 
>>>>>>>>>>>>>>> we'd use scans directly.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of 
>>>>>>>>>>>>>>> pushdown. It's really basic.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The goal was to provide something for others to contribute to. 
>>>>>>>>>>>>>>> We have some basic unit tests that others can easily extend. 
>>>>>>>>>>>>>>> None of us on the team are Spark experts, but we'd be really 
>>>>>>>>>>>>>>> happy to help anyone improve the kudu-spark code.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu 
>>>>>>>>>>>>>>> RDD, kudu DStream) in KUDU-1214. Am I right? Besides shoring up 
>>>>>>>>>>>>>>> more Spark SQL functionality (Dataframes) and doing the 
>>>>>>>>>>>>>>> documentation, what more needs to be done? Optimizations?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I believe that it’s a good place to start using Spark with Kudu 
>>>>>>>>>>>>>>> and comparing it to HBase with Spark (which is not as clean).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this 
>>>>>>>>>>>>>>>> in for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> It's a really simple wrapper, and yes you can use SparkSQL on 
>>>>>>>>>>>>>>>> Kudu, but it will require a lot more work to make it 
>>>>>>>>>>>>>>>> fast/useful.
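>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> For instance, a basic SparkSQL session over Kudu might look 
>>>>>>>>>>>>>>>> something like this (a sketch; the format and option names 
>>>>>>>>>>>>>>>> here follow the kudu-spark connector and may differ slightly 
>>>>>>>>>>>>>>>> in 0.7.0):
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> val df = sqlContext.read
>>>>>>>>>>>>>>>>     .format("org.kududb.spark.kudu")
>>>>>>>>>>>>>>>>     .options(Map("kudu.master" -> "kudu_master", "kudu.table" -> "my_table"))
>>>>>>>>>>>>>>>>     .load()
>>>>>>>>>>>>>>>> df.registerTempTable("my_table_sql")
>>>>>>>>>>>>>>>> sqlContext.sql("SELECT * FROM my_table_sql LIMIT 10").show()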
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>>>>>> I see KUDU-1214 (https://issues.cloudera.org/browse/KUDU-1214) targeted for 
>>>>>>>>>>>>>>>> 0.8.0, but I see no progress on it. When this is complete, 
>>>>>>>>>>>>>>>> will this mean that Spark will be able to work with Kudu both 
>>>>>>>>>>>>>>>> programmatically and as a client via Spark SQL? Or is there 
>>>>>>>>>>>>>>>> more work that needs to be done on the Spark side for it to 
>>>>>>>>>>>>>>>> work?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Ben