Dan, The roadmap is very informative. I am looking forward to the official 1.0 release! It would be so much easier for us to use than HBase in every aspect.
Cheers, Ben > On Jun 17, 2016, at 11:08 AM, Dan Burkert <d...@cloudera.com> wrote: > > Hi Ben, > > To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I do > not think we support that at this point. I haven't looked deeply into it, > but we may hit issues specifying Kudu-specific options (partitioning, column > encoding, etc.). Probably issues that can be worked through eventually, > though. If you are interested in contributing to Kudu, this is an area that > could obviously use improvement! Most or all of our Spark features have been > completely community-driven to date. > > I am assuming that more Spark support, along with the semantic changes below, will > be incorporated into Kudu 0.9.1. > > As a rule we do not release new features in patch releases, but the good news > is that we are releasing regularly, and our next scheduled release is for the > August timeframe (see JD's roadmap > <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E> > email about what we are aiming to include). Also, Cloudera does publish > snapshot versions of the Spark connector here > <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so the > jars are available if you don't mind using snapshots. > > Does anyone know of a better way to make unique primary keys, other than using a UUID > to make every row unique, if there is no unique column (or combination > thereof) to use? > > Not that I know of. In general it's pretty rare to have a dataset without a > natural primary key (even if it's just all of the columns), but in those > cases UUID is a good solution. > > This is what I am using. I know auto-incrementing is coming down the line > (don’t know when), but is there a way to simulate this in Kudu using Spark, > out of curiosity? > > To my knowledge there is no plan to have auto-increment in Kudu.
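The UUID approach discussed above can be sketched in a few lines (Python here for brevity, though the thread's snippets are Scala; the sample rows are made up for illustration):

```python
import uuid

# Hypothetical rows with no natural primary key (exact duplicates possible).
rows = [("alice", "click"), ("alice", "click"), ("bob", "view")]

# Attach a random UUID as a surrogate key. A version-4 UUID carries
# 122 random bits, so collisions are negligible at any realistic scale.
keyed = [(str(uuid.uuid4()), row) for row in rows]

# Even identical rows now have distinct keys.
assert len({k for k, _ in keyed}) == len(rows)
```

The same idea applies per-row in a DataFrame transformation before writing to Kudu.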
> Distributed, consistent, auto-incrementing counters are a difficult problem, > and I don't think there are any known solutions that would be fast enough for > Kudu (happy to be proven wrong, though!). > > - Dan > > > Thanks, > Ben > >> On Jun 14, 2016, at 6:08 PM, Dan Burkert <d...@cloudera.com >> <mailto:d...@cloudera.com>> wrote: >> >> I'm not sure exactly what the semantics will be, but at least one of them >> will be upsert. These modes come from Spark, and they were really designed >> for file-backed storage and not table storage. We may want to do append = >> upsert, and overwrite = truncate + insert. I think that may match the >> normal Spark semantics more closely. >> >> - Dan >> >> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com >> <mailto:bbuil...@gmail.com>> wrote: >> Dan, >> >> Thanks for the information. That would mean both “append” and “overwrite” >> modes would be combined or not needed in the future. >> >> Cheers, >> Ben >> >>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <d...@cloudera.com >>> <mailto:d...@cloudera.com>> wrote: >>> >>> Right now append uses an update Kudu operation, which requires that the row >>> already be present in the table. Overwrite maps to insert. Kudu very >>> recently got upsert support baked in, but it hasn't yet been integrated >>> into the Spark connector. So pretty soon these sharp edges will get a lot >>> better, since upsert is the way to go for most Spark workloads. >>> >>> - Dan >>> >>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuil...@gmail.com >>> <mailto:bbuil...@gmail.com>> wrote: >>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in >>> 64s. I would assume that now I can use the “overwrite” mode on existing >>> data. Now, I have to find answers to these questions. What would happen if >>> I “append” to the data in the Kudu table if the data already exists?
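Dan's proposed mapping (append = upsert, overwrite = truncate + insert) can be sketched with a plain dict standing in for a Kudu table keyed by primary key. This is a sketch of the proposed semantics only, not the connector's actual implementation (Python for brevity; the mode names mirror Spark's save modes):

```python
def write(table, rows, mode):
    """Sketch of the proposed save-mode semantics for a key -> value table."""
    if mode == "overwrite":      # overwrite = truncate + insert
        table.clear()
    elif mode != "append":       # append behaves as upsert
        raise ValueError(f"unsupported mode: {mode}")
    for key, value in rows:
        table[key] = value       # upsert: insert new keys, update existing ones
    return table

table = {}
write(table, [(1, "a"), (2, "b")], "append")
write(table, [(2, "B"), (3, "c")], "append")   # updates key 2, inserts key 3
assert table == {1: "a", 2: "B", 3: "c"}
write(table, [(9, "z")], "overwrite")          # truncate, then insert
assert table == {9: "z"}
```

Under these semantics, appending rows that already exist simply overwrites them, which answers the question of what upsert-style behavior would look like.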
What >>> would happen if I “overwrite” existing data when the DataFrame has data in >>> it that does not exist in the Kudu table? I need to evaluate the best way >>> to simulate the UPSERT behavior in HBase because this is what our use case >>> is. >>> >>> Thanks, >>> Ben >>> >>> >>> >>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmail.com >>>> <mailto:bbuil...@gmail.com>> wrote: >>>> >>>> Hi, >>>> >>>> Now, I’m getting this error when trying to write to the table. >>>> >>>> import scala.collection.JavaConverters._ >>>> val key_seq = Seq("my_id") >>>> val key_list = List("my_id").asJava >>>> kuduContext.createTable(tableName, df.schema, key_seq, new >>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100)) >>>> >>>> df.write >>>> .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName)) >>>> .mode("overwrite") >>>> .kudu >>>> >>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to >>>> Kudu; sample errors: Not found: key not found (error 0) Not found: key not >>>> found (error 0) Not found: key not found (error 0) Not found: key not found >>>> (error 0) Not found: key not found (error 0) >>>> >>>> Does the key field need to be first in the DataFrame? >>>> >>>> Thanks, >>>> Ben >>>> >>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com >>>>> <mailto:d...@cloudera.com>> wrote: >>>>> >>>>> >>>>> >>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com >>>>> <mailto:bbuil...@gmail.com>> wrote: >>>>> Dan, >>>>> >>>>> Thanks! It got further. Now, how do I set the Primary Key to one or more >>>>> columns in the DataFrame and set the partitioning? Is it like this? >>>>> >>>>> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new >>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id")) >>>>> >>>>> java.lang.IllegalArgumentException: Table partitioning must be specified >>>>> using setRangePartitionColumns or addHashPartitions >>>>> >>>>> Yep.
The `Seq("my_id")` part of that call is specifying the set of >>>>> primary key columns, so in this case you have specified the single PK >>>>> column "my_id". The `addHashPartitions` call adds hash partitioning to >>>>> the table, in this case over the column "my_id" (which is good, it must >>>>> be over one or more PK columns, so in this case "my_id" is the one and >>>>> only valid combination). However, the call to `addHashPartitions` also >>>>> takes the number of buckets as the second param. You shouldn't get the >>>>> IllegalArgumentException as long as you are specifying either >>>>> `addHashPartitions` or `setRangePartitionColumns`. >>>>> >>>>> - Dan >>>>> >>>>> >>>>> Thanks, >>>>> Ben >>>>> >>>>> >>>>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com >>>>>> <mailto:d...@cloudera.com>> wrote: >>>>>> >>>>>> Looks like we're missing an import statement in that example. Could you >>>>>> try: >>>>>> >>>>>> import org.kududb.client._ >>>>>> >>>>>> and try again? >>>>>> >>>>>> - Dan >>>>>> >>>>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com >>>>>> <mailto:bbuil...@gmail.com>> wrote: >>>>>> I encountered an error trying to create a table from a DataFrame based on the >>>>>> documentation. >>>>>> >>>>>> <console>:49: error: not found: type CreateTableOptions >>>>>> kuduContext.createTable(tableName, df.schema, Seq("key"), >>>>>> new CreateTableOptions().setNumReplicas(1)) >>>>>> >>>>>> Is there something I’m missing?
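The bucket count passed as the second argument to `addHashPartitions` determines how rows spread across tablets. The usual scheme (a hash of the primary-key value modulo the bucket count) can be sketched as below; note this uses MD5 purely for illustration, not Kudu's actual internal hash function, and the key values are made up:

```python
import hashlib

def hash_bucket(key, num_buckets):
    """Assign a primary-key value to one of num_buckets hash partitions.
    Illustrative only: Kudu uses its own internal hash, not MD5."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

# 100 buckets, as in the createTable call above.
buckets = [hash_bucket(f"my_id_{i}", 100) for i in range(10_000)]

# Every row deterministically lands in one of the 100 buckets...
assert all(0 <= b < 100 for b in buckets)
# ...and a decent hash spreads rows across nearly all of them.
assert len(set(buckets)) > 90
```

Because the assignment is deterministic, reads and writes for a given key always go to the same bucket (and hence the same tablet).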
>>>>>> >>>>>> Thanks, >>>>>> Ben >>>>>> >>>>>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org >>>>>>> <mailto:jdcry...@apache.org>> wrote: >>>>>>> >>>>>>> It's only in Cloudera's maven repo: >>>>>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/ >>>>>>> >>>>>>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/> >>>>>>> >>>>>>> J-D >>>>>>> >>>>>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com >>>>>>> <mailto:bbuil...@gmail.com>> wrote: >>>>>>> Hi J-D, >>>>>>> >>>>>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar >>>>>>> for spark-shell to use. Can you show me where to find it? >>>>>>> >>>>>>> Thanks, >>>>>>> Ben >>>>>>> >>>>>>> >>>>>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org >>>>>>>> <mailto:jdcry...@apache.org>> wrote: >>>>>>>> >>>>>>>> What's in this doc is what's gonna get released: >>>>>>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark >>>>>>>> >>>>>>>> <https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark> >>>>>>>> >>>>>>>> J-D >>>>>>>> >>>>>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com >>>>>>>> <mailto:bbuil...@gmail.com>> wrote: >>>>>>>> Will this be documented with examples once 0.9.0 comes out? >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Ben >>>>>>>> >>>>>>>> >>>>>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcry...@apache.org >>>>>>>>> <mailto:jdcry...@apache.org>> wrote: >>>>>>>>> >>>>>>>>> It will be in 0.9.0. >>>>>>>>> >>>>>>>>> J-D >>>>>>>>> >>>>>>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com >>>>>>>>> <mailto:bbuil...@gmail.com>> wrote: >>>>>>>>> Hi Chris, >>>>>>>>> >>>>>>>>> Will all this effort be rolled into 0.9.0 and be ready for use? 
>>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Ben >>>>>>>>> >>>>>>>>> >>>>>>>>>> On May 18, 2016, at 9:01 AM, Chris George >>>>>>>>>> <christopher.geo...@rms.com <mailto:christopher.geo...@rms.com>> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> There is some code in review that needs some more refinement. >>>>>>>>>> It will allow upsert/insert from a dataframe using the datasource >>>>>>>>>> API. It will also allow the creation and deletion of tables from a >>>>>>>>>> dataframe: >>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/ >>>>>>>>>> <http://gerrit.cloudera.org:8080/#/c/2992/> >>>>>>>>>> >>>>>>>>>> Example usages will look something like: >>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc >>>>>>>>>> <http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc> >>>>>>>>>> >>>>>>>>>> -Chris George >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com >>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote: >>>>>>>>>> >>>>>>>>>> Can someone tell me what the state is of this Spark work? >>>>>>>>>> >>>>>>>>>> Also, does anyone have any sample code on how to update/insert data >>>>>>>>>> in Kudu using DataFrames? >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Ben >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George >>>>>>>>>>> <christopher.geo...@rms.com <mailto:christopher.geo...@rms.com>> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> SparkSQL cannot support these types of statements, but we may be able >>>>>>>>>>> to implement similar functionality through the API. >>>>>>>>>>> -Chris >>>>>>>>>>> >>>>>>>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com >>>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote: >>>>>>>>>>> >>>>>>>>>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” >>>>>>>>>>> if it were to be implemented.
>>>>>>>>>>> >>>>>>>>>>> MERGE INTO table_name USING table_reference ON (condition) >>>>>>>>>>> WHEN MATCHED THEN >>>>>>>>>>> UPDATE SET column1 = value1 [, column2 = value2 ...] >>>>>>>>>>> WHEN NOT MATCHED THEN >>>>>>>>>>> INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …]) >>>>>>>>>>> >>>>>>>>>>> Cheers, >>>>>>>>>>> Ben >>>>>>>>>>> >>>>>>>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George >>>>>>>>>>>> <christopher.geo...@rms.com <mailto:christopher.geo...@rms.com>> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> I have a WIP kuduRDD that I made a few months ago. I pushed it >>>>>>>>>>>> into gerrit if you want to take a look. >>>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/ >>>>>>>>>>>> <http://gerrit.cloudera.org:8080/#/c/2754/> >>>>>>>>>>>> It pushes down predicates, which the existing input-format-based >>>>>>>>>>>> RDD does not. >>>>>>>>>>>> >>>>>>>>>>>> Within the next two weeks I’m planning to implement a datasource >>>>>>>>>>>> for Spark that will have pushdown predicates and insertion/update >>>>>>>>>>>> functionality (I need to look more at the Cassandra and HBase >>>>>>>>>>>> datasources for the best way to do this). I agree that server-side >>>>>>>>>>>> upsert would be helpful. >>>>>>>>>>>> Having a datasource would give us useful data frames and also make >>>>>>>>>>>> Spark SQL usable for Kudu. >>>>>>>>>>>> >>>>>>>>>>>> My reasoning for having a Spark datasource and not using Impala >>>>>>>>>>>> is: 1. We have had trouble getting Impala to run fast with high >>>>>>>>>>>> concurrency when compared to Spark. 2. We interact with datasources >>>>>>>>>>>> which do not integrate with Impala. 3. We have custom SQL query >>>>>>>>>>>> planners for extended SQL functionality.
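The SQL:2003 MERGE Ben quotes boils down to: for each source row, update the matched columns when the key exists, and insert the whole row otherwise. A sketch of those semantics with plain dicts standing in for the target table and source (Python for brevity; the column names are made up):

```python
def merge_into(target, source):
    """Sketch of SQL:2003 MERGE semantics over key -> column-dict rows.
    WHEN MATCHED: update only the columns present in the source row.
    WHEN NOT MATCHED: insert the source row as-is."""
    for key, cols in source.items():
        if key in target:
            target[key].update(cols)   # WHEN MATCHED THEN UPDATE SET ...
        else:
            target[key] = dict(cols)   # WHEN NOT MATCHED THEN INSERT ...
    return target

target = {1: {"name": "ben", "city": "LA"}}
source = {1: {"city": "SM"}, 2: {"name": "dan", "city": "SF"}}
merge_into(target, source)
assert target == {1: {"name": "ben", "city": "SM"},
                  2: {"name": "dan", "city": "SF"}}
```

Note that MERGE is more general than a plain upsert: a matched row keeps columns the source does not mention.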
>>>>>>>>>>>> >>>>>>>>>>>> -Chris George >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org >>>>>>>>>>>> <mailto:jdcry...@apache.org>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> You guys make a convincing point, although on the upsert side >>>>>>>>>>>> we'll need more support from the servers. Right now all you can do >>>>>>>>>>>> is an INSERT and then, if you get a dup key, do an UPDATE. I guess we >>>>>>>>>>>> could at least add an API on the client side that would manage it, >>>>>>>>>>>> but it wouldn't be atomic. >>>>>>>>>>>> >>>>>>>>>>>> J-D >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra >>>>>>>>>>>> <m...@clearstorydata.com <mailto:m...@clearstorydata.com>> wrote: >>>>>>>>>>>> It's pretty simple, actually. I need to support versioned >>>>>>>>>>>> datasets in a Spark SQL environment. Instead of a hack on top of >>>>>>>>>>>> a Parquet data store, I'm hoping (among other reasons) to be able >>>>>>>>>>>> to use Kudu's write and timestamp-based read operations to support >>>>>>>>>>>> not only appending data, but also updating existing data, and even >>>>>>>>>>>> some schema migration. The most typical use case is a dataset >>>>>>>>>>>> that is updated periodically (e.g., weekly or monthly) in which >>>>>>>>>>>> the preliminary data in the previous window (week or month) is >>>>>>>>>>>> updated with values that are expected to remain unchanged from >>>>>>>>>>>> then on, and a new set of preliminary values for the current >>>>>>>>>>>> window needs to be added/appended. >>>>>>>>>>>> >>>>>>>>>>>> Using Kudu's Java API and developing additional functionality on >>>>>>>>>>>> top of what Kudu has to offer isn't too much to ask, but the ease >>>>>>>>>>>> of integration with Spark SQL will gate how quickly we would move >>>>>>>>>>>> to using Kudu and how seriously we'd look at alternatives before >>>>>>>>>>>> making that decision.
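J-D's client-side fallback (try INSERT, and on a duplicate-key error retry as UPDATE) can be sketched as follows. The table and error type are stand-ins, not Kudu API classes, and, as he notes, the two steps are not atomic: another writer can slip in between them.

```python
class DuplicateKeyError(Exception):
    """Stand-in for the 'key already present' error a server would return."""

def insert(table, key, row):
    if key in table:
        raise DuplicateKeyError(key)
    table[key] = row

def update(table, key, row):
    if key not in table:
        raise KeyError(key)
    table[key] = row

def client_side_upsert(table, key, row):
    """Non-atomic upsert emulation: INSERT, then UPDATE on duplicate key."""
    try:
        insert(table, key, row)
    except DuplicateKeyError:
        # Race window here: a concurrent delete between the failed insert
        # and this update would make the update fail too.
        update(table, key, row)

table = {}
client_side_upsert(table, 1, "a")   # key absent -> inserts
client_side_upsert(table, 1, "b")   # duplicate key -> falls back to update
assert table == {1: "b"}
```

This is why a true server-side upsert, applied as a single operation per row, is the cleaner answer.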
>>>>>>>>>>>> >>>>>>>>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans >>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote: >>>>>>>>>>>> Mark, >>>>>>>>>>>> >>>>>>>>>>>> Thanks for taking some time to reply in this thread, glad it >>>>>>>>>>>> caught the attention of other folks! >>>>>>>>>>>> >>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra >>>>>>>>>>>> <m...@clearstorydata.com <mailto:m...@clearstorydata.com>> wrote: >>>>>>>>>>>> Do they care about being able to insert into Kudu with SparkSQL >>>>>>>>>>>> >>>>>>>>>>>> I care about inserting into Kudu with Spark SQL. I'm currently >>>>>>>>>>>> delaying a refactoring of some Spark SQL-oriented insert >>>>>>>>>>>> functionality while trying to evaluate what to expect from Kudu. >>>>>>>>>>>> Whether Kudu does a good job supporting inserts with Spark SQL >>>>>>>>>>>> will be a key consideration as to whether we adopt Kudu. >>>>>>>>>>>> >>>>>>>>>>>> I'd like to know more about why SparkSQL inserts are necessary for >>>>>>>>>>>> you. Is it just that you currently do it that way into some >>>>>>>>>>>> database or parquet, so with minimal refactoring you'd be able to >>>>>>>>>>>> use Kudu? Would re-writing those SQL lines into Scala and directly >>>>>>>>>>>> using the Java API's KuduSession be too much work? >>>>>>>>>>>> >>>>>>>>>>>> Additionally, what do you expect to gain from using Kudu vs. your >>>>>>>>>>>> current solution? If it's not completely clear, I'd love to help >>>>>>>>>>>> you think through it. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans >>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote: >>>>>>>>>>>> Yup, starting to get a good idea. >>>>>>>>>>>> >>>>>>>>>>>> What are your DS folks looking for in terms of functionality >>>>>>>>>>>> related to Spark? A SparkSQL integration that's as fully featured >>>>>>>>>>>> as Impala's?
Do they care about being able to insert into Kudu with >>>>>>>>>>>> SparkSQL, or just about being able to query real fast? Anything more >>>>>>>>>>>> specific to Spark that I'm missing? >>>>>>>>>>>> >>>>>>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At >>>>>>>>>>>> Cloudera all our resources are committed to making things happen >>>>>>>>>>>> in time, and a more fully featured Spark integration isn't in our >>>>>>>>>>>> plans during that period. I'm really hoping someone in the >>>>>>>>>>>> community will help with Spark, the same way we got a big >>>>>>>>>>>> contribution for the Flume sink. >>>>>>>>>>>> >>>>>>>>>>>> J-D >>>>>>>>>>>> >>>>>>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com >>>>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote: >>>>>>>>>>>> Yes, we took Kudu for a test run using the 0.6 and 0.7 versions. But, >>>>>>>>>>>> since it’s not “production-ready”, upper management doesn’t want >>>>>>>>>>>> to fully deploy it yet. They just want to keep an eye on it >>>>>>>>>>>> though. Kudu was so much simpler and easier to use in every aspect >>>>>>>>>>>> compared to HBase. Impala was great for the report writers and >>>>>>>>>>>> analysts to experiment with for the short time it was up. But, >>>>>>>>>>>> once again, the only blocker was the lack of Spark support for our >>>>>>>>>>>> Data Developers/Scientists. So, production-level data population >>>>>>>>>>>> won’t happen until then.
>>>>>>>>>>>> >>>>>>>>>>>> I hope this helps you get an idea where I am coming from… >>>>>>>>>>>> >>>>>>>>>>>> Cheers, >>>>>>>>>>>> Ben >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans >>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim >>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>> wrote: >>>>>>>>>>>>> J-D, >>>>>>>>>>>>> >>>>>>>>>>>>> The main thing I hear is that Cassandra is being used as an >>>>>>>>>>>>> updatable hot data store to ensure that duplicates are taken care >>>>>>>>>>>>> of and idempotency is maintained. It was not clear whether data was >>>>>>>>>>>>> directly retrieved from Cassandra for analytics, reports, or searches, >>>>>>>>>>>>> or what its main use was. Some also just used it >>>>>>>>>>>>> as a staging area to populate downstream tables in parquet >>>>>>>>>>>>> format. The last thing I heard was that CQL was terrible, so that >>>>>>>>>>>>> rules out much use of direct queries against it. >>>>>>>>>>>>> >>>>>>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real >>>>>>>>>>>>> analytics, just ease of use instead of plainly using the APIs. >>>>>>>>>>>>> Even then, Kudu should beat it easily on big scans. Same for >>>>>>>>>>>>> HBase. We've done benchmarks against the latter, not the former. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> As for our company, we have been looking for an updatable data >>>>>>>>>>>>> store for a long time that can be quickly queried directly, either >>>>>>>>>>>>> using Spark SQL or Impala or some other SQL engine, and still >>>>>>>>>>>>> handle TB or PB of data without performance degradation or many >>>>>>>>>>>>> configuration headaches. For now, we are using HBase to take on >>>>>>>>>>>>> this role with Phoenix as a fast way to directly query the data.
>>>>>>>>>>>>> I can see Kudu as the best way to fill this gap easily, >>>>>>>>>>>>> especially being the closest thing to other relational databases >>>>>>>>>>>>> out there in familiarity for the many SQL analytics people in our >>>>>>>>>>>>> company. The other alternative would be to go with AWS Redshift >>>>>>>>>>>>> for the same reasons, but it would come at a cost, of course. If >>>>>>>>>>>>> we went with either solution, Kudu or Redshift, it would get rid >>>>>>>>>>>>> of the need to extract from HBase to parquet tables or export to >>>>>>>>>>>>> PostgreSQL to support more of the SQL language used by analysts >>>>>>>>>>>>> or by the reporting software we use. >>>>>>>>>>>>> >>>>>>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off with >>>>>>>>>>>>> Kudu. Have you folks tried Kudu with Impala yet with those use >>>>>>>>>>>>> cases? >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I hope this helps. >>>>>>>>>>>>> >>>>>>>>>>>>> It does, thanks for the nice reply. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Cheers, >>>>>>>>>>>>> Ben >>>>>>>>>>>>> >>>>>>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans >>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Ha, first time I'm hearing about SMACK. Inside Cloudera we like >>>>>>>>>>>>>> to refer to "Impala + Kudu" as Kimpala, but yeah it's not as >>>>>>>>>>>>>> sexy. My colleagues who were also there did say that the hype >>>>>>>>>>>>>> around Spark isn't dying down. >>>>>>>>>>>>>> >>>>>>>>>>>>>> There's definitely an overlap in the use cases that Cassandra, >>>>>>>>>>>>>> HBase, and Kudu cater to. I wouldn't go as far as saying that C* >>>>>>>>>>>>>> is just an interim solution for the use case you describe. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Nothing significant happened in Kudu over the past month, it's a >>>>>>>>>>>>>> storage engine so things move slowly *smile*. I'd love to see >>>>>>>>>>>>>> more contributions on the Spark front.
I know there's code out >>>>>>>>>>>>>> there that could be integrated in kudu-spark, it just needs to >>>>>>>>>>>>>> land in gerrit. I'm sure folks will happily review it. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Do you have relevant experiences you can share? I'd love to >>>>>>>>>>>>>> learn more about the use cases for which you envision using Kudu >>>>>>>>>>>>>> as a C* replacement. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> >>>>>>>>>>>>>> J-D >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim >>>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>> wrote: >>>>>>>>>>>>>> Hi J-D, >>>>>>>>>>>>>> >>>>>>>>>>>>>> My colleagues recently came back from Strata in San Jose. They >>>>>>>>>>>>>> told me that everything was about Spark and there is a big buzz >>>>>>>>>>>>>> about the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I >>>>>>>>>>>>>> still think that Cassandra is just an interim solution as a >>>>>>>>>>>>>> low-latency, easily queried data store. I was wondering if >>>>>>>>>>>>>> anything significant happened in regards to Kudu, especially on >>>>>>>>>>>>>> the Spark front. Plus, can you come up with your own proposed >>>>>>>>>>>>>> stack acronym to promote? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>> Ben >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans >>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Ben, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> AFAIK no one in the dev community committed to any timeline. I >>>>>>>>>>>>>>> know of one person on the Kudu Slack who's working on a better >>>>>>>>>>>>>>> RDD, but that's about it. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> J-D >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com >>>>>>>>>>>>>>> <mailto:b...@amobee.com>> wrote: >>>>>>>>>>>>>>> Hi J-D, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target >>>>>>>>>>>>>>> a version of Kudu to begin real testing of Spark against it for >>>>>>>>>>>>>>> our devs. At least, I can tell them what timeframe to >>>>>>>>>>>>>>> anticipate. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Just curious, >>>>>>>>>>>>>>> Benjamin Kim >>>>>>>>>>>>>>> Data Solutions Architect >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [a•mo•bee] (n.) the company defining digital marketing. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Mobile: +1 818 635 2900 <tel:%2B1%20818%20635%202900> >>>>>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 | >>>>>>>>>>>>>>> www.amobee.com <http://www.amobee.com/> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans >>>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's >>>>>>>>>>>>>>>> needed either. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The kuduRDD is just leveraging the MR input format; ideally >>>>>>>>>>>>>>>> we'd use scans directly. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of >>>>>>>>>>>>>>>> pushdown. It's really basic. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The goal was to provide something for others to contribute to. >>>>>>>>>>>>>>>> We have some basic unit tests that others can easily extend. >>>>>>>>>>>>>>>> None of us on the team are Spark experts, but we'd be really >>>>>>>>>>>>>>>> happy to assist anyone improving the kudu-spark code.
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> J-D >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim >>>>>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>> wrote: >>>>>>>>>>>>>>>> J-D, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu >>>>>>>>>>>>>>>> RDD, kudu DStream) in KUDU-1214. Am I right? Besides shoring >>>>>>>>>>>>>>>> up more Spark SQL functionality (Dataframes) and doing the >>>>>>>>>>>>>>>> documentation, what more needs to be done? Optimizations? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I believe that it’s a good place to start using Spark with >>>>>>>>>>>>>>>> Kudu and compare it to HBase with Spark (not clean). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> Ben >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans >>>>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this >>>>>>>>>>>>>>>>> in for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321 >>>>>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1321> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> It's a really simple wrapper, and yes you can use SparkSQL on >>>>>>>>>>>>>>>>> Kudu, but it will require a lot more work to make it >>>>>>>>>>>>>>>>> fast/useful. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hope this helps, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> J-D >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim >>>>>>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>> wrote: >>>>>>>>>>>>>>>>> I see this KUDU-1214 >>>>>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted for >>>>>>>>>>>>>>>>> 0.8.0, but I see no progress on it. When this is complete, >>>>>>>>>>>>>>>>> will this mean that Spark will be able to work with Kudu both >>>>>>>>>>>>>>>>> programmatically and as a client via Spark SQL? 
Or is there >>>>>>>>>>>>>>>>> more work that needs to be done on the Spark side for it to >>>>>>>>>>>>>>>>> work? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Just curious. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>>>> Ben >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> >>> >>> >> >> > >