Ah, that makes more sense when you put it that way.

Thanks,
Ben
> On Jun 14, 2016, at 6:08 PM, Dan Burkert <d...@cloudera.com> wrote:
>
> I'm not sure exactly what the semantics will be, but at least one of them will be upsert. These modes come from Spark, and they were really designed for file-backed storage, not table storage. We may want to do append = upsert, and overwrite = truncate + insert. I think that may match the normal Spark semantics more closely.
>
> - Dan
>
> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Dan,
>
> Thanks for the information. That would mean both "append" and "overwrite" modes would be combined or not needed in the future.
>
> Cheers,
> Ben
>
>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <d...@cloudera.com> wrote:
>>
>> Right now append uses an update Kudu operation, which requires the row already be present in the table. Overwrite maps to insert. Kudu very recently got upsert support baked in, but it hasn't yet been integrated into the Spark connector. So pretty soon these sharp edges will get a lot better, since upsert is the way to go for most Spark workloads.
>>
>> - Dan
>>
>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> I tried to use the "append" mode, and it worked. Over 3.8 million rows in 64s. I would assume that now I can use the "overwrite" mode on existing data. Now, I have to find answers to these questions. What would happen if I "append" to the data in the Kudu table if the data already exists? What would happen if I "overwrite" existing data when the DataFrame has data in it that does not exist in the Kudu table? I need to evaluate the best way to simulate the UPSERT behavior in HBase, because this is what our use case is.
>>
>> Thanks,
>> Ben
>>
>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Now, I'm getting this error when trying to write to the table.
>>>
>>> import scala.collection.JavaConverters._
>>> val key_seq = Seq("my_id")
>>> val key_list = List("my_id").asJava
>>> kuduContext.createTable(tableName, df.schema, key_seq, new CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>>
>>> df.write
>>>   .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
>>>   .mode("overwrite")
>>>   .kudu
>>>
>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to Kudu; sample errors: Not found: key not found (error 0) Not found: key not found (error 0) Not found: key not found (error 0) Not found: key not found (error 0) Not found: key not found (error 0)
>>>
>>> Does the key field need to be first in the DataFrame?
>>>
>>> Thanks,
>>> Ben
>>>
>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com> wrote:
>>>>
>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>> Dan,
>>>>
>>>> Thanks! It got further. Now, how do I set the Primary Key to be a column(s) in the DataFrame and set the partitioning? Is it like this?
>>>>
>>>> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>>>>
>>>> java.lang.IllegalArgumentException: Table partitioning must be specified using setRangePartitionColumns or addHashPartitions
>>>>
>>>> Yep. The `Seq("my_id")` part of that call specifies the set of primary key columns, so in this case you have specified the single PK column "my_id". The `addHashPartitions` call adds hash partitioning to the table, in this case over the column "my_id" (which is good; it must be over one or more PK columns, so in this case "my_id" is the one and only valid combination). However, `addHashPartitions` also takes the number of buckets as the second param. You shouldn't get the IllegalArgumentException as long as you are specifying either `addHashPartitions` or `setRangePartitionColumns`.
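>>>> Putting that together, a minimal sketch of the corrected call (untested; note that `addHashPartitions` takes a java.util.List, hence the `asJava` conversion, and the second argument is the bucket count):
>>>>
>>>> import scala.collection.JavaConverters._
>>>> import org.kududb.client._
>>>>
>>>> kuduContext.createTable(tableName, df.schema, Seq("my_id"),
>>>>   new CreateTableOptions()
>>>>     .setNumReplicas(1)
>>>>     .addHashPartitions(List("my_id").asJava, 100))  // 100 hash buckets over the PK column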
>>>> - Dan
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com> wrote:
>>>>>
>>>>> Looks like we're missing an import statement in that example. Could you try:
>>>>>
>>>>> import org.kududb.client._
>>>>>
>>>>> and try again?
>>>>>
>>>>> - Dan
>>>>>
>>>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>> I encountered an error trying to create a table from a DataFrame, based on the documentation.
>>>>>
>>>>> <console>:49: error: not found: type CreateTableOptions
>>>>>        kuduContext.createTable(tableName, df.schema, Seq("key"), new CreateTableOptions().setNumReplicas(1))
>>>>>
>>>>> Is there something I'm missing?
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>
>>>>>> It's only in Cloudera's Maven repo: https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>> Hi J-D,
>>>>>>
>>>>>> I installed Kudu 0.9.0 using CM, but I can't find the kudu-spark jar for spark-shell to use. Can you show me where to find it?
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>
>>>>>>> What's in this doc is what's gonna get released: https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>> Will this be documented with examples once 0.9.0 comes out?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>>
>>>>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>
>>>>>>>> It will be in 0.9.0.
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>> Hi Chris,
>>>>>>>>
>>>>>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ben
>>>>>>>>
>>>>>>>>> On May 18, 2016, at 9:01 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>>>>>>>
>>>>>>>>> There is some code in review that needs some more refinement. It will allow upsert/insert from a DataFrame using the datasource API. It will also allow the creation and deletion of tables from a DataFrame.
>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/
>>>>>>>>>
>>>>>>>>> Example usages will look something like: http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
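>>>>>>>>> Something along these lines (a rough sketch of the intended usage, assuming `kuduMaster` and `tableName` are already defined; the exact package name and the mode-to-operation mapping may still shift while the patch is in review):
>>>>>>>>>
>>>>>>>>> import org.kududb.spark.kudu._  // brings the .kudu implicits into scope
>>>>>>>>>
>>>>>>>>> df.write
>>>>>>>>>   .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
>>>>>>>>>   .mode("append")  // how each SaveMode maps onto Kudu operations is still being settled
>>>>>>>>>   .kudu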
>>>>>>>>> -Chris George
>>>>>>>>>
>>>>>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Can someone tell me what the state is of this Spark work?
>>>>>>>>>
>>>>>>>>> Also, does anyone have any sample code on how to update/insert data in Kudu using DataFrames?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>>>>>>>>
>>>>>>>>>> SparkSQL cannot support these types of statements, but we may be able to implement similar functionality through the API.
>>>>>>>>>> -Chris
>>>>>>>>>>
>>>>>>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> It would be nice to adhere to the SQL:2003 standard for an "upsert" if it were to be implemented.
>>>>>>>>>>
>>>>>>>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>>>>>>>   WHEN MATCHED THEN
>>>>>>>>>>     UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>>>>>>>   WHEN NOT MATCHED THEN
>>>>>>>>>>     INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Ben
>>>>>>>>>>
>>>>>>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I have a WIP kuduRDD that I made a few months ago. I pushed it into Gerrit if you want to take a look: http://gerrit.cloudera.org:8080/#/c/2754/
>>>>>>>>>>> It does predicate pushdown, which the existing input-format-based RDD does not.
>>>>>>>>>>>
>>>>>>>>>>> Within the next two weeks I'm planning to implement a datasource for Spark that will have pushdown predicates and insertion/update functionality (I need to look more at the Cassandra and HBase datasources for the best way to do this). I agree that server-side upsert would be helpful.
>>>>>>>>>>> Having a datasource would give us useful DataFrames and also make Spark SQL usable for Kudu.
>>>>>>>>>>>
>>>>>>>>>>> My reasoning for having a Spark datasource and not using Impala is:
>>>>>>>>>>> 1. We have had trouble getting Impala to run fast with high concurrency when compared to Spark.
>>>>>>>>>>> 2. We interact with datasources which do not integrate with Impala.
>>>>>>>>>>> 3. We have custom SQL query planners for extended SQL functionality.
>>>>>>>>>>>
>>>>>>>>>>> -Chris George
>>>>>>>>>>>
>>>>>>>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> You guys make a convincing point, although on the upsert side we'll need more support from the servers. Right now all you can do is an INSERT and then, if you get a dup key, do an UPDATE. I guess we could at least add an API on the client side that would manage it, but it wouldn't be atomic.
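>>>>>>>>>>> Sketching that client-side fallback with the Java API from Scala (untested; the error-inspection calls may differ between client versions, and "my_id"/"value" are stand-in column names):
>>>>>>>>>>>
>>>>>>>>>>> import org.kududb.client._
>>>>>>>>>>>
>>>>>>>>>>> val client = new KuduClient.KuduClientBuilder(kuduMaster).build()
>>>>>>>>>>> val table = client.openTable(tableName)
>>>>>>>>>>> val session = client.newSession()  // default AUTO_FLUSH_SYNC, so apply() returns a response
>>>>>>>>>>>
>>>>>>>>>>> // Try the INSERT first.
>>>>>>>>>>> val insert = table.newInsert()
>>>>>>>>>>> insert.getRow.addString("my_id", "row1")
>>>>>>>>>>> insert.getRow.addLong("value", 42L)
>>>>>>>>>>> val resp = session.apply(insert)
>>>>>>>>>>>
>>>>>>>>>>> // On a duplicate key, fall back to an UPDATE. Not atomic: another
>>>>>>>>>>> // writer can touch the row between the two operations.
>>>>>>>>>>> if (resp.hasRowError && resp.getRowError.getErrorStatus.isAlreadyPresent) {
>>>>>>>>>>>   val update = table.newUpdate()
>>>>>>>>>>>   update.getRow.addString("my_id", "row1")
>>>>>>>>>>>   update.getRow.addLong("value", 42L)
>>>>>>>>>>>   session.apply(update)
>>>>>>>>>>> }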
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>>>>>>> It's pretty simple, actually. I need to support versioned datasets in a Spark SQL environment. Instead of a hack on top of a Parquet data store, I'm hoping (among other reasons) to be able to use Kudu's write and timestamp-based read operations to support not only appending data, but also updating existing data, and even some schema migration. The most typical use case is a dataset that is updated periodically (e.g., weekly or monthly) in which the preliminary data in the previous window (week or month) is updated with values that are expected to remain unchanged from then on, and a new set of preliminary values for the current window needs to be added/appended.
>>>>>>>>>>>
>>>>>>>>>>> Using Kudu's Java API and developing additional functionality on top of what Kudu has to offer isn't too much to ask, but the ease of integration with Spark SQL will gate how quickly we would move to using Kudu and how seriously we'd look at alternatives before making that decision.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>> Mark,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for taking some time to reply in this thread, glad it caught the attention of other folks!
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>>>>>>> Do they care about being able to insert into Kudu with SparkSQL?
>>>>>>>>>>>
>>>>>>>>>>> I care about insert into Kudu with Spark SQL. I'm currently delaying a refactoring of some Spark SQL-oriented insert functionality while trying to evaluate what to expect from Kudu. Whether Kudu does a good job supporting inserts with Spark SQL will be a key consideration as to whether we adopt Kudu.
>>>>>>>>>>>
>>>>>>>>>>> I'd like to know more about why SparkSQL insert is necessary for you. Is it just that you currently do it that way into some database or Parquet, so with minimal refactoring you'd be able to use Kudu? Would re-writing those SQL lines into Scala and directly using the Java API's KuduSession be too much work?
>>>>>>>>>>>
>>>>>>>>>>> Additionally, what do you expect to gain from using Kudu vs. your current solution? If it's not completely clear, I'd love to help you think through it.
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>> Yup, starting to get a good idea.
>>>>>>>>>>>
>>>>>>>>>>> What are your DS folks looking for in terms of functionality related to Spark? A SparkSQL integration that's as fully featured as Impala's? Do they care about being able to insert into Kudu with SparkSQL, or just being able to query real fast?
>>>>>>>>>>> Anything more specific to Spark that I'm missing?
>>>>>>>>>>>
>>>>>>>>>>> FWIW the plan is to get to 1.0 in late summer/early fall. At Cloudera all our resources are committed to making things happen in time, and a more fully featured Spark integration isn't in our plans during that period. I'm really hoping someone in the community will help with Spark, the same way we got a big contribution for the Flume sink.
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>> Yes, we took Kudu for a test run using the 0.6 and 0.7 versions. But, since it's not "production-ready", upper management doesn't want to fully deploy it yet. They just want to keep an eye on it, though. Kudu was so much simpler and easier to use in every aspect compared to HBase. Impala was great for the report writers and analysts to experiment with for the short time it was up. But, once again, the only blocker was the lack of Spark support for our Data Developers/Scientists. So, production-level data population won't happen until then.
>>>>>>>>>>>
>>>>>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Ben
>>>>>>>>>>>
>>>>>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>> J-D,
>>>>>>>>>>>>
>>>>>>>>>>>> The main thing I hear is that Cassandra is being used as an updatable hot data store to ensure that duplicates are taken care of and idempotency is maintained. Whether data was directly retrieved from Cassandra for analytics, reports, or searches, it was not clear what its main use was. Some also just used it as a staging area to populate downstream tables in Parquet format. The last thing I heard was that CQL was terrible, so that rules out much use of direct queries against it.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real analytics, just ease of use instead of plainly using the APIs. Even then, Kudu should beat it easily on big scans. Same for HBase. We've done benchmarks against the latter, not the former.
>>>>>>>>>>>>
>>>>>>>>>>>> As for our company, we have been looking for an updatable data store for a long time that can be quickly queried directly, either using Spark SQL or Impala or some other SQL engine, and still handle TBs or PBs of data without performance degradation and many configuration headaches. For now, we are using HBase to take on this role, with Phoenix as a fast way to directly query the data. I can see Kudu as the best way to fill this gap easily, especially being the closest thing to other relational databases out there in familiarity for the many SQL analytics people in our company. The other alternative would be to go with AWS Redshift for the same reasons, but it would come at a cost, of course.
>>>>>>>>>>>> If we went with either solution, Kudu or Redshift, it would get rid of the need to extract from HBase to Parquet tables or export to PostgreSQL to support more of the SQL language used by analysts or the reporting software we use.
>>>>>>>>>>>>
>>>>>>>>>>>> OK, the usual then *smile*. Looks like we're not too far off with Kudu. Have you folks tried Kudu with Impala yet with those use cases?
>>>>>>>>>>>>
>>>>>>>>>>>> I hope this helps.
>>>>>>>>>>>>
>>>>>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Ben
>>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ha, first time I'm hearing about SMACK. Inside Cloudera we like to refer to "Impala + Kudu" as Kimpala, but yeah, it's not as sexy. My colleagues who were also there did say that the hype around Spark isn't dying down.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There's definitely an overlap in the use cases that Cassandra, HBase, and Kudu cater to. I wouldn't go as far as saying that C* is just an interim solution for the use case you describe.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nothing significant happened in Kudu over the past month; it's a storage engine, so things move slowly *smile*. I'd love to see more contributions on the Spark front. I know there's code out there that could be integrated into kudu-spark, it just needs to land in Gerrit. I'm sure folks will happily review it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you have relevant experiences you can share? I'd love to learn more about the use cases for which you envision using Kudu as a C* replacement.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>
>>>>>>>>>>>>> My colleagues recently came back from Strata in San Jose. They told me that everything was about Spark and there is a big buzz about the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that Cassandra is just an interim solution as a low-latency, easily queried data store. I was wondering if anything significant happened in regards to Kudu, especially on the Spark front. Plus, can you come up with your own proposed stack acronym to promote?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Ben,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> AFAIK no one in the dev community committed to any timeline. I know of one person on the Kudu Slack who's working on a better RDD, but that's about it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com> wrote:
>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214?
>>>>>>>>>>>>>> I want to target a version of Kudu to begin real testing of Spark against it for our devs. At least, I can tell them what timeframe to anticipate.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just curious,
>>>>>>>>>>>>>> Benjamin Kim
>>>>>>>>>>>>>> Data Solutions Architect
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mobile: +1 818 635 2900
>>>>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 | www.amobee.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed either.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The kuduRDD is just leveraging the MR input format; ideally we'd use scans directly.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The SparkSQL stuff is there, but it doesn't do any sort of pushdown. It's really basic.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The goal was to provide something for others to contribute to. We have some basic unit tests that others can easily extend. None of us on the team are Spark experts, but we'd be really happy to assist anyone improving the kudu-spark code.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu RDD, kudu DStream) in KUDU-1214. Am I right? Besides shoring up more Spark SQL functionality (DataFrames) and doing the documentation, what more needs to be done? Optimizations?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I believe that it's a good place to start using Spark with Kudu and compare it to HBase with Spark (not clean).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this in for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's a really simple wrapper, and yes, you can use SparkSQL on Kudu, but it will require a lot more work to make it fast/useful.
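>>>>>>>>>>>>>>>> Using the wrapper looks roughly like this (a sketch only, assuming `kuduMaster` and `tableName` are defined; check the kudu-spark module for the actual datasource package name):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> val df = sqlContext.read
>>>>>>>>>>>>>>>>   .format("org.kududb.spark.kudu")  // the kudu-spark DefaultSource
>>>>>>>>>>>>>>>>   .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
>>>>>>>>>>>>>>>>   .load()
>>>>>>>>>>>>>>>> df.registerTempTable(tableName)  // then query it via sqlContext.sql(...)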
>>>>>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>>>>>> I see KUDU-1214 (https://issues.cloudera.org/browse/KUDU-1214) targeted for 0.8.0, but I see no progress on it. When this is complete, will this mean that Spark will be able to work with Kudu both programmatically and as a client via Spark SQL? Or is there more work that needs to be done on the Spark side for it to work?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Ben