Dan,

Thanks! It got further. Now, how do I set the primary key to one or more columns in the DataFrame and set the partitioning? Is it like this?

kuduContext.createTable(tableName, df.schema, Seq("my_id"), new CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))

java.lang.IllegalArgumentException: Table partitioning must be specified using setRangePartitionColumns or addHashPartitions

Thanks,
Ben

> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com> wrote:
>
> Looks like we're missing an import statement in that example. Could you try:
>
> import org.kududb.client._
> and try again?
>
> - Dan
>
> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> I encountered an error trying to create a table based on the documentation
> from a DataFrame.
>
> <console>:49: error: not found: type CreateTableOptions
> kuduContext.createTable(tableName, df.schema, Seq("key"), new CreateTableOptions().setNumReplicas(1))
>
> Is there something I’m missing?
>
> Thanks,
> Ben
>
>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>
>> It's only in Cloudera's maven repo:
>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>
>> J-D
>>
>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> Hi J-D,
>>
>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for
>> spark-shell to use. Can you show me where to find it?
>>
>> Thanks,
>> Ben
>>
>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>
>>> What's in this doc is what's gonna get released:
>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>
>>> J-D
>>>
>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> Will this be documented with examples once 0.9.0 comes out?
>>>
>>> Thanks,
>>> Ben
>>>
>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>
>>>> It will be in 0.9.0.
>>>>
>>>> J-D
>>>>
>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>> Hi Chris,
>>>>
>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>> On May 18, 2016, at 9:01 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>>>
>>>>> There is some code in review that needs some more refinement.
>>>>> It will allow upsert/insert from a dataframe using the datasource api. It
>>>>> will also allow the creation and deletion of tables from a dataframe
>>>>> http://gerrit.cloudera.org:8080/#/c/2992/
>>>>>
>>>>> Example usages will look something like:
>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>>>>
>>>>> -Chris George
>>>>>
>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>
>>>>> Can someone tell me what the state is of this Spark work?
>>>>>
>>>>> Also, does anyone have any sample code on how to update/insert data in
>>>>> Kudu using DataFrames?
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>>>>
>>>>>> SparkSQL cannot support these types of statements but we may be able to
>>>>>> implement similar functionality through the api.
>>>>>> -Chris
>>>>>>
>>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>>
>>>>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if
>>>>>> it were to be implemented.
>>>>>>
>>>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>>> WHEN MATCHED THEN
>>>>>>   UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>>> WHEN NOT MATCHED THEN
>>>>>>   INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
>>>>>>
>>>>>> Cheers,
>>>>>> Ben
>>>>>>
>>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com> wrote:
>>>>>>>
>>>>>>> I have a wip kuduRDD that I made a few months ago. I pushed it into
>>>>>>> gerrit if you want to take a look.
>>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/
>>>>>>> It does pushdown predicates which the existing input formatter based
>>>>>>> rdd does not.
>>>>>>>
>>>>>>> Within the next two weeks I’m planning to implement a datasource for
>>>>>>> spark that will have pushdown predicates and insertion/update
>>>>>>> functionality (need to look more at cassandra and the hbase datasource
>>>>>>> for best way to do this). I agree that server side upsert would be
>>>>>>> helpful.
>>>>>>> Having a datasource would give us useful data frames and also make
>>>>>>> spark sql usable for kudu.
>>>>>>>
>>>>>>> My reasoning for having a spark datasource and not using Impala is:
>>>>>>> 1. We have had trouble getting impala to run fast with high concurrency
>>>>>>> when compared to spark.
>>>>>>> 2. We interact with datasources which do not integrate with impala.
>>>>>>> 3. We have custom sql query planners for extended sql functionality.
>>>>>>>
>>>>>>> -Chris George
>>>>>>>
>>>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org> wrote:
>>>>>>>
>>>>>>> You guys make a convincing point, although on the upsert side we'll
>>>>>>> need more support from the servers. Right now all you can do is an
>>>>>>> INSERT then, if you get a dup key, do an UPDATE. I guess we could at
>>>>>>> least add an API on the client side that would manage it, but it
>>>>>>> wouldn't be atomic.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>>> It's pretty simple, actually. I need to support versioned datasets in
>>>>>>> a Spark SQL environment. Instead of a hack on top of a Parquet data
>>>>>>> store, I'm hoping (among other reasons) to be able to use Kudu's write
>>>>>>> and timestamp-based read operations to support not only appending data,
>>>>>>> but also updating existing data, and even some schema migration. The
>>>>>>> most typical use case is a dataset that is updated periodically (e.g.,
>>>>>>> weekly or monthly) in which the preliminary data in the previous
>>>>>>> window (week or month) is updated with values that are expected to
>>>>>>> remain unchanged from then on, and a new set of preliminary values for
>>>>>>> the current window needs to be added/appended.
>>>>>>>
>>>>>>> Using Kudu's Java API and developing additional functionality on top of
>>>>>>> what Kudu has to offer isn't too much to ask, but the ease of
>>>>>>> integration with Spark SQL will gate how quickly we would move to using
>>>>>>> Kudu and how seriously we'd look at alternatives before making that
>>>>>>> decision.
>>>>>>>
>>>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>> Mark,
>>>>>>>
>>>>>>> Thanks for taking some time to reply in this thread, glad it caught the
>>>>>>> attention of other folks!
>>>>>>>
>>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>>>>
>>>>>>> I care about insert into Kudu with Spark SQL. I'm currently delaying a
>>>>>>> refactoring of some Spark SQL-oriented insert functionality while
>>>>>>> trying to evaluate what to expect from Kudu. Whether Kudu does a good
>>>>>>> job supporting inserts with Spark SQL will be a key consideration as to
>>>>>>> whether we adopt Kudu.
>>>>>>>
>>>>>>> I'd like to know more about why SparkSQL inserts are necessary for you.
>>>>>>> Is it just that you currently do it that way into some database or
>>>>>>> parquet so with minimal refactoring you'd be able to use Kudu? Would
>>>>>>> re-writing those SQL lines into Scala and directly using the Java API's
>>>>>>> KuduSession be too much work?
>>>>>>>
>>>>>>> Additionally, what do you expect to gain from using Kudu vs. your
>>>>>>> current solution? If it's not completely clear, I'd love to help you
>>>>>>> think through it.
>>>>>>>
>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>> Yup, starting to get a good idea.
>>>>>>>
>>>>>>> What are your DS folks looking for in terms of functionality related to
>>>>>>> Spark? A SparkSQL integration that's as fully featured as Impala's? Do
>>>>>>> they care about being able to insert into Kudu with SparkSQL or just
>>>>>>> being able to query real fast? Anything more specific to Spark that I'm
>>>>>>> missing?
>>>>>>>
>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera
>>>>>>> all our resources are committed to making things happen in time, and a
>>>>>>> more fully featured Spark integration isn't in our plans during that
>>>>>>> period. I'm really hoping someone in the community will help with
>>>>>>> Spark, the same way we got a big contribution for the Flume sink.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But, since
>>>>>>> it’s not “production-ready”, upper management doesn’t want to fully
>>>>>>> deploy it yet. They just want to keep an eye on it though. Kudu was so
>>>>>>> much simpler and easier to use in every aspect compared to HBase.
>>>>>>> Impala was great for the report writers and analysts to experiment with
>>>>>>> for the short time it was up. But, once again, the only blocker was the
>>>>>>> lack of Spark support for our Data Developers/Scientists. So,
>>>>>>> production-level data population won’t happen until then.
>>>>>>>
>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ben
>>>>>>>
>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>
>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>> J-D,
>>>>>>>>
>>>>>>>> The main thing I hear is that Cassandra is being used as an updatable hot
>>>>>>>> data store to ensure that duplicates are taken care of and idempotency
>>>>>>>> is maintained. Whether data was directly retrieved from Cassandra for
>>>>>>>> analytics, reports, or searches, it was not clear as to what was its
>>>>>>>> main use. Some also just used it for a staging area to populate
>>>>>>>> downstream tables in parquet format. The last thing I heard was that
>>>>>>>> CQL was terrible, so that rules out much use of direct queries against
>>>>>>>> it.
>>>>>>>>
>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real analytics,
>>>>>>>> just ease of use instead of plainly using the APIs. Even then, Kudu
>>>>>>>> should beat it easily on big scans. Same for HBase. We've done
>>>>>>>> benchmarks against the latter, not the former.
>>>>>>>>
>>>>>>>> As for our company, we have been looking for an updatable data store
>>>>>>>> for a long time that can be quickly queried directly either using
>>>>>>>> Spark SQL or Impala or some other SQL engine and still handle TB or PB
>>>>>>>> of data without performance degradation and many configuration
>>>>>>>> headaches. For now, we are using HBase to take on this role with
>>>>>>>> Phoenix as a fast way to directly query the data. I can see Kudu as
>>>>>>>> the best way to fill this gap easily, especially being the closest
>>>>>>>> thing to other relational databases out there in familiarity for the
>>>>>>>> many SQL analytics people in our company. The other alternative would
>>>>>>>> be to go with AWS Redshift for the same reasons, but it would come at
>>>>>>>> a cost, of course. If we went with either solution, Kudu or Redshift,
>>>>>>>> it would get rid of the need to extract from HBase to parquet tables
>>>>>>>> or export to PostgreSQL to support more of the SQL language used by
>>>>>>>> analysts or the reporting software we use.
>>>>>>>>
>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off with
>>>>>>>> Kudu. Have you folks tried Kudu with Impala yet with those use cases?
>>>>>>>>
>>>>>>>> I hope this helps.
>>>>>>>>
>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Ben
>>>>>>>>
>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Ha first time I'm hearing about SMACK. Inside Cloudera we like to
>>>>>>>>> refer to "Impala + Kudu" as Kimpala, but yeah it's not as sexy. My
>>>>>>>>> colleagues who were also there did say that the hype around Spark
>>>>>>>>> isn't dying down.
>>>>>>>>>
>>>>>>>>> There's definitely an overlap in the use cases that Cassandra, HBase,
>>>>>>>>> and Kudu cater to. I wouldn't go as far as saying that C* is just an
>>>>>>>>> interim solution for the use case you describe.
>>>>>>>>>
>>>>>>>>> Nothing significant happened in Kudu over the past month, it's a
>>>>>>>>> storage engine so things move slowly *smile*. I'd love to see more
>>>>>>>>> contributions on the Spark front. I know there's code out there that
>>>>>>>>> could be integrated in kudu-spark, it just needs to land in gerrit.
>>>>>>>>> I'm sure folks will happily review it.
>>>>>>>>>
>>>>>>>>> Do you have relevant experiences you can share? I'd love to learn
>>>>>>>>> more about the use cases for which you envision using Kudu as a C*
>>>>>>>>> replacement.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>> Hi J-D,
>>>>>>>>>
>>>>>>>>> My colleagues recently came back from Strata in San Jose. They told
>>>>>>>>> me that everything was about Spark and there is a big buzz about the
>>>>>>>>> SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think
>>>>>>>>> that Cassandra is just an interim solution as a low-latency, easily
>>>>>>>>> queried data store. I was wondering if anything significant happened
>>>>>>>>> in regards to Kudu, especially on the Spark front. Plus, can you come
>>>>>>>>> up with your own proposed stack acronym to promote?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Ben,
>>>>>>>>>>
>>>>>>>>>> AFAIK no one in the dev community committed to any timeline. I know
>>>>>>>>>> of one person on the Kudu Slack who's working on a better RDD, but
>>>>>>>>>> that's about it.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com> wrote:
>>>>>>>>>> Hi J-D,
>>>>>>>>>>
>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target a
>>>>>>>>>> version of Kudu to begin real testing of Spark against it for our
>>>>>>>>>> devs. At least, I can tell them what timeframe to anticipate.
>>>>>>>>>>
>>>>>>>>>> Just curious,
>>>>>>>>>> Benjamin Kim
>>>>>>>>>> Data Solutions Architect
>>>>>>>>>>
>>>>>>>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>>>>>>>
>>>>>>>>>> Mobile: +1 818 635 2900
>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 |
>>>>>>>>>> www.amobee.com
>>>>>>>>>>
>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed
>>>>>>>>>>> either.
>>>>>>>>>>>
>>>>>>>>>>> The kuduRDD is just leveraging the MR input format, ideally we'd
>>>>>>>>>>> use scans directly.
>>>>>>>>>>>
>>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of pushdown.
>>>>>>>>>>> It's really basic.
>>>>>>>>>>>
>>>>>>>>>>> The goal was to provide something for others to contribute to. We
>>>>>>>>>>> have some basic unit tests that others can easily extend. None of
>>>>>>>>>>> us on the team are Spark experts, but we'd be really happy to
>>>>>>>>>>> assist anyone in improving the kudu-spark code.
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>> J-D,
>>>>>>>>>>>
>>>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu RDD,
>>>>>>>>>>> kudu DStream) in KUDU-1214. Am I right? Besides shoring up more
>>>>>>>>>>> Spark SQL functionality (Dataframes) and doing the documentation,
>>>>>>>>>>> what more needs to be done? Optimizations?
>>>>>>>>>>>
>>>>>>>>>>> I believe that it’s a good place to start using Spark with Kudu and
>>>>>>>>>>> compare it to HBase with Spark (not clean).
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ben
>>>>>>>>>>>
>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this in
>>>>>>>>>>>> for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>>>>
>>>>>>>>>>>> It's a really simple wrapper, and yes you can use SparkSQL on
>>>>>>>>>>>> Kudu, but it will require a lot more work to make it fast/useful.
>>>>>>>>>>>>
>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>
>>>>>>>>>>>> J-D
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>> I see this KUDU-1214
>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted for 0.8.0,
>>>>>>>>>>>> but I see no progress on it. When this is complete, will this mean
>>>>>>>>>>>> that Spark will be able to work with Kudu both programmatically
>>>>>>>>>>>> and as a client via Spark SQL? Or is there more work that needs to
>>>>>>>>>>>> be done on the Spark side for it to work?
>>>>>>>>>>>>
>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Ben
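
On the partitioning question at the top of this message, here is a minimal sketch rather than a confirmed answer from the thread. It assumes the 0.9.0 Java client's CreateTableOptions.addHashPartitions takes a java.util.List of column names plus a bucket count, so the Scala column list is converted with asJava. The object and helper names and the default of 4 buckets are made up for illustration. The primary key is whatever is passed as the keys Seq, and those columns have to exist in df.schema.

```scala
import scala.collection.JavaConverters._

import org.apache.spark.sql.DataFrame
import org.kududb.client.CreateTableOptions
import org.kududb.spark.kudu.KuduContext

// Hypothetical helper for illustration; it is not part of kudu-spark.
object CreateTableSketch {
  def createHashPartitionedTable(
      kuduContext: KuduContext,
      tableName: String,
      df: DataFrame,
      keyColumns: Seq[String],
      buckets: Int = 4): Unit = { // bucket count is an arbitrary example value
    // The Java client expects a java.util.List[String], hence asJava.
    val options = new CreateTableOptions()
      .setNumReplicas(1)
      .addHashPartitions(keyColumns.asJava, buckets)

    // The same columns serve as the primary key and as the hash partition columns.
    kuduContext.createTable(tableName, df.schema, keyColumns, options)
  }
}
```

From spark-shell this would be called as CreateTableSketch.createHashPartitionedTable(kuduContext, tableName, df, Seq("my_id")). For range partitioning, setRangePartitionColumns (the other method named in the exception) would take the place of addHashPartitions.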