Chris,

That would be great! And a first! I think everyone would take notice if Kimpala had this.
Cheers,
Ben

> On Apr 13, 2016, at 8:22 AM, Chris George <christopher.geo...@rms.com> wrote:
>
> SparkSQL cannot support these types of statements, but we may be able to
> implement similar functionality through the API.
> -Chris
>
> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>
> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it
> were to be implemented.
>
> MERGE INTO table_name USING table_reference ON (condition)
>   WHEN MATCHED THEN
>     UPDATE SET column1 = value1 [, column2 = value2 ...]
>   WHEN NOT MATCHED THEN
>     INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
>
> Cheers,
> Ben
>
>> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com> wrote:
>>
>> I have a WIP kuduRDD that I made a few months ago. I pushed it into Gerrit
>> if you want to take a look: http://gerrit.cloudera.org:8080/#/c/2754/
>> It does push down predicates, which the existing input-format-based RDD
>> does not.
>>
>> Within the next two weeks I’m planning to implement a datasource for Spark
>> that will have pushdown predicates and insert/update functionality (I need
>> to look more at the Cassandra and HBase datasources for the best way to do this).
>> I agree that a server-side upsert would be helpful.
>> Having a datasource would give us useful data frames and also make Spark SQL
>> usable for Kudu.
>>
>> My reasoning for having a Spark datasource and not using Impala is:
>> 1. We have had trouble getting Impala to run fast with high concurrency when
>> compared to Spark.
>> 2. We interact with datasources which do not integrate with Impala.
>> 3. We have custom SQL query planners for extended SQL functionality.
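For readers skimming the thread, here is a minimal sketch of what the MERGE (upsert) semantics quoted above amount to, acted out over an in-memory map keyed by primary key. This is illustration only; the class and method names are invented for the sketch and nothing here touches the Kudu or Spark APIs.

```java
// Toy model of SQL:2003 MERGE: matched keys are UPDATEd,
// unmatched keys are INSERTed. All names here are made up.
import java.util.HashMap;
import java.util.Map;

public class MergeSketch {
    // MERGE INTO target USING source ON (key match).
    static Map<Integer, String> mergeInto(Map<Integer, String> target,
                                          Map<Integer, String> source) {
        Map<Integer, String> result = new HashMap<>(target);
        result.putAll(source); // put = UPDATE when matched, INSERT when not
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> target = new HashMap<>();
        target.put(1, "old");
        target.put(2, "keep");
        Map<Integer, String> source = new HashMap<>();
        source.put(1, "updated");  // matches key 1 -> UPDATE
        source.put(3, "inserted"); // no match -> INSERT
        System.out.println(mergeInto(target, source));
    }
}
```

The one-statement, one-round-trip nature of MERGE is exactly what the thread notes Kudu's servers did not yet offer at the time.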
>>
>> -Chris George
>>
>>
>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org> wrote:
>>
>> You guys make a convincing point, although on the upsert side we'll need
>> more support from the servers. Right now all you can do is an INSERT and then,
>> if you get a duplicate key, do an UPDATE. I guess we could at least add an API on
>> the client side that would manage it, but it wouldn't be atomic.
>>
>> J-D
>>
>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>> It's pretty simple, actually. I need to support versioned datasets in a
>> Spark SQL environment. Instead of a hack on top of a Parquet data store,
>> I'm hoping (among other reasons) to be able to use Kudu's write and
>> timestamp-based read operations to support not only appending data, but also
>> updating existing data, and even some schema migration. The most typical
>> use case is a dataset that is updated periodically (e.g., weekly or monthly),
>> in which the preliminary data in the previous window (week or month) is
>> updated with values that are expected to remain unchanged from then on, and
>> a new set of preliminary values for the current window needs to be
>> added/appended.
>>
>> Using Kudu's Java API and developing additional functionality on top of what
>> Kudu has to offer isn't too much to ask, but the ease of integration with
>> Spark SQL will gate how quickly we would move to using Kudu and how
>> seriously we'd look at alternatives before making that decision.
>>
>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>> Mark,
>>
>> Thanks for taking some time to reply in this thread; glad it caught the
>> attention of other folks!
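The client-side workaround J-D describes (try an INSERT, then fall back to an UPDATE on a duplicate-key error) can be sketched as follows. The session and exception types here are in-memory stand-ins, not the real Kudu client API; the point of the sketch is that the fallback is a second, separate operation, so the pair is not atomic.

```java
// Client-managed "upsert": INSERT first; on a duplicate key, UPDATE.
// Two separate operations means a concurrent writer can interleave
// between them -- the very non-atomicity J-D points out.
import java.util.HashMap;
import java.util.Map;

public class ClientUpsert {
    static class DuplicateKeyException extends Exception {}

    // Stand-in for a table reached through a client session.
    static final Map<Long, String> table = new HashMap<>();

    static void insert(long key, String value) throws DuplicateKeyException {
        if (table.containsKey(key)) throw new DuplicateKeyException();
        table.put(key, value);
    }

    static void update(long key, String value) {
        table.put(key, value);
    }

    static void upsert(long key, String value) {
        try {
            insert(key, value);
        } catch (DuplicateKeyException e) {
            // Worst case: a second round trip. Another writer could have
            // modified or deleted the row between the two calls.
            update(key, value);
        }
    }

    public static void main(String[] args) {
        upsert(1L, "first");  // insert path
        upsert(1L, "second"); // duplicate key -> update path
        System.out.println(table.get(1L));
    }
}
```

A true server-side upsert collapses these two steps into one operation applied atomically per row, which is why the thread keeps circling back to wanting it in the servers.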
>>
>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>> Do they care about being able to insert into Kudu with SparkSQL
>>
>> I care about insert into Kudu with Spark SQL. I'm currently delaying a
>> refactoring of some Spark SQL-oriented insert functionality while trying to
>> evaluate what to expect from Kudu. Whether Kudu does a good job supporting
>> inserts with Spark SQL will be a key consideration as to whether we adopt
>> Kudu.
>>
>> I'd like to know more about why SparkSQL inserts are necessary for you. Is it
>> just that you currently do it that way into some database or Parquet, so with
>> minimal refactoring you'd be able to use Kudu? Would rewriting those SQL
>> lines in Scala and directly using the Java API's KuduSession be too much
>> work?
>>
>> Additionally, what do you expect to gain from using Kudu vs. your current
>> solution? If it's not completely clear, I'd love to help you think through
>> it.
>>
>>
>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>> Yup, starting to get a good idea.
>>
>> What are your DS folks looking for in terms of functionality related to
>> Spark? A SparkSQL integration that's as fully featured as Impala's? Do they
>> care about being able to insert into Kudu with SparkSQL, or just about being
>> able to query real fast? Anything more specific to Spark that I'm missing?
>>
>> FWIW the plan is to get to 1.0 in late summer/early fall. At Cloudera all
>> our resources are committed to making things happen in time, and a more
>> fully featured Spark integration isn't in our plans during that period. I'm
>> really hoping someone in the community will help with Spark, the same way we
>> got a big contribution for the Flume sink.
>>
>> J-D
>>
>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> Yes, we took Kudu for a test run using the 0.6 and 0.7 versions.
>> But, since it’s
>> not “production-ready”, upper management doesn’t want to fully deploy it
>> yet. They just want to keep an eye on it, though. Kudu was so much simpler
>> and easier to use in every aspect compared to HBase. Impala was great for
>> the report writers and analysts to experiment with for the short time it was
>> up. But, once again, the only blocker was the lack of Spark support for our
>> Data Developers/Scientists. So, production-level data population won’t
>> happen until then.
>>
>> I hope this helps you get an idea of where I am coming from…
>>
>> Cheers,
>> Ben
>>
>>
>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>
>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> J-D,
>>>
>>> The main thing I hear is that Cassandra is being used as an updatable hot data
>>> store to ensure that duplicates are taken care of and idempotency is
>>> maintained. Whether data was directly retrieved from Cassandra for
>>> analytics, reports, or searches, it was not clear what its main use was.
>>> Some also just used it as a staging area to populate downstream
>>> tables in Parquet format. The last thing I heard was that CQL was terrible,
>>> so that rules out much use of direct queries against it.
>>>
>>> I'm no C* expert, but I don't think CQL is meant for real analytics, just
>>> ease of use instead of plainly using the APIs. Even then, Kudu should beat
>>> it easily on big scans. Same for HBase. We've done benchmarks against the
>>> latter, not the former.
>>>
>>>
>>> As for our company, we have been looking for an updatable data store for a
>>> long time that can be quickly queried directly, either using Spark SQL or
>>> Impala or some other SQL engine, and still handle TBs or PBs of data without
>>> performance degradation or many configuration headaches.
>>> For now, we are
>>> using HBase to take on this role, with Phoenix as a fast way to directly
>>> query the data. I can see Kudu as the best way to fill this gap easily,
>>> especially being the closest thing to other relational databases out there
>>> in familiarity for the many SQL analytics people in our company. The other
>>> alternative would be to go with AWS Redshift for the same reasons, but it
>>> would come at a cost, of course. If we went with either solution, Kudu or
>>> Redshift, it would get rid of the need to extract from HBase to Parquet
>>> tables or export to PostgreSQL to support more of the SQL language used by
>>> analysts or the reporting software we use.
>>>
>>> OK, the usual then *smile*. Looks like we're not too far off with Kudu.
>>> Have you folks tried Kudu with Impala yet with those use cases?
>>>
>>>
>>> I hope this helps.
>>>
>>> It does, thanks for the nice reply.
>>>
>>>
>>> Cheers,
>>> Ben
>>>
>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>
>>>> Ha, first time I'm hearing about SMACK. Inside Cloudera we like to refer to
>>>> "Impala + Kudu" as Kimpala, but yeah, it's not as sexy. My colleagues who
>>>> were also there did say that the hype around Spark isn't dying down.
>>>>
>>>> There's definitely an overlap in the use cases that Cassandra, HBase, and
>>>> Kudu cater to. I wouldn't go as far as saying that C* is just an interim
>>>> solution for the use case you describe.
>>>>
>>>> Nothing significant happened in Kudu over the past month; it's a storage
>>>> engine, so things move slowly *smile*. I'd love to see more contributions
>>>> on the Spark front. I know there's code out there that could be integrated
>>>> into kudu-spark; it just needs to land in Gerrit. I'm sure folks will
>>>> happily review it.
>>>>
>>>> Do you have relevant experiences you can share? I'd love to learn more
>>>> about the use cases for which you envision using Kudu as a C* replacement.
>>>>
>>>> Thanks,
>>>>
>>>> J-D
>>>>
>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>> Hi J-D,
>>>>
>>>> My colleagues recently came back from Strata in San Jose. They told me
>>>> that everything was about Spark, and there is a big buzz about the SMACK
>>>> stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that Cassandra
>>>> is just an interim solution as a low-latency, easily queried data store. I
>>>> was wondering if anything significant happened in regard to Kudu,
>>>> especially on the Spark front. Plus, can you come up with your own
>>>> proposed stack acronym to promote?
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>>
>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>
>>>>> Hi Ben,
>>>>>
>>>>> AFAIK no one in the dev community has committed to any timeline. I know of
>>>>> one person on the Kudu Slack who's working on a better RDD, but that's
>>>>> about it.
>>>>>
>>>>> Regards,
>>>>>
>>>>> J-D
>>>>>
>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com> wrote:
>>>>> Hi J-D,
>>>>>
>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target a version
>>>>> of Kudu to begin real testing of Spark against it for our devs. At least
>>>>> I can tell them what timeframe to anticipate.
>>>>>
>>>>> Just curious,
>>>>> Benjamin Kim
>>>>> Data Solutions Architect
>>>>>
>>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>>
>>>>> Mobile: +1 818 635 2900
>>>>> 3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 | www.amobee.com
>>>>>
>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>
>>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed either.
>>>>>>
>>>>>> The kuduRDD is just leveraging the MR input format; ideally we'd use
>>>>>> scans directly.
>>>>>>
>>>>>> The SparkSQL stuff is there, but it doesn't do any sort of pushdown. It's
>>>>>> really basic.
>>>>>>
>>>>>> The goal was to provide something for others to contribute to. We have
>>>>>> some basic unit tests that others can easily extend. None of us on the
>>>>>> team are Spark experts, but we'd be really happy to assist anyone improving
>>>>>> the kudu-spark code.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>> J-D,
>>>>>>
>>>>>> It looks like it fulfills most of the basic requirements (Kudu RDD, Kudu
>>>>>> DStream) in KUDU-1214. Am I right? Besides shoring up more Spark SQL
>>>>>> functionality (DataFrames) and doing the documentation, what more needs
>>>>>> to be done? Optimizations?
>>>>>>
>>>>>> I believe that it’s a good place to start using Spark with Kudu and
>>>>>> compare it to HBase with Spark (not clean).
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>
>>>>>>> AFAIK no one is working on it, but we did manage to get this in for
>>>>>>> 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>
>>>>>>> It's a really simple wrapper, and yes, you can use SparkSQL on Kudu, but
>>>>>>> it will require a lot more work to make it fast/useful.
>>>>>>>
>>>>>>> Hope this helps,
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>> I see KUDU-1214 (https://issues.cloudera.org/browse/KUDU-1214)
>>>>>>> targeted for 0.8.0, but I see no progress on it.
>>>>>>> When this is complete,
>>>>>>> will this mean that Spark will be able to work with Kudu both
>>>>>>> programmatically and as a client via Spark SQL? Or is there more work
>>>>>>> that needs to be done on the Spark side for it to work?
>>>>>>>
>>>>>>> Just curious.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ben
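As background to the pushdown discussion earlier in the thread (Chris's RDD pushes predicates down, while the input-format-based RDD and the early SparkSQL wrapper do not), here is a toy illustration of the difference. The scanner below is a stand-in, not Kudu's API; the rowsRead counter models how many rows cross the wire under each approach.

```java
// Predicate pushdown, illustrated: without it, the client reads every row
// and filters afterwards; with it, the predicate is applied at the source
// so only matching rows are ever "read". Names here are invented.
import java.util.ArrayList;
import java.util.List;

public class PushdownSketch {
    static int rowsRead = 0;

    // Scan with an optional lower-bound predicate "value >= min".
    static List<Long> scan(List<Long> stored, Long min) {
        List<Long> out = new ArrayList<>();
        for (long v : stored) {
            if (min != null && v < min) continue; // filtered at the source, never read
            rowsRead++;
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> stored = new ArrayList<>();
        for (long i = 1; i <= 100; i++) stored.add(i);

        // No pushdown: read all 100 rows, then filter on the client.
        rowsRead = 0;
        List<Long> all = scan(stored, null);
        List<Long> filtered = new ArrayList<>();
        for (long v : all) if (v >= 90) filtered.add(v);
        int readWithout = rowsRead;

        // With pushdown: only the 11 matching rows (90..100) are read.
        rowsRead = 0;
        List<Long> pushed = scan(stored, 90L);
        int readWith = rowsRead;

        System.out.println(readWithout + " rows read vs " + readWith);
    }
}
```

Both paths return the same 11 rows; the difference is how much data had to be scanned and shipped to get them, which is why the thread treats pushdown as the gap between a "really basic" wrapper and a fast one.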