It would be nice to adhere to the SQL:2003 standard for an “upsert” if it were to be implemented.
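For concreteness, MERGE's matched-then-update / not-matched-then-insert semantics can be sketched in a few lines of plain Python; the dict stands in for the target table (keyed by primary key) and the table/column names are hypothetical:

```python
# Sketch of SQL:2003 MERGE ("upsert") semantics against an in-memory table.
# A dict keyed by primary key stands in for the target table.

def merge_into(table, key, values):
    """Update the row when the key matches, insert it otherwise."""
    if key in table:                 # WHEN MATCHED THEN UPDATE
        table[key].update(values)
    else:                            # WHEN NOT MATCHED THEN INSERT
        table[key] = dict(values)

metrics = {1: {"host": "a", "cpu": 10}}
merge_into(metrics, 1, {"cpu": 42})              # matched -> updates row 1
merge_into(metrics, 2, {"host": "b", "cpu": 7})  # not matched -> inserts row 2
```

The standard's syntax for the same operation follows below.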
MERGE INTO table_name USING table_reference ON (condition)
  WHEN MATCHED THEN
    UPDATE SET column1 = value1 [, column2 = value2 ...]
  WHEN NOT MATCHED THEN
    INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])

Cheers,
Ben

> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com> wrote:
>
> I have a WIP kuduRDD that I made a few months ago. I pushed it into gerrit if
> you want to take a look: http://gerrit.cloudera.org:8080/#/c/2754/
> It does predicate pushdown, which the existing input-format-based RDD does
> not.
>
> Within the next two weeks I'm planning to implement a datasource for Spark
> that will have pushdown predicates and insert/update functionality (I need
> to look more at the Cassandra and HBase datasources for the best way to do
> this). I agree that a server-side upsert would be helpful.
> Having a datasource would give us useful data frames and also make Spark SQL
> usable for Kudu.
>
> My reasoning for having a Spark datasource and not using Impala:
> 1. We have had trouble getting Impala to run fast with high concurrency when
>    compared to Spark.
> 2. We interact with datasources which do not integrate with Impala.
> 3. We have custom SQL query planners for extended SQL functionality.
>
> -Chris George
>
>
> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org> wrote:
>
> You guys make a convincing point, although on the upsert side we'll need
> more support from the servers. Right now all you can do is an INSERT and
> then, if you get a dup key, an UPDATE. I guess we could at least add an API
> on the client side that would manage it, but it wouldn't be atomic.
>
> J-D
>
> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
> It's pretty simple, actually. I need to support versioned datasets in a
> Spark SQL environment.
> Instead of a hack on top of a Parquet data store, I'm hoping (among other
> reasons) to be able to use Kudu's write and timestamp-based read operations
> to support not only appending data, but also updating existing data, and
> even some schema migration. The most typical use case is a dataset that is
> updated periodically (e.g., weekly or monthly) in which the preliminary
> data in the previous window (week or month) is updated with values that are
> expected to remain unchanged from then on, and a new set of preliminary
> values for the current window needs to be added/appended.
>
> Using Kudu's Java API and developing additional functionality on top of
> what Kudu has to offer isn't too much to ask, but the ease of integration
> with Spark SQL will gate how quickly we would move to using Kudu and how
> seriously we'd look at alternatives before making that decision.
>
> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> Mark,
>
> Thanks for taking some time to reply in this thread, glad it caught the
> attention of other folks!
>
> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> Do they care about being able to insert into Kudu with SparkSQL
>
> I care about insert into Kudu with Spark SQL. I'm currently delaying a
> refactoring of some Spark SQL-oriented insert functionality while trying to
> evaluate what to expect from Kudu. Whether Kudu does a good job supporting
> inserts with Spark SQL will be a key consideration as to whether we adopt
> Kudu.
>
> I'd like to know more about why Spark SQL inserts are necessary for you. Is
> it just that you currently do it that way into some database or Parquet, so
> with minimal refactoring you'd be able to use Kudu? Would rewriting those
> SQL lines in Scala and directly using the Java API's KuduSession be too
> much work?
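As J-D points out earlier in the thread, without server-side upsert a client can only attempt an INSERT and fall back to an UPDATE on a duplicate key. A minimal pure-Python sketch of that wrapper follows; the `FakeSession` class is an in-memory stand-in for a Kudu session (the real client raises on a duplicate primary key), every name here is hypothetical, and note that the two operations are not atomic:

```python
# Client-side "upsert" as insert-then-update-on-duplicate-key.
# FakeSession is an in-memory stand-in for a Kudu session API.

class DuplicateKeyError(Exception):
    pass

class FakeSession:
    def __init__(self):
        self.rows = {}

    def insert(self, key, values):
        if key in self.rows:
            raise DuplicateKeyError(key)
        self.rows[key] = dict(values)

    def update(self, key, values):
        self.rows[key].update(values)

def client_side_upsert(session, key, values):
    # Not atomic: another writer can slip in between the two calls.
    try:
        session.insert(key, values)
    except DuplicateKeyError:
        session.update(key, values)

session = FakeSession()
client_side_upsert(session, 1, {"v": 1})  # first write inserts
client_side_upsert(session, 1, {"v": 2})  # second write falls back to update
```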
>
> Additionally, what do you expect to gain from using Kudu vs. your current
> solution? If it's not completely clear, I'd love to help you think through
> it.
>
>
> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> Yup, starting to get a good idea.
>
> What are your DS folks looking for in terms of functionality related to
> Spark? A SparkSQL integration that's as fully featured as Impala's? Do they
> care about being able to insert into Kudu with SparkSQL, or just about
> being able to query real fast? Anything more specific to Spark that I'm
> missing?
>
> FWIW the plan is to get to 1.0 in late summer/early fall. At Cloudera all
> our resources are committed to making things happen in time, and a more
> fully featured Spark integration isn't in our plans during that period. I'm
> really hoping someone in the community will help with Spark, the same way
> we got a big contribution for the Flume sink.
>
> J-D
>
> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Yes, we took Kudu for a test run using the 0.6 and 0.7 versions. But, since
> it's not "production-ready", upper management doesn't want to fully deploy
> it yet. They just want to keep an eye on it, though. Kudu was so much
> simpler and easier to use in every aspect compared to HBase. Impala was
> great for the report writers and analysts to experiment with for the short
> time it was up. But, once again, the only blocker was the lack of Spark
> support for our Data Developers/Scientists. So, production-level data
> population won't happen until then.
>
> I hope this helps you get an idea where I am coming from…
>
> Cheers,
> Ben
>
>
>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>
>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> J-D,
>>
>> The main thing I hear is that Cassandra is being used as an updatable hot
>> data store to ensure that duplicates are taken care of and idempotency is
>> maintained. Whether data was directly retrieved from Cassandra for
>> analytics, reports, or searches, it was not clear what its main use was.
>> Some also just used it as a staging area to populate downstream tables in
>> Parquet format. The last thing I heard was that CQL was terrible, so that
>> rules out much use of direct queries against it.
>>
>> I'm no C* expert, but I don't think CQL is meant for real analytics, just
>> ease of use instead of plainly using the APIs. Even then, Kudu should beat
>> it easily on big scans. Same for HBase. We've done benchmarks against the
>> latter, not the former.
>>
>>
>> As for our company, we have been looking for an updatable data store for a
>> long time, one that can be quickly queried directly, either using Spark
>> SQL or Impala or some other SQL engine, and still handle TBs or PBs of
>> data without performance degradation and many configuration headaches. For
>> now, we are using HBase in this role, with Phoenix as a fast way to
>> directly query the data. I can see Kudu as the best way to fill this gap
>> easily, especially being the closest thing to other relational databases
>> out there in familiarity for the many SQL analytics people in our company.
>> The other alternative would be to go with AWS Redshift for the same
>> reasons, but it would come at a cost, of course.
>> If we went with either solution, Kudu or Redshift, it would get rid of the
>> need to extract from HBase into Parquet tables or export to PostgreSQL to
>> support more of the SQL language used by analysts or the reporting
>> software we use.
>>
>> Ok, the usual then *smile*. Looks like we're not too far off with Kudu.
>> Have you folks tried Kudu with Impala yet with those use cases?
>>
>>
>> I hope this helps.
>>
>> It does, thanks for the nice reply.
>>
>>
>> Cheers,
>> Ben
>>
>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>
>>> Ha, first time I'm hearing about SMACK. Inside Cloudera we like to refer
>>> to "Impala + Kudu" as Kimpala, but yeah, it's not as sexy. My colleagues
>>> who were also there did say that the hype around Spark isn't dying down.
>>>
>>> There's definitely an overlap in the use cases that Cassandra, HBase, and
>>> Kudu cater to. I wouldn't go as far as saying that C* is just an interim
>>> solution for the use case you describe.
>>>
>>> Nothing significant happened in Kudu over the past month; it's a storage
>>> engine, so things move slowly *smile*. I'd love to see more contributions
>>> on the Spark front. I know there's code out there that could be
>>> integrated into kudu-spark, it just needs to land in gerrit. I'm sure
>>> folks will happily review it.
>>>
>>> Do you have relevant experiences you can share? I'd love to learn more
>>> about the use cases for which you envision using Kudu as a C* replacement.
>>>
>>> Thanks,
>>>
>>> J-D
>>>
>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> Hi J-D,
>>>
>>> My colleagues recently came back from Strata in San Jose. They told me
>>> that everything was about Spark and there is a big buzz about the SMACK
>>> stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that
>>> Cassandra is just an interim solution as a low-latency, easily queried
>>> data store.
>>> I was wondering if anything significant happened in regard to Kudu,
>>> especially on the Spark front. Plus, can you come up with your own
>>> proposed stack acronym to promote?
>>>
>>> Cheers,
>>> Ben
>>>
>>>
>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>
>>>> Hi Ben,
>>>>
>>>> AFAIK no one in the dev community has committed to any timeline. I know
>>>> of one person on the Kudu Slack who's working on a better RDD, but
>>>> that's about it.
>>>>
>>>> Regards,
>>>>
>>>> J-D
>>>>
>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com> wrote:
>>>> Hi J-D,
>>>>
>>>> Quick question… Is there an ETA for KUDU-1214? I want to target a
>>>> version of Kudu to begin real testing of Spark against it for our devs.
>>>> At least, I can tell them what timeframe to anticipate.
>>>>
>>>> Just curious,
>>>> Benjamin Kim
>>>> Data Solutions Architect
>>>>
>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>
>>>> Mobile: +1 818 635 2900
>>>> 3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 | www.amobee.com
>>>>
>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>
>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed
>>>>> either.
>>>>>
>>>>> The kuduRDD is just leveraging the MR input format; ideally we'd use
>>>>> scans directly.
>>>>>
>>>>> The SparkSQL stuff is there, but it doesn't do any sort of pushdown.
>>>>> It's really basic.
>>>>>
>>>>> The goal was to provide something for others to contribute to. We have
>>>>> some basic unit tests that others can easily extend. None of us on the
>>>>> team are Spark experts, but we'd be really happy to assist anyone
>>>>> improving the kudu-spark code.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>> J-D,
>>>>>
>>>>> It looks like it fulfills most of the basic requirements (kudu RDD,
>>>>> kudu DStream) in KUDU-1214. Am I right? Besides shoring up more Spark
>>>>> SQL functionality (DataFrames) and doing the documentation, what more
>>>>> needs to be done? Optimizations?
>>>>>
>>>>> I believe that it's a good place to start using Spark with Kudu and
>>>>> compare it to HBase with Spark (not clean).
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>
>>>>>> AFAIK no one is working on it, but we did manage to get this in for
>>>>>> 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>
>>>>>> It's a really simple wrapper, and yes, you can use SparkSQL on Kudu,
>>>>>> but it will require a lot more work to make it fast/useful.
>>>>>>
>>>>>> Hope this helps,
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>> I see KUDU-1214 <https://issues.cloudera.org/browse/KUDU-1214>
>>>>>> targeted for 0.8.0, but I see no progress on it. When this is
>>>>>> complete, will it mean that Spark will be able to work with Kudu both
>>>>>> programmatically and as a client via Spark SQL? Or is there more work
>>>>>> that needs to be done on the Spark side for it to work?
>>>>>>
>>>>>> Just curious.
>>>>>>
>>>>>> Cheers,
>>>>>> Ben