It would be nice to adhere to the SQL:2003 standard for an “upsert” if it were to be implemented.
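For concreteness, MERGE's matched-then-update / not-matched-then-insert semantics can be sketched in a few lines of plain Python; the dict stands in for the target table (keyed by primary key) and the table/column names are hypothetical:

```python
# Sketch of SQL:2003 MERGE ("upsert") semantics against an in-memory table.
# A dict keyed by primary key stands in for the target table.

def merge_into(table, key, values):
    """Update the row when the key matches, insert it otherwise."""
    if key in table:                 # WHEN MATCHED THEN UPDATE
        table[key].update(values)
    else:                            # WHEN NOT MATCHED THEN INSERT
        table[key] = dict(values)

metrics = {1: {"host": "a", "cpu": 10}}
merge_into(metrics, 1, {"cpu": 42})              # matched -> updates row 1
merge_into(metrics, 2, {"host": "b", "cpu": 7})  # not matched -> inserts row 2
```

The standard's syntax for the same operation follows below.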
MERGE INTO table_name USING table_reference ON (condition)
  WHEN MATCHED THEN
    UPDATE SET column1 = value1 [, column2 = value2 ...]
  WHEN NOT MATCHED THEN
    INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])

Cheers,
Ben

> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com> wrote:
>
> I have a WIP kuduRDD that I made a few months ago. I pushed it into gerrit if
> you want to take a look: http://gerrit.cloudera.org:8080/#/c/2754/
> It does predicate pushdown, which the existing input-format-based RDD does
> not.
>
> Within the next two weeks I'm planning to implement a datasource for Spark
> that will have pushdown predicates and insert/update functionality (I need
> to look more at the Cassandra and HBase datasources for the best way to do
> this). I agree that a server-side upsert would be helpful.
> Having a datasource would give us useful data frames and also make Spark SQL
> usable for Kudu.
>
> My reasoning for having a Spark datasource and not using Impala:
> 1. We have had trouble getting Impala to run fast with high concurrency when
>    compared to Spark.
> 2. We interact with datasources which do not integrate with Impala.
> 3. We have custom SQL query planners for extended SQL functionality.
>
> -Chris George
>
>
> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org> wrote:
>
> You guys make a convincing point, although on the upsert side we'll need
> more support from the servers. Right now all you can do is an INSERT and
> then, if you get a dup key, an UPDATE. I guess we could at least add an API
> on the client side that would manage it, but it wouldn't be atomic.
>
> J-D
>
> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
> It's pretty simple, actually. I need to support versioned datasets in a
> Spark SQL environment.
> Instead of a hack on top of a Parquet data store, I'm hoping (among other
> reasons) to be able to use Kudu's write and timestamp-based read operations
> to support not only appending data, but also updating existing data, and
> even some schema migration. The most typical use case is a dataset that is
> updated periodically (e.g., weekly or monthly) in which the preliminary
> data in the previous window (week or month) is updated with values that are
> expected to remain unchanged from then on, and a new set of preliminary
> values for the current window needs to be added/appended.
>
> Using Kudu's Java API and developing additional functionality on top of
> what Kudu has to offer isn't too much to ask, but the ease of integration
> with Spark SQL will gate how quickly we would move to using Kudu and how
> seriously we'd look at alternatives before making that decision.
>
> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> Mark,
>
> Thanks for taking some time to reply in this thread, glad it caught the
> attention of other folks!
>
> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> Do they care about being able to insert into Kudu with SparkSQL
>
> I care about insert into Kudu with Spark SQL. I'm currently delaying a
> refactoring of some Spark SQL-oriented insert functionality while trying to
> evaluate what to expect from Kudu. Whether Kudu does a good job supporting
> inserts with Spark SQL will be a key consideration as to whether we adopt
> Kudu.
>
> I'd like to know more about why Spark SQL inserts are necessary for you. Is
> it just that you currently do it that way into some database or Parquet, so
> with minimal refactoring you'd be able to use Kudu? Would rewriting those
> SQL lines in Scala and directly using the Java API's KuduSession be too
> much work?
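As J-D points out earlier in the thread, without server-side upsert a client can only attempt an INSERT and fall back to an UPDATE on a duplicate key. A minimal pure-Python sketch of that wrapper follows; the `FakeSession` class is an in-memory stand-in for a Kudu session (the real client raises on a duplicate primary key), every name here is hypothetical, and note that the two operations are not atomic:

```python
# Client-side "upsert" as insert-then-update-on-duplicate-key.
# FakeSession is an in-memory stand-in for a Kudu session API.

class DuplicateKeyError(Exception):
    pass

class FakeSession:
    def __init__(self):
        self.rows = {}

    def insert(self, key, values):
        if key in self.rows:
            raise DuplicateKeyError(key)
        self.rows[key] = dict(values)

    def update(self, key, values):
        self.rows[key].update(values)

def client_side_upsert(session, key, values):
    # Not atomic: another writer can slip in between the two calls.
    try:
        session.insert(key, values)
    except DuplicateKeyError:
        session.update(key, values)

session = FakeSession()
client_side_upsert(session, 1, {"v": 1})  # first write inserts
client_side_upsert(session, 1, {"v": 2})  # second write falls back to update
```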
>
> Additionally, what do you expect to gain from using Kudu vs. your current
> solution? If it's not completely clear, I'd love to help you think through
> it.
>
>
> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> Yup, starting to get a good idea.
>
> What are your DS folks looking for in terms of functionality related to
> Spark? A SparkSQL integration that's as fully featured as Impala's? Do they
> care about being able to insert into Kudu with SparkSQL, or just about
> being able to query real fast? Anything more specific to Spark that I'm
> missing?
>
> FWIW the plan is to get to 1.0 in late summer/early fall. At Cloudera all
> our resources are committed to making things happen in time, and a more
> fully featured Spark integration isn't in our plans during that period. I'm
> really hoping someone in the community will help with Spark, the same way
> we got a big contribution for the Flume sink.
>
> J-D
>
> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Yes, we took Kudu for a test run using the 0.6 and 0.7 versions. But, since
> it's not "production-ready", upper management doesn't want to fully deploy
> it yet. They just want to keep an eye on it, though. Kudu was so much
> simpler and easier to use in every aspect compared to HBase. Impala was
> great for the report writers and analysts to experiment with for the short
> time it was up. But, once again, the only blocker was the lack of Spark
> support for our Data Developers/Scientists. So, production-level data
> population won't happen until then.
>
> I hope this helps you get an idea where I am coming from…
>
> Cheers,
> Ben
>
>
>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>
>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> J-D,
>>
>> The main thing I hear is that Cassandra is being used as an updatable hot
>> data store to ensure that duplicates are taken care of and idempotency is
>> maintained. Whether data was directly retrieved from Cassandra for
>> analytics, reports, or searches, it was not clear what its main use was.
>> Some also just used it as a staging area to populate downstream tables in
>> Parquet format. The last thing I heard was that CQL was terrible, so that
>> rules out much use of direct queries against it.
>>
>> I'm no C* expert, but I don't think CQL is meant for real analytics, just
>> ease of use instead of plainly using the APIs. Even then, Kudu should beat
>> it easily on big scans. Same for HBase. We've done benchmarks against the
>> latter, not the former.
>>
>>
>> As for our company, we have been looking for an updatable data store for a
>> long time, one that can be quickly queried directly, either using Spark
>> SQL or Impala or some other SQL engine, and still handle TBs or PBs of
>> data without performance degradation and many configuration headaches. For
>> now, we are using HBase in this role, with Phoenix as a fast way to
>> directly query the data. I can see Kudu as the best way to fill this gap
>> easily, especially being the closest thing to other relational databases
>> out there in familiarity for the many SQL analytics people in our company.
>> The other alternative would be to go with AWS Redshift for the same
>> reasons, but it would come at a cost, of course.
>> If we went with either solution, Kudu or Redshift, it would get rid of the
>> need to extract from HBase into Parquet tables or export to PostgreSQL to
>> support more of the SQL language used by analysts or the reporting
>> software we use.
>>
>> Ok, the usual then *smile*. Looks like we're not too far off with Kudu.
>> Have you folks tried Kudu with Impala yet with those use cases?
>>
>>
>> I hope this helps.
>>
>> It does, thanks for the nice reply.
>>
>>
>> Cheers,
>> Ben
>>
>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>
>>> Ha, first time I'm hearing about SMACK. Inside Cloudera we like to refer
>>> to "Impala + Kudu" as Kimpala, but yeah, it's not as sexy. My colleagues
>>> who were also there did say that the hype around Spark isn't dying down.
>>>
>>> There's definitely an overlap in the use cases that Cassandra, HBase, and
>>> Kudu cater to. I wouldn't go as far as saying that C* is just an interim
>>> solution for the use case you describe.
>>>
>>> Nothing significant happened in Kudu over the past month; it's a storage
>>> engine, so things move slowly *smile*. I'd love to see more contributions
>>> on the Spark front. I know there's code out there that could be
>>> integrated into kudu-spark, it just needs to land in gerrit. I'm sure
>>> folks will happily review it.
>>>
>>> Do you have relevant experiences you can share? I'd love to learn more
>>> about the use cases for which you envision using Kudu as a C* replacement.
>>>
>>> Thanks,
>>>
>>> J-D
>>>
>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> Hi J-D,
>>>
>>> My colleagues recently came back from Strata in San Jose. They told me
>>> that everything was about Spark and there is a big buzz about the SMACK
>>> stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that
>>> Cassandra is just an interim solution as a low-latency, easily queried
>>> data store.
>>> I was wondering if anything significant happened in regard to Kudu,
>>> especially on the Spark front. Plus, can you come up with your own
>>> proposed stack acronym to promote?
>>>
>>> Cheers,
>>> Ben
>>>
>>>
>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>
>>>> Hi Ben,
>>>>
>>>> AFAIK no one in the dev community has committed to any timeline. I know
>>>> of one person on the Kudu Slack who's working on a better RDD, but
>>>> that's about it.
>>>>
>>>> Regards,
>>>>
>>>> J-D
>>>>
>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com> wrote:
>>>> Hi J-D,
>>>>
>>>> Quick question… Is there an ETA for KUDU-1214? I want to target a
>>>> version of Kudu to begin real testing of Spark against it for our devs.
>>>> At least, I can tell them what timeframe to anticipate.
>>>>
>>>> Just curious,
>>>> Benjamin Kim
>>>> Data Solutions Architect
>>>>
>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>
>>>> Mobile: +1 818 635 2900
>>>> 3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 | www.amobee.com
>>>>
>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>
>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed
>>>>> either.
>>>>>
>>>>> The kuduRDD is just leveraging the MR input format; ideally we'd use
>>>>> scans directly.
>>>>>
>>>>> The SparkSQL stuff is there, but it doesn't do any sort of pushdown.
>>>>> It's really basic.
>>>>>
>>>>> The goal was to provide something for others to contribute to. We have
>>>>> some basic unit tests that others can easily extend. None of us on the
>>>>> team are Spark experts, but we'd be really happy to assist anyone
>>>>> improving the kudu-spark code.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>> J-D,
>>>>>
>>>>> It looks like it fulfills most of the basic requirements (kudu RDD,
>>>>> kudu DStream) in KUDU-1214. Am I right? Besides shoring up more Spark
>>>>> SQL functionality (DataFrames) and doing the documentation, what more
>>>>> needs to be done? Optimizations?
>>>>>
>>>>> I believe that it's a good place to start using Spark with Kudu and
>>>>> compare it to HBase with Spark (not clean).
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>
>>>>>> AFAIK no one is working on it, but we did manage to get this in for
>>>>>> 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>
>>>>>> It's a really simple wrapper, and yes, you can use SparkSQL on Kudu,
>>>>>> but it will require a lot more work to make it fast/useful.
>>>>>>
>>>>>> Hope this helps,
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>> I see KUDU-1214 <https://issues.cloudera.org/browse/KUDU-1214>
>>>>>> targeted for 0.8.0, but I see no progress on it. When this is
>>>>>> complete, will it mean that Spark will be able to work with Kudu both
>>>>>> programmatically and as a client via Spark SQL? Or is there more work
>>>>>> that needs to be done on the Spark side for it to work?
>>>>>>
>>>>>> Just curious.
>>>>>>
>>>>>> Cheers,
>>>>>> Ben