Chris,

That would be great! And a first! I think everyone would take notice if Kimpala had this.
Cheers,
Ben

> On Apr 13, 2016, at 8:22 AM, Chris George <christopher.geo...@rms.com> wrote:
>
> SparkSQL cannot support these types of statements, but we may be able to
> implement similar functionality through the API.
> -Chris
>
> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>
> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it
> were to be implemented.
>
> MERGE INTO table_name USING table_reference ON (condition)
>   WHEN MATCHED THEN
>     UPDATE SET column1 = value1 [, column2 = value2 ...]
>   WHEN NOT MATCHED THEN
>     INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
>
> Cheers,
> Ben
>
>> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com> wrote:
>>
>> I have a WIP kuduRDD that I made a few months ago. I pushed it into Gerrit
>> if you want to take a look: http://gerrit.cloudera.org:8080/#/c/2754/
>> It does push down predicates, which the existing input-format-based RDD
>> does not.
>>
>> Within the next two weeks I’m planning to implement a datasource for Spark
>> that will have pushdown predicates and insert/update functionality (I need
>> to look more at the Cassandra and HBase datasources for the best way to do this).
>> I agree that a server-side upsert would be helpful.
>> Having a datasource would give us useful data frames and also make Spark SQL
>> usable for Kudu.
>>
>> My reasoning for having a Spark datasource and not using Impala is:
>> 1. We have had trouble getting Impala to run fast with high concurrency when
>> compared to Spark.
>> 2. We interact with datasources which do not integrate with Impala.
>> 3. We have custom SQL query planners for extended SQL functionality.
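For readers skimming the thread, here is a minimal sketch of what the MERGE (upsert) semantics quoted above amount to, acted out over an in-memory map keyed by primary key. This is illustration only; the class and method names are invented for the sketch and nothing here touches the Kudu or Spark APIs.

```java
// Toy model of SQL:2003 MERGE: matched keys are UPDATEd,
// unmatched keys are INSERTed. All names here are made up.
import java.util.HashMap;
import java.util.Map;

public class MergeSketch {
    // MERGE INTO target USING source ON (key match).
    static Map<Integer, String> mergeInto(Map<Integer, String> target,
                                          Map<Integer, String> source) {
        Map<Integer, String> result = new HashMap<>(target);
        result.putAll(source); // put = UPDATE when matched, INSERT when not
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> target = new HashMap<>();
        target.put(1, "old");
        target.put(2, "keep");
        Map<Integer, String> source = new HashMap<>();
        source.put(1, "updated");  // matches key 1 -> UPDATE
        source.put(3, "inserted"); // no match -> INSERT
        System.out.println(mergeInto(target, source));
    }
}
```

The one-statement, one-round-trip nature of MERGE is exactly what the thread notes Kudu's servers did not yet offer at the time.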
>>
>> -Chris George
>>
>>
>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org> wrote:
>>
>> You guys make a convincing point, although on the upsert side we'll need
>> more support from the servers. Right now all you can do is an INSERT and then,
>> if you get a duplicate key, do an UPDATE. I guess we could at least add an API on
>> the client side that would manage it, but it wouldn't be atomic.
>>
>> J-D
>>
>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>> It's pretty simple, actually. I need to support versioned datasets in a
>> Spark SQL environment. Instead of a hack on top of a Parquet data store,
>> I'm hoping (among other reasons) to be able to use Kudu's write and
>> timestamp-based read operations to support not only appending data, but also
>> updating existing data, and even some schema migration. The most typical
>> use case is a dataset that is updated periodically (e.g., weekly or monthly),
>> in which the preliminary data in the previous window (week or month) is
>> updated with values that are expected to remain unchanged from then on, and
>> a new set of preliminary values for the current window needs to be
>> added/appended.
>>
>> Using Kudu's Java API and developing additional functionality on top of what
>> Kudu has to offer isn't too much to ask, but the ease of integration with
>> Spark SQL will gate how quickly we would move to using Kudu and how
>> seriously we'd look at alternatives before making that decision.
>>
>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>> Mark,
>>
>> Thanks for taking some time to reply in this thread; glad it caught the
>> attention of other folks!
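The client-side workaround J-D describes (try an INSERT, then fall back to an UPDATE on a duplicate-key error) can be sketched as follows. The session and exception types here are in-memory stand-ins, not the real Kudu client API; the point of the sketch is that the fallback is a second, separate operation, so the pair is not atomic.

```java
// Client-managed "upsert": INSERT first; on a duplicate key, UPDATE.
// Two separate operations means a concurrent writer can interleave
// between them -- the very non-atomicity J-D points out.
import java.util.HashMap;
import java.util.Map;

public class ClientUpsert {
    static class DuplicateKeyException extends Exception {}

    // Stand-in for a table reached through a client session.
    static final Map<Long, String> table = new HashMap<>();

    static void insert(long key, String value) throws DuplicateKeyException {
        if (table.containsKey(key)) throw new DuplicateKeyException();
        table.put(key, value);
    }

    static void update(long key, String value) {
        table.put(key, value);
    }

    static void upsert(long key, String value) {
        try {
            insert(key, value);
        } catch (DuplicateKeyException e) {
            // Worst case: a second round trip. Another writer could have
            // modified or deleted the row between the two calls.
            update(key, value);
        }
    }

    public static void main(String[] args) {
        upsert(1L, "first");  // insert path
        upsert(1L, "second"); // duplicate key -> update path
        System.out.println(table.get(1L));
    }
}
```

A true server-side upsert collapses these two steps into one operation applied atomically per row, which is why the thread keeps circling back to wanting it in the servers.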
>>
>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>> Do they care about being able to insert into Kudu with SparkSQL
>>
>> I care about insert into Kudu with Spark SQL. I'm currently delaying a
>> refactoring of some Spark SQL-oriented insert functionality while trying to
>> evaluate what to expect from Kudu. Whether Kudu does a good job supporting
>> inserts with Spark SQL will be a key consideration as to whether we adopt
>> Kudu.
>>
>> I'd like to know more about why SparkSQL inserts are necessary for you. Is it
>> just that you currently do it that way into some database or Parquet, so with
>> minimal refactoring you'd be able to use Kudu? Would rewriting those SQL
>> lines in Scala and directly using the Java API's KuduSession be too much
>> work?
>>
>> Additionally, what do you expect to gain from using Kudu vs. your current
>> solution? If it's not completely clear, I'd love to help you think through
>> it.
>>
>>
>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>> Yup, starting to get a good idea.
>>
>> What are your DS folks looking for in terms of functionality related to
>> Spark? A SparkSQL integration that's as fully featured as Impala's? Do they
>> care about being able to insert into Kudu with SparkSQL, or just about being
>> able to query real fast? Anything more specific to Spark that I'm missing?
>>
>> FWIW the plan is to get to 1.0 in late summer/early fall. At Cloudera all
>> our resources are committed to making things happen in time, and a more
>> fully featured Spark integration isn't in our plans during that period. I'm
>> really hoping someone in the community will help with Spark, the same way we
>> got a big contribution for the Flume sink.
>>
>> J-D
>>
>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> Yes, we took Kudu for a test run using the 0.6 and 0.7 versions.
>> But, since it’s
>> not “production-ready”, upper management doesn’t want to fully deploy it
>> yet. They just want to keep an eye on it, though. Kudu was so much simpler
>> and easier to use in every aspect compared to HBase. Impala was great for
>> the report writers and analysts to experiment with for the short time it was
>> up. But, once again, the only blocker was the lack of Spark support for our
>> Data Developers/Scientists. So, production-level data population won’t
>> happen until then.
>>
>> I hope this helps you get an idea of where I am coming from…
>>
>> Cheers,
>> Ben
>>
>>
>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>
>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> J-D,
>>>
>>> The main thing I hear is that Cassandra is being used as an updatable hot data
>>> store to ensure that duplicates are taken care of and idempotency is
>>> maintained. Whether data was directly retrieved from Cassandra for
>>> analytics, reports, or searches, it was not clear what its main use was.
>>> Some also just used it as a staging area to populate downstream
>>> tables in Parquet format. The last thing I heard was that CQL was terrible,
>>> so that rules out much use of direct queries against it.
>>>
>>> I'm no C* expert, but I don't think CQL is meant for real analytics, just
>>> ease of use instead of plainly using the APIs. Even then, Kudu should beat
>>> it easily on big scans. Same for HBase. We've done benchmarks against the
>>> latter, not the former.
>>>
>>>
>>> As for our company, we have been looking for an updatable data store for a
>>> long time that can be quickly queried directly, either using Spark SQL or
>>> Impala or some other SQL engine, and still handle TBs or PBs of data without
>>> performance degradation or many configuration headaches.
>>> For now, we are
>>> using HBase to take on this role, with Phoenix as a fast way to directly
>>> query the data. I can see Kudu as the best way to fill this gap easily,
>>> especially being the closest thing to other relational databases out there
>>> in familiarity for the many SQL analytics people in our company. The other
>>> alternative would be to go with AWS Redshift for the same reasons, but it
>>> would come at a cost, of course. If we went with either solution, Kudu or
>>> Redshift, it would get rid of the need to extract from HBase to Parquet
>>> tables or export to PostgreSQL to support more of the SQL language used by
>>> analysts or the reporting software we use.
>>>
>>> OK, the usual then *smile*. Looks like we're not too far off with Kudu.
>>> Have you folks tried Kudu with Impala yet with those use cases?
>>>
>>>
>>> I hope this helps.
>>>
>>> It does, thanks for the nice reply.
>>>
>>>
>>> Cheers,
>>> Ben
>>>
>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>
>>>> Ha, first time I'm hearing about SMACK. Inside Cloudera we like to refer to
>>>> "Impala + Kudu" as Kimpala, but yeah, it's not as sexy. My colleagues who
>>>> were also there did say that the hype around Spark isn't dying down.
>>>>
>>>> There's definitely an overlap in the use cases that Cassandra, HBase, and
>>>> Kudu cater to. I wouldn't go as far as saying that C* is just an interim
>>>> solution for the use case you describe.
>>>>
>>>> Nothing significant happened in Kudu over the past month; it's a storage
>>>> engine, so things move slowly *smile*. I'd love to see more contributions
>>>> on the Spark front. I know there's code out there that could be integrated
>>>> into kudu-spark; it just needs to land in Gerrit. I'm sure folks will
>>>> happily review it.
>>>>
>>>> Do you have relevant experiences you can share? I'd love to learn more
>>>> about the use cases for which you envision using Kudu as a C* replacement.
>>>>
>>>> Thanks,
>>>>
>>>> J-D
>>>>
>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>> Hi J-D,
>>>>
>>>> My colleagues recently came back from Strata in San Jose. They told me
>>>> that everything was about Spark, and there is a big buzz about the SMACK
>>>> stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that Cassandra
>>>> is just an interim solution as a low-latency, easily queried data store. I
>>>> was wondering if anything significant happened in regard to Kudu,
>>>> especially on the Spark front. Plus, can you come up with your own
>>>> proposed stack acronym to promote?
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>>
>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>
>>>>> Hi Ben,
>>>>>
>>>>> AFAIK no one in the dev community has committed to any timeline. I know of
>>>>> one person on the Kudu Slack who's working on a better RDD, but that's
>>>>> about it.
>>>>>
>>>>> Regards,
>>>>>
>>>>> J-D
>>>>>
>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com> wrote:
>>>>> Hi J-D,
>>>>>
>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target a version
>>>>> of Kudu to begin real testing of Spark against it for our devs. At least
>>>>> I can tell them what timeframe to anticipate.
>>>>>
>>>>> Just curious,
>>>>> Benjamin Kim
>>>>> Data Solutions Architect
>>>>>
>>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>>
>>>>> Mobile: +1 818 635 2900
>>>>> 3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 | www.amobee.com
>>>>>
>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>
>>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed either.
>>>>>>
>>>>>> The kuduRDD is just leveraging the MR input format; ideally we'd use
>>>>>> scans directly.
>>>>>>
>>>>>> The SparkSQL stuff is there, but it doesn't do any sort of pushdown. It's
>>>>>> really basic.
>>>>>>
>>>>>> The goal was to provide something for others to contribute to. We have
>>>>>> some basic unit tests that others can easily extend. None of us on the
>>>>>> team are Spark experts, but we'd be really happy to assist anyone improving
>>>>>> the kudu-spark code.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>> J-D,
>>>>>>
>>>>>> It looks like it fulfills most of the basic requirements (Kudu RDD, Kudu
>>>>>> DStream) in KUDU-1214. Am I right? Besides shoring up more Spark SQL
>>>>>> functionality (DataFrames) and doing the documentation, what more needs
>>>>>> to be done? Optimizations?
>>>>>>
>>>>>> I believe that it’s a good place to start using Spark with Kudu and
>>>>>> compare it to HBase with Spark (not clean).
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>
>>>>>>> AFAIK no one is working on it, but we did manage to get this in for
>>>>>>> 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>
>>>>>>> It's a really simple wrapper, and yes, you can use SparkSQL on Kudu, but
>>>>>>> it will require a lot more work to make it fast/useful.
>>>>>>>
>>>>>>> Hope this helps,
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>> I see KUDU-1214 (https://issues.cloudera.org/browse/KUDU-1214)
>>>>>>> targeted for 0.8.0, but I see no progress on it.
>>>>>>> When this is complete,
>>>>>>> will this mean that Spark will be able to work with Kudu both
>>>>>>> programmatically and as a client via Spark SQL? Or is there more work
>>>>>>> that needs to be done on the Spark side for it to work?
>>>>>>>
>>>>>>> Just curious.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ben
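As background to the pushdown discussion earlier in the thread (Chris's RDD pushes predicates down, while the input-format-based RDD and the early SparkSQL wrapper do not), here is a toy illustration of the difference. The scanner below is a stand-in, not Kudu's API; the rowsRead counter models how many rows cross the wire under each approach.

```java
// Predicate pushdown, illustrated: without it, the client reads every row
// and filters afterwards; with it, the predicate is applied at the source
// so only matching rows are ever "read". Names here are invented.
import java.util.ArrayList;
import java.util.List;

public class PushdownSketch {
    static int rowsRead = 0;

    // Scan with an optional lower-bound predicate "value >= min".
    static List<Long> scan(List<Long> stored, Long min) {
        List<Long> out = new ArrayList<>();
        for (long v : stored) {
            if (min != null && v < min) continue; // filtered at the source, never read
            rowsRead++;
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> stored = new ArrayList<>();
        for (long i = 1; i <= 100; i++) stored.add(i);

        // No pushdown: read all 100 rows, then filter on the client.
        rowsRead = 0;
        List<Long> all = scan(stored, null);
        List<Long> filtered = new ArrayList<>();
        for (long v : all) if (v >= 90) filtered.add(v);
        int readWithout = rowsRead;

        // With pushdown: only the 11 matching rows (90..100) are read.
        rowsRead = 0;
        List<Long> pushed = scan(stored, 90L);
        int readWith = rowsRead;

        System.out.println(readWithout + " rows read vs " + readWith);
    }
}
```

Both paths return the same 11 rows; the difference is how much data had to be scanned and shipped to get them, which is why the thread treats pushdown as the gap between a "really basic" wrapper and a fast one.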