I see that the API has changed a bit, so my old code doesn’t work anymore. Can 
someone direct me to some code samples?

Thanks,
Ben

> On Sep 20, 2016, at 1:44 PM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> On Tue, Sep 20, 2016 at 1:18 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Now that Kudu 1.0.0 is officially out and ready for production use, where do 
> we find the spark connector jar for this release?
> 
> 
> It's available in the official ASF maven repository:  
> https://repository.apache.org/#nexus-search;quick~kudu-spark 
> <https://repository.apache.org/#nexus-search;quick~kudu-spark>
> 
> <dependency>
>   <groupId>org.apache.kudu</groupId>
>   <artifactId>kudu-spark_2.10</artifactId>
>   <version>1.0.0</version>
> </dependency>
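> 
> A minimal read/write sketch against the 1.0.0 connector looks roughly like 
> this (assuming the org.apache.kudu.spark.kudu package that 1.0.0 ships, a 
> kuduMaster address string, an existing table, and the spark-shell sqlContext; 
> the table and column names are placeholders):
> 
> import org.apache.kudu.spark.kudu._
> 
> // Load a Kudu table as a DataFrame
> val df = sqlContext.read
>   .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> "my_table"))
>   .kudu
> 
> // Write a DataFrame back to the same table
> df.write
>   .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> "my_table"))
>   .mode("append")
>   .kudu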
> 
> 
> -Todd
>  
> 
> 
>> On Jun 17, 2016, at 11:08 AM, Dan Burkert <d...@cloudera.com 
>> <mailto:d...@cloudera.com>> wrote:
>> 
>> Hi Ben,
>> 
>> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I do 
>> not think we support that at this point.  I haven't looked deeply into it, 
>> but we may hit issues specifying Kudu-specific options (partitioning, column 
>> encoding, etc.).  Probably issues that can be worked through eventually, 
>> though.  If you are interested in contributing to Kudu, this is an area that 
>> could obviously use improvement!  Most or all of our Spark features have 
>> been completely community driven to date.
>>  
>> I am assuming that more Spark support along with semantic changes below will 
>> be incorporated into Kudu 0.9.1.
>> 
>> As a rule we do not release new features in patch releases, but the good 
>> news is that we are releasing regularly, and our next scheduled release is 
>> for the August timeframe (see JD's roadmap 
>> <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
>>  email about what we are aiming to include).  Also, Cloudera does publish 
>> snapshot versions of the Spark connector here 
>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so 
>> the jars are available if you don't mind using snapshots.
>>  
>> Does anyone know of a better way to make unique primary keys, other than 
>> using a UUID to make every row unique, when there is no unique column (or 
>> combination thereof) to use?
>> 
>> Not that I know of.  In general it's pretty rare to have a dataset without a 
>> natural primary key (even if it's just all of the columns), but in those 
>> cases UUID is a good solution.
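>> 
>> For example, something along these lines tacks a UUID key onto a DataFrame 
>> before writing it out (just a sketch; the column name is a placeholder, and 
>> note the generated values are not reproducible across re-runs of the job):
>> 
>> import java.util.UUID
>> import org.apache.spark.sql.functions.udf
>> 
>> // Random surrogate key per row, for datasets with no natural primary key
>> val uuid = udf(() => UUID.randomUUID().toString)
>> val withKey = df.withColumn("my_id", uuid())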
>>  
>> This is what I am using. I know auto incrementing is coming down the line 
>> (don’t know when), but is there a way to simulate this in Kudu using Spark 
>> out of curiosity?
>> 
>> To my knowledge there is no plan to have auto increment in Kudu.  
>> Distributed, consistent, auto-incrementing counters are a difficult problem, 
>> and I don't think there are any known solutions that would be fast enough 
>> for Kudu (happy to be proven wrong, though!).
>> 
>> - Dan
>>  
>> 
>> Thanks,
>> Ben
>> 
>>> On Jun 14, 2016, at 6:08 PM, Dan Burkert <d...@cloudera.com 
>>> <mailto:d...@cloudera.com>> wrote:
>>> 
>>> I'm not sure exactly what the semantics will be, but at least one of them 
>>> will be upsert.  These modes come from Spark, and they were really designed 
>>> for file-backed storage and not table storage.  We may want to do append = 
>>> upsert, and overwrite = truncate + insert.  I think that may match the 
>>> normal Spark semantics more closely.
>>> 
>>> - Dan
>>> 
>>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> Dan,
>>> 
>>> Thanks for the information. That would mean both “append” and “overwrite” 
>>> modes would be combined or not needed in the future.
>>> 
>>> Cheers,
>>> Ben
>>> 
>>>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <d...@cloudera.com 
>>>> <mailto:d...@cloudera.com>> wrote:
>>>> 
>>>> Right now append uses an update Kudu operation, which requires the row 
>>>> already be present in the table. Overwrite maps to insert.  Kudu very 
>>>> recently got upsert support baked in, but it hasn't yet been integrated 
>>>> into the Spark connector.  So pretty soon these sharp edges will get a lot 
>>>> better, since upsert is the way to go for most Spark workloads.
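>>>> 
>>>> Once that lands, the update-or-insert dance should collapse into a single 
>>>> call, roughly along these lines (sketch only; the method name is 
>>>> illustrative and is not in the current snapshot jars):
>>>> 
>>>> // Rows whose keys already exist get updated; new keys get inserted.
>>>> kuduContext.upsertRows(df, tableName)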
>>>> 
>>>> - Dan
>>>> 
>>>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuil...@gmail.com 
>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>>>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>>>> data. Now, I have to find answers to these questions. What would happen if 
>>>> I “append” data that already exists in the Kudu table? What would happen 
>>>> if I “overwrite” existing data when the DataFrame has data in it that does 
>>>> not exist in the Kudu table? I need to evaluate the best way to simulate 
>>>> the UPSERT behavior we rely on in HBase, because that is what our use case 
>>>> requires.
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>>>> 
>>>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmail.com 
>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> Now, I’m getting this error when trying to write to the table.
>>>>> 
>>>>> import scala.collection.JavaConverters._
>>>>> val key_seq = Seq("my_id")
>>>>> val key_list = List("my_id").asJava
>>>>> kuduContext.createTable(tableName, df.schema, key_seq, new 
>>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>>>> 
>>>>> df.write
>>>>>     .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
>>>>>     .mode("overwrite")
>>>>>     .kudu
>>>>> 
>>>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to 
>>>>> Kudu; sample errors: Not found: key not found (error 0)Not found: key not 
>>>>> found (error 0)Not found: key not found (error 0)Not found: key not found 
>>>>> (error 0)Not found: key not found (error 0)
>>>>> 
>>>>> Does the key field need to be first in the DataFrame?
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com 
>>>>>> <mailto:d...@cloudera.com>> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com 
>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>> Dan,
>>>>>> 
>>>>>> Thanks! It got further. Now, how do I set the Primary Key to be a 
>>>>>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>>>>> 
>>>>>> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new 
>>>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>>>>>> 
>>>>>> java.lang.IllegalArgumentException: Table partitioning must be specified 
>>>>>> using setRangePartitionColumns or addHashPartitions
>>>>>> 
>>>>>> Yep.  The `Seq("my_id")` part of that call is specifying the set of 
>>>>>> primary key columns, so in this case you have specified the single PK 
>>>>>> column "my_id".  The `addHashPartitions` call adds hash partitioning to 
>>>>>> the table, in this case over the column "my_id" (which is good, it must 
>>>>>> be over one or more PK columns, so in this case "my_id" is the one and 
>>>>>> only valid combination).  However, the call to `addHashPartitions` also 
>>>>>> takes the number of buckets as the second param.  You shouldn't get the 
>>>>>> IllegalArgumentException as long as you are specifying either 
>>>>>> `addHashPartitions` or `setRangePartitionColumns`.
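>>>>>> 
>>>>>> Concretely, a call along these lines should go through (just a sketch; 
>>>>>> the bucket count is arbitrary and the column name is a placeholder):
>>>>>> 
>>>>>> import scala.collection.JavaConverters._
>>>>>> import org.kududb.client.CreateTableOptions
>>>>>> 
>>>>>> // addHashPartitions takes a java.util.List of column names plus a bucket count
>>>>>> kuduContext.createTable(tableName, df.schema, Seq("my_id"),
>>>>>>   new CreateTableOptions()
>>>>>>     .setNumReplicas(1)
>>>>>>     .addHashPartitions(List("my_id").asJava, 4))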
>>>>>> 
>>>>>> - Dan
>>>>>>  
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> 
>>>>>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com 
>>>>>>> <mailto:d...@cloudera.com>> wrote:
>>>>>>> 
>>>>>>> Looks like we're missing an import statement in that example.  Could 
>>>>>>> you try:
>>>>>>> 
>>>>>>> import org.kududb.client._
>>>>>>> and try again?
>>>>>>> 
>>>>>>> - Dan
>>>>>>> 
>>>>>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com 
>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>> I encountered an error trying to create a table based on the 
>>>>>>> documentation from a DataFrame.
>>>>>>> 
>>>>>>> <console>:49: error: not found: type CreateTableOptions
>>>>>>>               kuduContext.createTable(tableName, df.schema, Seq("key"), 
>>>>>>> new CreateTableOptions().setNumReplicas(1))
>>>>>>> 
>>>>>>> Is there something I’m missing?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>> 
>>>>>>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>>>>>>>> <mailto:jdcry...@apache.org>> wrote:
>>>>>>>> 
>>>>>>>> It's only in Cloudera's maven repo: 
>>>>>>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>>>>>>>  
>>>>>>>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/>
>>>>>>>> 
>>>>>>>> J-D
>>>>>>>> 
>>>>>>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com 
>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>> Hi J-D,
>>>>>>>> 
>>>>>>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar 
>>>>>>>> for spark-shell to use. Can you show me where to find it?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Ben
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>>>>>>>>> <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>> 
>>>>>>>>> What's in this doc is what's gonna get released: 
>>>>>>>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>>>>>>>  
>>>>>>>>> <https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark>
>>>>>>>>> 
>>>>>>>>> J-D
>>>>>>>>> 
>>>>>>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com 
>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>> Will this be documented with examples once 0.9.0 comes out?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Ben
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>>>>>>>>>> <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>> 
>>>>>>>>>> It will be in 0.9.0.
>>>>>>>>>> 
>>>>>>>>>> J-D
>>>>>>>>>> 
>>>>>>>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com 
>>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>> Hi Chris,
>>>>>>>>>> 
>>>>>>>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Ben
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On May 18, 2016, at 9:01 AM, Chris George 
>>>>>>>>>>> <christopher.geo...@rms.com <mailto:christopher.geo...@rms.com>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> There is some code in review that needs some more refinement.
>>>>>>>>>>> It will allow upsert/insert from a DataFrame using the datasource 
>>>>>>>>>>> API. It will also allow the creation and deletion of tables from a 
>>>>>>>>>>> DataFrame.
>>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/ 
>>>>>>>>>>> <http://gerrit.cloudera.org:8080/#/c/2992/>
>>>>>>>>>>> 
>>>>>>>>>>> Example usages will look something like:
>>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc 
>>>>>>>>>>> <http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc>
>>>>>>>>>>> 
>>>>>>>>>>> -Chris George
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com 
>>>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Can someone tell me what the state is of this Spark work?
>>>>>>>>>>> 
>>>>>>>>>>> Also, does anyone have any sample code on how to update/insert data 
>>>>>>>>>>> in Kudu using DataFrames?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ben
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George 
>>>>>>>>>>>> <christopher.geo...@rms.com <mailto:christopher.geo...@rms.com>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> SparkSQL cannot support these types of statements, but we may be 
>>>>>>>>>>>> able to implement similar functionality through the API.
>>>>>>>>>>>> -Chris
>>>>>>>>>>>> 
>>>>>>>>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com 
>>>>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> It would be nice to adhere to the SQL:2003 standard for an 
>>>>>>>>>>>> “upsert” if it were to be implemented.
>>>>>>>>>>>> 
>>>>>>>>>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>>>>>>>>>  WHEN MATCHED THEN
>>>>>>>>>>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>>>>>>>>>  WHEN NOT MATCHED THEN
>>>>>>>>>>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Ben
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George 
>>>>>>>>>>>>> <christopher.geo...@rms.com <mailto:christopher.geo...@rms.com>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I have a wip kuduRDD that I made a few months ago. I pushed it 
>>>>>>>>>>>>> into gerrit if you want to take a look. 
>>>>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/ 
>>>>>>>>>>>>> <http://gerrit.cloudera.org:8080/#/c/2754/>
>>>>>>>>>>>>> It does predicate pushdown, which the existing InputFormat-based 
>>>>>>>>>>>>> RDD does not.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Within the next two weeks I’m planning to implement a datasource 
>>>>>>>>>>>>> for Spark that will have predicate pushdown and insertion/update 
>>>>>>>>>>>>> functionality (I need to look more at the Cassandra and HBase 
>>>>>>>>>>>>> datasources for the best way to do this). I agree that server-side 
>>>>>>>>>>>>> upsert would be helpful.
>>>>>>>>>>>>> Having a datasource would give us useful DataFrames and also make 
>>>>>>>>>>>>> Spark SQL usable for Kudu.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> My reasoning for having a Spark datasource and not using Impala is:
>>>>>>>>>>>>> 1. We have had trouble getting Impala to run fast with high 
>>>>>>>>>>>>> concurrency when compared to Spark.
>>>>>>>>>>>>> 2. We interact with datasources which do not integrate with Impala.
>>>>>>>>>>>>> 3. We have custom SQL query planners for extended SQL functionality.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -Chris George
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org 
>>>>>>>>>>>>> <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You guys make a convincing point, although on the upsert side 
>>>>>>>>>>>>> we'll need more support from the servers. Right now all you can 
>>>>>>>>>>>>> do is an INSERT then, if you get a dup key, do an UPDATE. I guess 
>>>>>>>>>>>>> we could at least add an API on the client side that would manage 
>>>>>>>>>>>>> it, but it wouldn't be atomic.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> J-D
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra 
>>>>>>>>>>>>> <m...@clearstorydata.com <mailto:m...@clearstorydata.com>>wrote:
>>>>>>>>>>>>> It's pretty simple, actually.  I need to support versioned 
>>>>>>>>>>>>> datasets in a Spark SQL environment.  Instead of a hack on top of 
>>>>>>>>>>>>> a Parquet data store, I'm hoping (among other reasons) to be able 
>>>>>>>>>>>>> to use Kudu's write and timestamp-based read operations to 
>>>>>>>>>>>>> support not only appending data, but also updating existing data, 
>>>>>>>>>>>>> and even some schema migration.  The most typical use case is a 
>>>>>>>>>>>>> dataset that is updated periodically (e.g., weekly or monthly) in 
>>>>>>>>>>>>> which the preliminary data in the previous window (week or 
>>>>>>>>>>>>> month) is updated with values that are expected to remain 
>>>>>>>>>>>>> unchanged from then on, and a new set of preliminary values for 
>>>>>>>>>>>>> the current window needs to be added/appended.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Using Kudu's Java API and developing additional functionality on 
>>>>>>>>>>>>> top of what Kudu has to offer isn't too much to ask, but the ease 
>>>>>>>>>>>>> of integration with Spark SQL will gate how quickly we would move 
>>>>>>>>>>>>> to using Kudu and how seriously we'd look at alternatives before 
>>>>>>>>>>>>> making that decision. 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans 
>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>>wrote:
>>>>>>>>>>>>> Mark,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for taking some time to reply in this thread, glad it 
>>>>>>>>>>>>> caught the attention of other folks!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark 
>>>>>>>>>>>>> Hamstra<m...@clearstorydata.com <mailto:m...@clearstorydata.com>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I care about insert into Kudu with Spark SQL.  I'm currently 
>>>>>>>>>>>>> delaying a refactoring of some Spark SQL-oriented insert 
>>>>>>>>>>>>> functionality while trying to evaluate what to expect from Kudu.  
>>>>>>>>>>>>> Whether Kudu does a good job supporting inserts with Spark SQL 
>>>>>>>>>>>>> will be a key consideration as to whether we adopt Kudu.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I'd like to know more about why SparkSQL inserts are necessary for 
>>>>>>>>>>>>> you. Is it just that you currently do it that way into some 
>>>>>>>>>>>>> database or Parquet, so that with minimal refactoring you'd be able 
>>>>>>>>>>>>> to use Kudu? Would re-writing those SQL lines in Scala and directly 
>>>>>>>>>>>>> using the Java API's KuduSession be too much work?
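>>>>>>>>>>>>> 
>>>>>>>>>>>>> For what it's worth, going through the Java client from Spark is 
>>>>>>>>>>>>> not much code. A rough sketch (the table and column names are 
>>>>>>>>>>>>> placeholders, and kuduMaster is assumed to hold the master address):
>>>>>>>>>>>>> 
>>>>>>>>>>>>> import org.kududb.client.KuduClient
>>>>>>>>>>>>> 
>>>>>>>>>>>>> // Open one client/session per partition and push the rows through
>>>>>>>>>>>>> df.foreachPartition { rows =>
>>>>>>>>>>>>>   val client = new KuduClient.KuduClientBuilder(kuduMaster).build()
>>>>>>>>>>>>>   try {
>>>>>>>>>>>>>     val table = client.openTable("my_table")
>>>>>>>>>>>>>     val session = client.newSession()
>>>>>>>>>>>>>     rows.foreach { row =>
>>>>>>>>>>>>>       val insert = table.newInsert()
>>>>>>>>>>>>>       insert.getRow.addLong("my_id", row.getAs[Long]("my_id"))
>>>>>>>>>>>>>       insert.getRow.addString("val", row.getAs[String]("val"))
>>>>>>>>>>>>>       session.apply(insert)
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>     session.close()
>>>>>>>>>>>>>   } finally {
>>>>>>>>>>>>>     client.close()
>>>>>>>>>>>>>   }
>>>>>>>>>>>>> }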
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Additionally, what do you expect to gain from using Kudu VS your 
>>>>>>>>>>>>> current solution? If it's not completely clear, I'd love to help 
>>>>>>>>>>>>> you think through it.
>>>>>>>>>>>>>  
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>> Yup, starting to get a good idea.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> What are your DS folks looking for in terms of functionality 
>>>>>>>>>>>>> related to Spark? A SparkSQL integration that's as fully featured 
>>>>>>>>>>>>> as Impala's? Do they care about being able to insert into Kudu with 
>>>>>>>>>>>>> SparkSQL or just being able to query real fast? Anything more 
>>>>>>>>>>>>> specific to Spark that I'm missing?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At 
>>>>>>>>>>>>> Cloudera all our resources are committed to making things happen 
>>>>>>>>>>>>> in time, and a more fully featured Spark integration isn't in our 
>>>>>>>>>>>>> plans during that period. I'm really hoping someone in the 
>>>>>>>>>>>>> community will help with Spark, the same way we got a big 
>>>>>>>>>>>>> contribution for the Flume sink. 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> J-D
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim 
>>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>>wrote:
>>>>>>>>>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But, 
>>>>>>>>>>>>> since it’s not “production-ready”, upper management doesn’t want 
>>>>>>>>>>>>> to fully deploy it yet. They just want to keep an eye on it 
>>>>>>>>>>>>> though. Kudu was so much simpler and easier to use in every 
>>>>>>>>>>>>> aspect compared to HBase. Impala was great for the report writers 
>>>>>>>>>>>>> and analysts to experiment with for the short time it was up. 
>>>>>>>>>>>>> But, once again, the only blocker was the lack of Spark support 
>>>>>>>>>>>>> for our Data Developers/Scientists. So, production-level data 
>>>>>>>>>>>>> population won’t happen until then.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans 
>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim 
>>>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The main thing I hear is that Cassandra is being used as an 
>>>>>>>>>>>>>> updatable hot data store to ensure that duplicates are taken 
>>>>>>>>>>>>>> care of and idempotency is maintained. It was not clear whether 
>>>>>>>>>>>>>> data was retrieved directly from Cassandra for analytics, 
>>>>>>>>>>>>>> reports, or searches, or what its main use was. Some also just 
>>>>>>>>>>>>>> used it as a staging area to populate downstream tables in 
>>>>>>>>>>>>>> Parquet format. The last thing I heard was that CQL was terrible, 
>>>>>>>>>>>>>> so that rules out much use of direct queries against it.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real 
>>>>>>>>>>>>>> analytics, just ease of use instead of plainly using the APIs. 
>>>>>>>>>>>>>> Even then, Kudu should beat it easily on big scans. Same for 
>>>>>>>>>>>>>> HBase. We've done benchmarks against the latter, not the former.
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> As for our company, we have been looking for a long time for an 
>>>>>>>>>>>>>> updatable data store that can be queried quickly and directly, 
>>>>>>>>>>>>>> either with Spark SQL, Impala, or some other SQL engine, and that 
>>>>>>>>>>>>>> can still handle TBs or PBs of data without performance 
>>>>>>>>>>>>>> degradation or many configuration headaches. For now, we are 
>>>>>>>>>>>>>> using HBase in this role, with Phoenix as a fast way to query the 
>>>>>>>>>>>>>> data directly. I can see Kudu as the best way to fill this gap 
>>>>>>>>>>>>>> easily, especially being the closest thing to other relational 
>>>>>>>>>>>>>> databases out there in familiarity for the many SQL analytics 
>>>>>>>>>>>>>> people in our company. The other alternative would be to go with 
>>>>>>>>>>>>>> AWS Redshift for the same reasons, but it would come at a cost, 
>>>>>>>>>>>>>> of course. If we went with either solution, Kudu or Redshift, it 
>>>>>>>>>>>>>> would get rid of the need to extract from HBase into Parquet 
>>>>>>>>>>>>>> tables or export to PostgreSQL to support more of the SQL 
>>>>>>>>>>>>>> language used by analysts or the reporting software we use.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off 
>>>>>>>>>>>>>> with Kudu. Have you folks tried Kudu with Impala yet with those 
>>>>>>>>>>>>>> use cases?
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I hope this helps.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Ben 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Ha first time I'm hearing about SMACK. Inside Cloudera we like 
>>>>>>>>>>>>>>> to refer to "Impala + Kudu" as Kimpala, but yeah it's not as 
>>>>>>>>>>>>>>> sexy. My colleagues who were also there did say that the hype 
>>>>>>>>>>>>>>> around Spark isn't dying down.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> There's definitely an overlap in the use cases that Cassandra, 
>>>>>>>>>>>>>>> HBase, and Kudu cater to. I wouldn't go as far as saying that 
>>>>>>>>>>>>>>> C* is just an interim solution for the use case you describe.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Nothing significant happened in Kudu over the past month, it's 
>>>>>>>>>>>>>>> a storage engine so things move slowly *smile*. I'd love to see 
>>>>>>>>>>>>>>> more contributions on the Spark front. I know there's code out 
>>>>>>>>>>>>>>> there that could be integrated in kudu-spark, it just needs to 
>>>>>>>>>>>>>>> land in gerrit. I'm sure folks will happily review it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Do you have relevant experiences you can share? I'd love to 
>>>>>>>>>>>>>>> learn more about the use cases for which you envision using 
>>>>>>>>>>>>>>> Kudu as a C* replacement.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim 
>>>>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> My colleagues recently came back from Strata in San Jose. They 
>>>>>>>>>>>>>>> told me that everything was about Spark and there is a big buzz 
>>>>>>>>>>>>>>> about the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I 
>>>>>>>>>>>>>>> still think that Cassandra is just an interim solution as a 
>>>>>>>>>>>>>>> low-latency, easily queried data store. I was wondering if 
>>>>>>>>>>>>>>> anything significant happened in regards to Kudu, especially on 
>>>>>>>>>>>>>>> the Spark front. Plus, can you come up with your own proposed 
>>>>>>>>>>>>>>> stack acronym to promote?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Ben,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> AFAIK no one in the dev community committed to any timeline. I 
>>>>>>>>>>>>>>>> know of one person on the Kudu Slack who's working on a better 
>>>>>>>>>>>>>>>> RDD, but that's about it.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com 
>>>>>>>>>>>>>>>> <mailto:b...@amobee.com>> wrote:
>>>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to 
>>>>>>>>>>>>>>>> target a version of Kudu to begin real testing of Spark 
>>>>>>>>>>>>>>>> against it for our devs. At least, I can tell them what 
>>>>>>>>>>>>>>>> timeframe to anticipate.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Just curious,
>>>>>>>>>>>>>>>> Benjamin Kim
>>>>>>>>>>>>>>>> Data Solutions Architect
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Mobile: +1 818 635 2900 <tel:%2B1%20818%20635%202900>
>>>>>>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |  
>>>>>>>>>>>>>>>> www.amobee.com <http://www.amobee.com/>
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's 
>>>>>>>>>>>>>>>>> needed either.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> The kuduRDD is just leveraging the MR input format; ideally 
>>>>>>>>>>>>>>>>> we'd use scans directly.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of 
>>>>>>>>>>>>>>>>> pushdown. It's really basic.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> The goal was to provide something for others to contribute 
>>>>>>>>>>>>>>>>> to. We have some basic unit tests that others can easily 
>>>>>>>>>>>>>>>>> extend. None of us on the team are Spark experts, but we'd be 
>>>>>>>>>>>>>>>>> really happy to assist anyone who wants to improve the kudu-spark code.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim 
>>>>>>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> It looks like it fulfills most of the basic requirements 
>>>>>>>>>>>>>>>>> (kudu RDD, kudu DStream) in KUDU-1214. Am I right? Besides 
>>>>>>>>>>>>>>>>> shoring up more Spark SQL functionality (Dataframes) and 
>>>>>>>>>>>>>>>>> doing the documentation, what more needs to be done? 
>>>>>>>>>>>>>>>>> Optimizations?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I believe that it's a good place to start using Spark with 
>>>>>>>>>>>>>>>>> Kudu and to compare it to HBase with Spark (which is not clean).
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this 
>>>>>>>>>>>>>>>>>> in for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321 
>>>>>>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1321>
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> It's a really simple wrapper, and yes you can use SparkSQL 
>>>>>>>>>>>>>>>>>> on Kudu, but it will require a lot more work to make it 
>>>>>>>>>>>>>>>>>> fast/useful.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim 
>>>>>>>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>> I see this KUDU-1214 
>>>>>>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted for 
>>>>>>>>>>>>>>>>>> 0.8.0, but I see no progress on it. When this is complete, 
>>>>>>>>>>>>>>>>>> will this mean that Spark will be able to work with Kudu 
>>>>>>>>>>>>>>>>>> both programmatically and as a client via Spark SQL? Or is 
>>>>>>>>>>>>>>>>>> there more work that needs to be done on the Spark side for 
>>>>>>>>>>>>>>>>>> it to work?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera
