http://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark
On Tue, Sep 20, 2016 at 5:00 PM Benjamin Kim <bbuil...@gmail.com> wrote:

> I see that the API has changed a bit so my old code doesn’t work anymore.
> Can someone direct me to some code samples?
>
> Thanks,
> Ben
>
> On Sep 20, 2016, at 1:44 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
> On Tue, Sep 20, 2016 at 1:18 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Now that Kudu 1.0.0 is officially out and ready for production use, where
>> do we find the spark connector jar for this release?
>
> It's available in the official ASF maven repository:
> https://repository.apache.org/#nexus-search;quick~kudu-spark
>
> <dependency>
>   <groupId>org.apache.kudu</groupId>
>   <artifactId>kudu-spark_2.10</artifactId>
>   <version>1.0.0</version>
> </dependency>
>
> -Todd
>
>> On Jun 17, 2016, at 11:08 AM, Dan Burkert <d...@cloudera.com> wrote:
>>
>> Hi Ben,
>>
>> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I
>> do not think we support that at this point. I haven't looked deeply into
>> it, but we may hit issues specifying Kudu-specific options (partitioning,
>> column encoding, etc.). Probably issues that can be worked through
>> eventually, though. If you are interested in contributing to Kudu, this is
>> an area that could obviously use improvement! Most or all of our Spark
>> features have been completely community driven to date.
>>
>>> I am assuming that more Spark support along with semantic changes below
>>> will be incorporated into Kudu 0.9.1.
>>
>> As a rule we do not release new features in patch releases, but the good
>> news is that we are releasing regularly, and our next scheduled release is
>> for the August timeframe (see JD's roadmap email
>> <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
>> about what we are aiming to include).
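[Editor's note: for anyone landing on this thread, a minimal sketch of what usage of the renamed 1.0.0 API looks like in spark-shell, based on the docs linked above. The master address and table name are placeholders, and `sqlContext` is the spark-shell-provided SQLContext; note the package moved from org.kududb to org.apache.kudu in 1.0.]

```scala
import org.apache.kudu.spark.kudu._

// "kudu-master:7051" and "my_table" are placeholder values.
val kuduMaster = "kudu-master:7051"
val kuduContext = new KuduContext(kuduMaster)

// Read a Kudu table as a DataFrame; the .kudu implicit comes from the import above.
val df = sqlContext.read
  .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> "my_table"))
  .kudu

// Upsert the DataFrame's rows into an existing table.
kuduContext.upsertRows(df, "my_table")
```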
Also, Cloudera does publish snapshot >> versions of the Spark connector here >> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, >> so the jars are available if you don't mind using snapshots. >> >> >>> Anyone know of a better way to make unique primary keys other than using >>> UUID to make every row unique if there is no unique column (or combination >>> thereof) to use? >>> >> >> Not that I know of. In general it's pretty rare to have a dataset >> without a natural primary key (even if it's just all of the columns), but >> in those cases UUID is a good solution. >> >> >>> This is what I am using. I know auto incrementing is coming down the >>> line (don’t know when), but is there a way to simulate this in Kudu using >>> Spark out of curiosity? >>> >> >> To my knowledge there is no plan to have auto increment in Kudu. >> Distributed, consistent, auto incrementing counters is a difficult problem, >> and I don't think there are any known solutions that would be fast enough >> for Kudu (happy to be proven wrong, though!). >> >> - Dan >> >> >>> >>> Thanks, >>> Ben >>> >>> On Jun 14, 2016, at 6:08 PM, Dan Burkert <d...@cloudera.com> wrote: >>> >>> I'm not sure exactly what the semantics will be, but at least one of >>> them will be upsert. These modes come from spark, and they were really >>> designed for file-backed storage and not table storage. We may want to do >>> append = upsert, and overwrite = truncate + insert. I think that may match >>> the normal spark semantics more closely. >>> >>> - Dan >>> >>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com> >>> wrote: >>> >>>> Dan, >>>> >>>> Thanks for the information. That would mean both “append” and >>>> “overwrite” modes would be combined or not needed in the future.
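[Editor's note: a minimal sketch of the UUID surrogate-key approach Dan describes, assuming a Spark DataFrame `df` that lacks a natural key; the column name "id" is a placeholder.]

```scala
import java.util.UUID
import org.apache.spark.sql.functions.udf

// Generate a random (version 4) UUID per row as a surrogate primary key.
val makeId = udf(() => UUID.randomUUID().toString)
val withKey = df.withColumn("id", makeId())
// "id" can then be listed as the primary-key column when creating the Kudu table.
```

One caveat: the UDF is non-deterministic, so if Spark recomputes a partition (e.g. on a task retry), the same row can get a different UUID; persisting or checkpointing the keyed DataFrame before writing avoids surprises.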
>>>> >>>> Cheers, >>>> Ben >>>> >>>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <d...@cloudera.com> wrote: >>>> >>>> Right now append uses an update Kudu operation, which requires the row >>>> already be present in the table. Overwrite maps to insert. Kudu very >>>> recently got upsert support baked in, but it hasn't yet been integrated >>>> into the Spark connector. So pretty soon these sharp edges will get a lot >>>> better, since upsert is the way to go for most spark workloads. >>>> >>>> - Dan >>>> >>>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuil...@gmail.com> >>>> wrote: >>>> >>>>> I tried to use the “append” mode, and it worked. Over 3.8 million rows >>>>> in 64s. I would assume that now I can use the “overwrite” mode on existing >>>>> data. Now, I have to find answers to these questions. What would happen if >>>>> I “append” to the data in the Kudu table if the data already exists? What >>>>> would happen if I “overwrite” existing data when the DataFrame has data in >>>>> it that does not exist in the Kudu table? I need to evaluate the best way >>>>> to simulate the UPSERT behavior in HBase because this is what our use case >>>>> is. >>>>> >>>>> Thanks, >>>>> Ben >>>>> >>>>> >>>>> >>>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmail.com> wrote: >>>>> >>>>> Hi, >>>>> >>>>> Now, I’m getting this error when trying to write to the table. 
>>>>>
>>>>> import scala.collection.JavaConverters._
>>>>> val key_seq = Seq("my_id")
>>>>> val key_list = List("my_id").asJava
>>>>> kuduContext.createTable(tableName, df.schema, key_seq, new
>>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>>>>
>>>>> df.write
>>>>>   .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
>>>>>   .mode("overwrite")
>>>>>   .kudu
>>>>>
>>>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame
>>>>> to Kudu; sample errors: Not found: key not found (error 0) Not found: key
>>>>> not found (error 0) Not found: key not found (error 0) Not found: key not
>>>>> found (error 0) Not found: key not found (error 0)
>>>>>
>>>>> Does the key field need to be first in the DataFrame?
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com> wrote:
>>>>>
>>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Dan,
>>>>>>
>>>>>> Thanks! It got further. Now, how do I set the Primary Key to be a
>>>>>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>>>>>
>>>>>> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new
>>>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>>>>>>
>>>>>> java.lang.IllegalArgumentException: Table partitioning must be
>>>>>> specified using setRangePartitionColumns or addHashPartitions
>>>>>
>>>>> Yep. The `Seq("my_id")` part of that call is specifying the set of
>>>>> primary key columns, so in this case you have specified the single PK
>>>>> column "my_id". The `addHashPartitions` call adds hash partitioning to the
>>>>> table, in this case over the column "my_id" (which is good, it must be over
>>>>> one or more PK columns, so in this case "my_id" is the one and only valid
>>>>> combination). However, the call to `addHashPartitions` also takes the
>>>>> number of buckets as the second param.
You shouldn't get the >>>>> IllegalArgumentException as long as you are specifying either >>>>> `addHashPartitions` or `setRangePartitionColumns`. >>>>> >>>>> - Dan >>>>> >>>>> >>>>>> >>>>>> Thanks, >>>>>> Ben >>>>>> >>>>>> >>>>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com> wrote: >>>>>> >>>>>> Looks like we're missing an import statement in that example. Could >>>>>> you try: >>>>>> >>>>>> import org.kududb.client._ >>>>>> >>>>>> and try again? >>>>>> >>>>>> - Dan >>>>>> >>>>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> I encountered an error trying to create a table based on the >>>>>>> documentation from a DataFrame. >>>>>>> >>>>>>> <console>:49: error: not found: type CreateTableOptions >>>>>>> kuduContext.createTable(tableName, df.schema, >>>>>>> Seq("key"), new CreateTableOptions().setNumReplicas(1)) >>>>>>> >>>>>>> Is there something I’m missing? >>>>>>> >>>>>>> Thanks, >>>>>>> Ben >>>>>>> >>>>>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> >>>>>>> wrote: >>>>>>> >>>>>>> It's only in Cloudera's maven repo: >>>>>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/ >>>>>>> >>>>>>> J-D >>>>>>> >>>>>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi J-D, >>>>>>>> >>>>>>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark >>>>>>>> jar for spark-shell to use. Can you show me where to find it? 
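[Editor's note: to make Dan's point concrete, a sketch of the two valid CreateTableOptions forms named in the IllegalArgumentException above, using the 0.9-era org.kududb package (renamed to org.apache.kudu in 1.0); the column name, replica count, and bucket count are example values.]

```scala
import scala.collection.JavaConverters._
import org.kududb.client.CreateTableOptions

// Hash partitioning: the second argument to addHashPartitions is the bucket count.
val hashOpts = new CreateTableOptions()
  .setNumReplicas(1)
  .addHashPartitions(List("my_id").asJava, 100)

// Range partitioning over the same PK column is the other accepted form.
val rangeOpts = new CreateTableOptions()
  .setNumReplicas(1)
  .setRangePartitionColumns(List("my_id").asJava)
```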
>>>>>>>> >>>>>>>> Thanks, >>>>>>>> Ben >>>>>>>> >>>>>>>> >>>>>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org> >>>>>>>> wrote: >>>>>>>> >>>>>>>> What's in this doc is what's gonna get released: >>>>>>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark >>>>>>>> >>>>>>>> J-D >>>>>>>> >>>>>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Will this be documented with examples once 0.9.0 comes out? >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Ben >>>>>>>>> >>>>>>>>> >>>>>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans < >>>>>>>>> jdcry...@apache.org> wrote: >>>>>>>>> >>>>>>>>> It will be in 0.9.0. >>>>>>>>> >>>>>>>>> J-D >>>>>>>>> >>>>>>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi Chris, >>>>>>>>>> >>>>>>>>>> Will all this effort be rolled into 0.9.0 and be ready for use? >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Ben >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On May 18, 2016, at 9:01 AM, Chris George < >>>>>>>>>> christopher.geo...@rms.com> wrote: >>>>>>>>>> >>>>>>>>>> There is some code in review that needs some more refinement. >>>>>>>>>> It will allow upsert/insert from a dataframe using the datasource >>>>>>>>>> api. It will also allow the creation and deletion of tables from a >>>>>>>>>> dataframe >>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/ >>>>>>>>>> >>>>>>>>>> Example usages will look something like: >>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc >>>>>>>>>> >>>>>>>>>> -Chris George >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>> Can someone tell me what the state is of this Spark work? >>>>>>>>>> >>>>>>>>>> Also, does anyone have any sample code on how to update/insert >>>>>>>>>> data in Kudu using DataFrames? 
>>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Ben >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George < >>>>>>>>>> christopher.geo...@rms.com> wrote: >>>>>>>>>> >>>>>>>>>> SparkSQL cannot support these types of statements but we may be >>>>>>>>>> able to implement similar functionality through the api. >>>>>>>>>> -Chris >>>>>>>>>> >>>>>>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>> It would be nice to adhere to the SQL:2003 standard for an >>>>>>>>>> “upsert” if it were to be implemented. >>>>>>>>>> >>>>>>>>>> MERGE INTO table_name USING table_reference ON (condition) >>>>>>>>>> WHEN MATCHED THEN >>>>>>>>>> UPDATE SET column1 = value1 [, column2 = value2 ...] >>>>>>>>>> WHEN NOT MATCHED THEN >>>>>>>>>> INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …]) >>>>>>>>>> >>>>>>>>>> Cheers, >>>>>>>>>> Ben >>>>>>>>>> >>>>>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George < >>>>>>>>>> christopher.geo...@rms.com> wrote: >>>>>>>>>> >>>>>>>>>> I have a wip kuduRDD that I made a few months ago. I pushed it >>>>>>>>>> into gerrit if you want to take a look. >>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/ >>>>>>>>>> It does pushdown predicates which the existing input formatter >>>>>>>>>> based rdd does not. >>>>>>>>>> >>>>>>>>>> Within the next two weeks I’m planning to implement a datasource >>>>>>>>>> for spark that will have pushdown predicates and insertion/update >>>>>>>>>> functionality (need to look more at cassandra and the hbase >>>>>>>>>> datasource for >>>>>>>>>> best way to do this). I agree that server side upsert would be >>>>>>>>>> helpful. >>>>>>>>>> Having a datasource would give us useful data frames and also >>>>>>>>>> make spark sql usable for kudu. >>>>>>>>>> >>>>>>>>>> My reasoning for having a spark datasource and not using Impala >>>>>>>>>> is: 1. We have had trouble getting impala to run fast with high >>>>>>>>>> concurrency >>>>>>>>>> when compared to spark 2.
We interact with datasources which do not >>>>>>>>>> integrate with impala. 3. We have custom sql query planners for >>>>>>>>>> extended >>>>>>>>>> sql functionality. >>>>>>>>>> >>>>>>>>>> -Chris George >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> You guys make a convincing point, although on the upsert side >>>>>>>>>> we'll need more support from the servers. Right now all you can do >>>>>>>>>> is an >>>>>>>>>> INSERT then, if you get a dup key, do an UPDATE. I guess we could at >>>>>>>>>> least >>>>>>>>>> add an API on the client side that would manage it, but it wouldn't >>>>>>>>>> be >>>>>>>>>> atomic. >>>>>>>>>> >>>>>>>>>> J-D >>>>>>>>>> >>>>>>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra < >>>>>>>>>> m...@clearstorydata.com>wrote: >>>>>>>>>> >>>>>>>>>>> It's pretty simple, actually. I need to support versioned >>>>>>>>>>> datasets in a Spark SQL environment. Instead of a hack on top of a >>>>>>>>>>> Parquet >>>>>>>>>>> data store, I'm hoping (among other reasons) to be able to use >>>>>>>>>>> Kudu's write >>>>>>>>>>> and timestamp-based read operations to support not only appending >>>>>>>>>>> data, but >>>>>>>>>>> also updating existing data, and even some schema migration. The >>>>>>>>>>> most >>>>>>>>>>> typical use case is a dataset that is updated periodically (e.g., >>>>>>>>>>> weekly or >>>>>>>>>>> monthly) in which the preliminary data in the previous window >>>>>>>>>>> (week or >>>>>>>>>>> month) is updated with values that are expected to remain unchanged >>>>>>>>>>> from >>>>>>>>>>> then on, and a new set of preliminary values for the current window >>>>>>>>>>> need to >>>>>>>>>>> be added/appended.
>>>>>>>>>>> >>>>>>>>>>> Using Kudu's Java API and developing additional functionality on >>>>>>>>>>> top of what Kudu has to offer isn't too much to ask, but the ease of >>>>>>>>>>> integration with Spark SQL will gate how quickly we would move to >>>>>>>>>>> using >>>>>>>>>>> Kudu and how seriously we'd look at alternatives before making that >>>>>>>>>>> decision. >>>>>>>>>>> >>>>>>>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans < >>>>>>>>>>> jdcry...@apache.org>wrote: >>>>>>>>>>> >>>>>>>>>>>> Mark, >>>>>>>>>>>> >>>>>>>>>>>> Thanks for taking some time to reply in this thread, glad it >>>>>>>>>>>> caught the attention of other folks! >>>>>>>>>>>> >>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra< >>>>>>>>>>>> m...@clearstorydata.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Do they care being able to insert into Kudu with SparkSQL >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I care about insert into Kudu with Spark SQL. I'm currently >>>>>>>>>>>>> delaying a refactoring of some Spark SQL-oriented insert >>>>>>>>>>>>> functionality >>>>>>>>>>>>> while trying to evaluate what to expect from Kudu. Whether Kudu >>>>>>>>>>>>> does a >>>>>>>>>>>>> good job supporting inserts with Spark SQL will be a key >>>>>>>>>>>>> consideration as >>>>>>>>>>>>> to whether we adopt Kudu. >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I'd like to know more about why SparkSQL inserts are necessary >>>>>>>>>>>> for you. Is it just that you currently do it that way into some >>>>>>>>>>>> database or >>>>>>>>>>>> parquet so with minimal refactoring you'd be able to use Kudu? >>>>>>>>>>>> Would >>>>>>>>>>>> re-writing those SQL lines into Scala and directly use the Java >>>>>>>>>>>> API's >>>>>>>>>>>> KuduSession be too much work? >>>>>>>>>>>> >>>>>>>>>>>> Additionally, what do you expect to gain from using Kudu VS >>>>>>>>>>>> your current solution? If it's not completely clear, I'd love to >>>>>>>>>>>> help you >>>>>>>>>>>> think through it.
>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans < >>>>>>>>>>>>> jdcry...@apache.org> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Yup, starting to get a good idea. >>>>>>>>>>>>>> >>>>>>>>>>>>>> What are your DS folks looking for in terms of functionality >>>>>>>>>>>>>> related to Spark? A SparkSQL integration that's as fully >>>>>>>>>>>>>> featured as >>>>>>>>>>>>>> Impala's? Do they care being able to insert into Kudu with >>>>>>>>>>>>>> SparkSQL or just >>>>>>>>>>>>>> being able to query real fast? Anything more specific to Spark >>>>>>>>>>>>>> that I'm >>>>>>>>>>>>>> missing? >>>>>>>>>>>>>> >>>>>>>>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At >>>>>>>>>>>>>> Cloudera all our resources are committed to making things happen >>>>>>>>>>>>>> in time, >>>>>>>>>>>>>> and a more fully featured Spark integration isn't in our plans >>>>>>>>>>>>>> during that >>>>>>>>>>>>>> period. I'm really hoping someone in the community will help >>>>>>>>>>>>>> with Spark, >>>>>>>>>>>>>> the same way we got a big contribution for the Flume sink. >>>>>>>>>>>>>> >>>>>>>>>>>>>> J-D >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim < >>>>>>>>>>>>>> bbuil...@gmail.com>wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. >>>>>>>>>>>>>>> But, since it’s not “production-ready”, upper management >>>>>>>>>>>>>>> doesn’t want to >>>>>>>>>>>>>>> fully deploy it yet. They just want to keep an eye on it >>>>>>>>>>>>>>> though. Kudu was >>>>>>>>>>>>>>> so much simpler and easier to use in every aspect compared to >>>>>>>>>>>>>>> HBase. Impala >>>>>>>>>>>>>>> was great for the report writers and analysts to experiment >>>>>>>>>>>>>>> with for the >>>>>>>>>>>>>>> short time it was up. But, once again, the only blocker was the >>>>>>>>>>>>>>> lack of >>>>>>>>>>>>>>> Spark support for our Data Developers/Scientists. 
So, >>>>>>>>>>>>>>> production-level data >>>>>>>>>>>>>>> population won’t happen until then. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I hope this helps you get an idea where I am coming from… >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>> Ben >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans < >>>>>>>>>>>>>>> jdcry...@apache.org> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim < >>>>>>>>>>>>>>> bbuil...@gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> J-D, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The main thing I hear that Cassandra is being used as an >>>>>>>>>>>>>>>> updatable hot data store to ensure that duplicates are taken >>>>>>>>>>>>>>>> care of and >>>>>>>>>>>>>>>> idempotency is maintained. Whether data was directly retrieved >>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>> Cassandra for analytics, reports, or searches, it was not >>>>>>>>>>>>>>>> clear as to what >>>>>>>>>>>>>>>> was its main use. Some also just used it for a staging area to >>>>>>>>>>>>>>>> populate >>>>>>>>>>>>>>>> downstream tables in parquet format. The last thing I heard >>>>>>>>>>>>>>>> was that CQL >>>>>>>>>>>>>>>> was terrible, so that rules out much use of direct queries >>>>>>>>>>>>>>>> against it. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real >>>>>>>>>>>>>>> analytics, just ease of use instead of plainly using the APIs. >>>>>>>>>>>>>>> Even then, >>>>>>>>>>>>>>> Kudu should beat it easily on big scans. Same for HBase. We've >>>>>>>>>>>>>>> done >>>>>>>>>>>>>>> benchmarks against the latter, not the former. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> As for our company, we have been looking for an updatable >>>>>>>>>>>>>>>> data store for a long time that can be quickly queried >>>>>>>>>>>>>>>> directly either >>>>>>>>>>>>>>>> using Spark SQL or Impala or some other SQL engine and still >>>>>>>>>>>>>>>> handle TB or >>>>>>>>>>>>>>>> PB of data without performance degradation and many >>>>>>>>>>>>>>>> configuration >>>>>>>>>>>>>>>> headaches. For now, we are using HBase to take on this role >>>>>>>>>>>>>>>> with Phoenix as >>>>>>>>>>>>>>>> a fast way to directly query the data. I can see Kudu as the >>>>>>>>>>>>>>>> best way to >>>>>>>>>>>>>>>> fill this gap easily, especially being the closest thing to >>>>>>>>>>>>>>>> other >>>>>>>>>>>>>>>> relational databases out there in familiarity for the many SQL >>>>>>>>>>>>>>>> analytics >>>>>>>>>>>>>>>> people in our company. The other alternative would be to go >>>>>>>>>>>>>>>> with AWS >>>>>>>>>>>>>>>> Redshift for the same reasons, but it would come at a cost, of >>>>>>>>>>>>>>>> course. If >>>>>>>>>>>>>>>> we went with either solution, Kudu or Redshift, it would get >>>>>>>>>>>>>>>> rid of the >>>>>>>>>>>>>>>> need to extract from HBase to parquet tables or export to >>>>>>>>>>>>>>>> PostgreSQL to >>>>>>>>>>>>>>>> support more of the SQL language used by analysts or the >>>>>>>>>>>>>>>> reporting >>>>>>>>>>>>>>>> software we use. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off >>>>>>>>>>>>>>> with Kudu. Have you folks tried Kudu with Impala yet with those >>>>>>>>>>>>>>> use cases? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I hope this helps. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>>> Ben >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans < >>>>>>>>>>>>>>>> jdcry...@apache.org> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Ha first time I'm hearing about SMACK. Inside Cloudera we >>>>>>>>>>>>>>>> like to refer to "Impala + Kudu" as Kimpala, but yeah it's not >>>>>>>>>>>>>>>> as sexy. My >>>>>>>>>>>>>>>> colleagues who were also there did say that the hype around >>>>>>>>>>>>>>>> Spark isn't >>>>>>>>>>>>>>>> dying down. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> There's definitely an overlap in the use cases that >>>>>>>>>>>>>>>> Cassandra, HBase, and Kudu cater to. I wouldn't go as far as >>>>>>>>>>>>>>>> saying that C* >>>>>>>>>>>>>>>> is just an interim solution for the use case you describe. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Nothing significant happened in Kudu over the past month, >>>>>>>>>>>>>>>> it's a storage engine so things move slowly *smile*. I'd love >>>>>>>>>>>>>>>> to see more >>>>>>>>>>>>>>>> contributions on the Spark front. I know there's code out >>>>>>>>>>>>>>>> there that could >>>>>>>>>>>>>>>> be integrated in kudu-spark, it just needs to land in gerrit. >>>>>>>>>>>>>>>> I'm sure >>>>>>>>>>>>>>>> folks will happily review it. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Do you have relevant experiences you can share? I'd love to >>>>>>>>>>>>>>>> learn more about the use cases for which you envision using >>>>>>>>>>>>>>>> Kudu as a C* >>>>>>>>>>>>>>>> replacement. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> J-D >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim < >>>>>>>>>>>>>>>> bbuil...@gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi J-D, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> My colleagues recently came back from Strata in San Jose. 
>>>>>>>>>>>>>>>>> They told me that everything was about Spark and there is a >>>>>>>>>>>>>>>>> big buzz about >>>>>>>>>>>>>>>>> the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I >>>>>>>>>>>>>>>>> still think that >>>>>>>>>>>>>>>>> Cassandra is just an interim solution as a low-latency, >>>>>>>>>>>>>>>>> easily queried data >>>>>>>>>>>>>>>>> store. I was wondering if anything significant happened in >>>>>>>>>>>>>>>>> regards to Kudu, >>>>>>>>>>>>>>>>> especially on the Spark front. Plus, can you come up with >>>>>>>>>>>>>>>>> your own proposed >>>>>>>>>>>>>>>>> stack acronym to promote? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>>>> Ben >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans < >>>>>>>>>>>>>>>>> jdcry...@apache.org> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Ben, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> AFAIK no one in the dev community committed to any >>>>>>>>>>>>>>>>> timeline. I know of one person on the Kudu Slack who's >>>>>>>>>>>>>>>>> working on a better >>>>>>>>>>>>>>>>> RDD, but that's about it. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> J-D >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim < >>>>>>>>>>>>>>>>> b...@amobee.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi J-D, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to >>>>>>>>>>>>>>>>>> target a version of Kudu to begin real testing of Spark >>>>>>>>>>>>>>>>>> against it for our >>>>>>>>>>>>>>>>>> devs. At least, I can tell them what timeframe to anticipate. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Just curious, >>>>>>>>>>>>>>>>>> *Benjamin Kim* >>>>>>>>>>>>>>>>>> *Data Solutions Architect* >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> [a•mo•bee] *(n.)* the company defining digital marketing. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> *Mobile: +1 818 635 2900* >>>>>>>>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA >>>>>>>>>>>>>>>>>> 90405 | www.amobee.com >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans < >>>>>>>>>>>>>>>>>> jdcry...@apache.org> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if >>>>>>>>>>>>>>>>>> it's needed either. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The kuduRDD is just leveraging the MR input format, >>>>>>>>>>>>>>>>>> ideally we'd use scans directly. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of >>>>>>>>>>>>>>>>>> pushdown. It's really basic. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The goal was to provide something for others to >>>>>>>>>>>>>>>>>> contribute to. We have some basic unit tests that others can >>>>>>>>>>>>>>>>>> easily extend. >>>>>>>>>>>>>>>>>> None of us on the team are Spark experts, but we'd be really >>>>>>>>>>>>>>>>>> happy to >>>>>>>>>>>>>>>>>> assist anyone improving the kudu-spark code. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> J-D >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim < >>>>>>>>>>>>>>>>>> bbuil...@gmail.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> J-D, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> It looks like it fulfills most of the basic requirements >>>>>>>>>>>>>>>>>>> (kudu RDD, kudu DStream) in KUDU-1214. Am I right? Besides >>>>>>>>>>>>>>>>>>> shoring up more >>>>>>>>>>>>>>>>>>> Spark SQL functionality (Dataframes) and doing the >>>>>>>>>>>>>>>>>>> documentation, what more >>>>>>>>>>>>>>>>>>> needs to be done? Optimizations? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I believe that it’s a good place to start using Spark >>>>>>>>>>>>>>>>>>> with Kudu and compare it to HBase with Spark (not clean).
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>> Ben >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans < >>>>>>>>>>>>>>>>>>> jdcry...@apache.org> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get >>>>>>>>>>>>>>>>>>> this in for 0.7.0: >>>>>>>>>>>>>>>>>>> https://issues.cloudera.org/browse/KUDU-1321 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> It's a really simple wrapper, and yes you can use >>>>>>>>>>>>>>>>>>> SparkSQL on Kudu, but it will require a lot more work to >>>>>>>>>>>>>>>>>>> make it >>>>>>>>>>>>>>>>>>> fast/useful. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hope this helps, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> J-D >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim < >>>>>>>>>>>>>>>>>>> bbuil...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I see this KUDU-1214 >>>>>>>>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted >>>>>>>>>>>>>>>>>>>> for 0.8.0, but I see no progress on it. When this is >>>>>>>>>>>>>>>>>>>> complete, will this >>>>>>>>>>>>>>>>>>>> mean that Spark will be able to work with Kudu both >>>>>>>>>>>>>>>>>>>> programmatically and as >>>>>>>>>>>>>>>>>>>> a client via Spark SQL? Or is there more work that needs >>>>>>>>>>>>>>>>>>>> to be done on the >>>>>>>>>>>>>>>>>>>> Spark side for it to work? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Just curious. 
>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>> Ben

> --
> Todd Lipcon
> Software Engineer, Cloudera