I tried the “append” mode, and it worked. Over 3.8 million rows in 64s. I 
assume that I can now use the “overwrite” mode on existing data. Now, I 
need answers to these questions. What would happen if I “append” to the 
Kudu table when the data already exists? What would happen if I 
“overwrite” existing data when the DataFrame contains rows that do not exist 
in the Kudu table? I need to evaluate the best way to simulate the UPSERT 
behavior we have in HBase, because that is our use case.
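
To be clear about what I expect, here is a rough plain-Scala sketch of my 
mental model (this is only a simulation for discussion, not the kudu-spark 
API; the names insert/update/upsert are mine):

```scala
// Hypothetical sketch, NOT the kudu-spark API: a plain-Scala model of how I
// understand the write modes to map onto Kudu operations. "append" looks
// like INSERT (errors on an existing key), "overwrite" looks like UPDATE
// (errors on a missing key), and a true UPSERT handles both cases.
import scala.collection.mutable

val table = mutable.Map.empty[String, Int] // stand-in for the Kudu table

// "append" mode -> INSERT: fails when the key already exists
def insert(key: String, value: Int): Either[String, Unit] =
  if (table.contains(key)) Left(s"duplicate key: $key")
  else { table(key) = value; Right(()) }

// "overwrite" mode -> UPDATE: fails when the key does not exist
def update(key: String, value: Int): Either[String, Unit] =
  if (!table.contains(key)) Left(s"key not found: $key")
  else { table(key) = value; Right(()) }

// UPSERT: insert-or-update, succeeds either way
def upsert(key: String, value: Int): Unit = table(key) = value
```

If “overwrite” really maps to UPDATE underneath, that would be consistent 
with the “key not found” errors I pasted below.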

Thanks,
Ben


> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> 
> Hi,
> 
> Now, I’m getting this error when trying to write to the table.
> 
> import scala.collection.JavaConverters._
> val key_seq = Seq("my_id")
> val key_list = List("my_id").asJava
> kuduContext.createTable(tableName, df.schema, key_seq, new 
> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
> 
> df.write
>     .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
>     .mode("overwrite")
>     .kudu
> 
> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to Kudu; 
> sample errors: Not found: key not found (error 0)Not found: key not found 
> (error 0)Not found: key not found (error 0)Not found: key not found (error 
> 0)Not found: key not found (error 0)
> 
> Does the key field need to be first in the DataFrame?
> 
> Thanks,
> Ben
> 
>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com 
>> <mailto:d...@cloudera.com>> wrote:
>> 
>> 
>> 
>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Dan,
>> 
>> Thanks! It got further. Now, how do I set the Primary Key to be a column(s) 
>> in the DataFrame and set the partitioning? Is it like this?
>> 
>> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new 
>> CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>> 
>> java.lang.IllegalArgumentException: Table partitioning must be specified 
>> using setRangePartitionColumns or addHashPartitions
>> 
>> Yep.  The `Seq("my_id")` part of that call is specifying the set of primary 
>> key columns, so in this case you have specified the single PK column 
>> "my_id".  The `addHashPartitions` call adds hash partitioning to the table, 
>> in this case over the column "my_id" (which is good, it must be over one or 
>> more PK columns, so in this case "my_id" is the one and only valid 
>> combination).  However, the call to `addHashPartition` also takes the number 
>> of buckets as the second param.  You shouldn't get the 
>> IllegalArgumentException as long as you are specifying either 
>> `addHashPartitions` or `setRangePartitionColumns`.
>> 
>> - Dan
>>  
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com 
>>> <mailto:d...@cloudera.com>> wrote:
>>> 
>>> Looks like we're missing an import statement in that example.  Could you 
>>> try:
>>> 
>>> import org.kududb.client._
>>> and try again?
>>> 
>>> - Dan
>>> 
>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> I encountered an error trying to create a table based on the documentation 
>>> from a DataFrame.
>>> 
>>> <console>:49: error: not found: type CreateTableOptions
>>>               kuduContext.createTable(tableName, df.schema, Seq("key"), new 
>>> CreateTableOptions().setNumReplicas(1))
>>> 
>>> Is there something I’m missing?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>>>> <mailto:jdcry...@apache.org>> wrote:
>>>> 
>>>> It's only in Cloudera's maven repo: 
>>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>>> 
>>>> J-D
>>>> 
>>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com 
>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Hi J-D,
>>>> 
>>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for 
>>>> spark-shell to use. Can you show me where to find it?
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>>>>> <mailto:jdcry...@apache.org>> wrote:
>>>>> 
>>>>> What's in this doc is what's gonna get released: 
>>>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>>> 
>>>>> J-D
>>>>> 
>>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com 
>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>> Will this be documented with examples once 0.9.0 comes out?
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>> 
>>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>>>>>> <mailto:jdcry...@apache.org>> wrote:
>>>>>> 
>>>>>> It will be in 0.9.0.
>>>>>> 
>>>>>> J-D
>>>>>> 
>>>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com 
>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>> Hi Chris,
>>>>>> 
>>>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> 
>>>>>>> On May 18, 2016, at 9:01 AM, Chris George <christopher.geo...@rms.com 
>>>>>>> <mailto:christopher.geo...@rms.com>> wrote:
>>>>>>> 
>>>>>>> There is some code in review that needs some more refinement.
>>>>>>> It will allow upsert/insert from a DataFrame using the datasource API. 
>>>>>>> It will also allow the creation and deletion of tables from a DataFrame:
>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/ 
>>>>>>> 
>>>>>>> Example usages will look something like:
>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc 
>>>>>>> 
>>>>>>> -Chris George
>>>>>>> 
>>>>>>> 
>>>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com 
>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>> 
>>>>>>> Can someone tell me what the state is of this Spark work?
>>>>>>> 
>>>>>>> Also, does anyone have any sample code on how to update/insert data in 
>>>>>>> Kudu using DataFrames?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>> 
>>>>>>> 
>>>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George <christopher.geo...@rms.com 
>>>>>>>> <mailto:christopher.geo...@rms.com>> wrote:
>>>>>>>> 
>>>>>>>> SparkSQL cannot support these types of statements, but we may be able to 
>>>>>>>> implement similar functionality through the API.
>>>>>>>> -Chris
>>>>>>>> 
>>>>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com 
>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>> 
>>>>>>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if 
>>>>>>>> it were to be implemented.
>>>>>>>> 
>>>>>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>>>>>  WHEN MATCHED THEN
>>>>>>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>>>>>  WHEN NOT MATCHED THEN
>>>>>>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Ben
>>>>>>>> 
>>>>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George 
>>>>>>>>> <christopher.geo...@rms.com <mailto:christopher.geo...@rms.com>> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> I have a wip kuduRDD that I made a few months ago. I pushed it into 
>>>>>>>>> gerrit if you want to take a look. 
>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/ 
>>>>>>>>> It does push down predicates, which the existing input-format-based 
>>>>>>>>> RDD does not.
>>>>>>>>> 
>>>>>>>>> Within the next two weeks I’m planning to implement a datasource for 
>>>>>>>>> Spark that will have pushdown predicates and insertion/update 
>>>>>>>>> functionality (I need to look more at the Cassandra and HBase 
>>>>>>>>> datasources for the best way to do this). I agree that server-side 
>>>>>>>>> upsert would be helpful.
>>>>>>>>> Having a datasource would give us useful DataFrames and also make 
>>>>>>>>> Spark SQL usable for Kudu.
>>>>>>>>> 
>>>>>>>>> My reasoning for having a Spark datasource and not using Impala is: 
>>>>>>>>> 1. We have had trouble getting Impala to run fast with high 
>>>>>>>>> concurrency when compared to Spark. 2. We interact with datasources 
>>>>>>>>> which do not integrate with Impala. 3. We have custom SQL query 
>>>>>>>>> planners for extended SQL functionality.
>>>>>>>>> 
>>>>>>>>> -Chris George
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org 
>>>>>>>>> <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>> 
>>>>>>>>> You guys make a convincing point, although on the upsert side we'll 
>>>>>>>>> need more support from the servers. Right now all you can do is an 
>>>>>>>>> INSERT then, if you get a dup key, do an UPDATE. I guess we could at 
>>>>>>>>> least add an API on the client side that would manage it, but it 
>>>>>>>>> wouldn't be atomic.
>>>>>>>>> 
>>>>>>>>> J-D
>>>>>>>>> 
>>>>>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra 
>>>>>>>>> <m...@clearstorydata.com <mailto:m...@clearstorydata.com>> wrote:
>>>>>>>>> It's pretty simple, actually.  I need to support versioned datasets 
>>>>>>>>> in a Spark SQL environment.  Instead of a hack on top of a Parquet 
>>>>>>>>> data store, I'm hoping (among other reasons) to be able to use Kudu's 
>>>>>>>>> write and timestamp-based read operations to support not only 
>>>>>>>>> appending data, but also updating existing data, and even some schema 
>>>>>>>>> migration.  The most typical use case is a dataset that is updated 
>>>>>>>>> periodically (e.g., weekly or monthly) in which the preliminary 
>>>>>>>>> data in the previous window (week or month) is updated with values 
>>>>>>>>> that are expected to remain unchanged from then on, and a new set of 
>>>>>>>>> preliminary values for the current window needs to be added/appended.
>>>>>>>>> 
>>>>>>>>> Using Kudu's Java API and developing additional functionality on top 
>>>>>>>>> of what Kudu has to offer isn't too much to ask, but the ease of 
>>>>>>>>> integration with Spark SQL will gate how quickly we would move to 
>>>>>>>>> using Kudu and how seriously we'd look at alternatives before making 
>>>>>>>>> that decision. 
>>>>>>>>> 
>>>>>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans 
>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>> Mark,
>>>>>>>>> 
>>>>>>>>> Thanks for taking some time to reply in this thread, glad it caught 
>>>>>>>>> the attention of other folks!
>>>>>>>>> 
>>>>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra 
>>>>>>>>> <m...@clearstorydata.com <mailto:m...@clearstorydata.com>> wrote:
>>>>>>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>>>>>> 
>>>>>>>>> I care about insert into Kudu with Spark SQL.  I'm currently delaying 
>>>>>>>>> a refactoring of some Spark SQL-oriented insert functionality while 
>>>>>>>>> trying to evaluate what to expect from Kudu.  Whether Kudu does a 
>>>>>>>>> good job supporting inserts with Spark SQL will be a key 
>>>>>>>>> consideration as to whether we adopt Kudu.
>>>>>>>>> 
>>>>>>>>> I'd like to know more about why SparkSQL inserts are necessary for 
>>>>>>>>> you. Is it just that you currently do it that way into some database 
>>>>>>>>> or Parquet, so with minimal refactoring you'd be able to use Kudu? 
>>>>>>>>> Would re-writing those SQL lines in Scala and directly using the Java 
>>>>>>>>> API's KuduSession be too much work?
>>>>>>>>> 
>>>>>>>>> Additionally, what do you expect to gain from using Kudu VS your 
>>>>>>>>> current solution? If it's not completely clear, I'd love to help you 
>>>>>>>>> think through it.
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans 
>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>> Yup, starting to get a good idea.
>>>>>>>>> 
>>>>>>>>> What are your DS folks looking for in terms of functionality related 
>>>>>>>>> to Spark? A SparkSQL integration that's as fully featured as 
>>>>>>>>> Impala's? Do they care about being able to insert into Kudu with 
>>>>>>>>> SparkSQL, or just about being able to query real fast? Anything more 
>>>>>>>>> specific to Spark that I'm missing?
>>>>>>>>> 
>>>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera 
>>>>>>>>> all our resources are committed to making things happen in time, and 
>>>>>>>>> a more fully featured Spark integration isn't in our plans during 
>>>>>>>>> that period. I'm really hoping someone in the community will help 
>>>>>>>>> with Spark, the same way we got a big contribution for the Flume 
>>>>>>>>> sink. 
>>>>>>>>> 
>>>>>>>>> J-D
>>>>>>>>> 
>>>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com 
>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But, 
>>>>>>>>> since it’s not “production-ready”, upper management doesn’t want to 
>>>>>>>>> fully deploy it yet. They just want to keep an eye on it though. Kudu 
>>>>>>>>> was so much simpler and easier to use in every aspect compared to 
>>>>>>>>> HBase. Impala was great for the report writers and analysts to 
>>>>>>>>> experiment with for the short time it was up. But, once again, the 
>>>>>>>>> only blocker was the lack of Spark support for our Data 
>>>>>>>>> Developers/Scientists. So, production-level data population won’t 
>>>>>>>>> happen until then.
>>>>>>>>> 
>>>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> Ben
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans 
>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>> 
>>>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com 
>>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>> J-D,
>>>>>>>>>> 
>>>>>>>>>> The main thing I hear is that Cassandra is being used as an updatable 
>>>>>>>>>> hot data store to ensure that duplicates are taken care of and 
>>>>>>>>>> idempotency is maintained. It was not clear whether data was directly 
>>>>>>>>>> retrieved from Cassandra for analytics, reports, or searches, or what 
>>>>>>>>>> its main use was. Some also just used it as a staging area to 
>>>>>>>>>> populate downstream tables in Parquet format. The last thing I heard 
>>>>>>>>>> was that CQL was terrible, so that rules out much use of direct 
>>>>>>>>>> queries against it.
>>>>>>>>>> 
>>>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real analytics, 
>>>>>>>>>> just ease of use instead of plainly using the APIs. Even then, Kudu 
>>>>>>>>>> should beat it easily on big scans. Same for HBase. We've done 
>>>>>>>>>> benchmarks against the latter, not the former.
>>>>>>>>>>  
>>>>>>>>>> 
>>>>>>>>>> As for our company, we have been looking for an updatable data store 
>>>>>>>>>> for a long time that can be quickly queried directly either using 
>>>>>>>>>> Spark SQL or Impala or some other SQL engine and still handle TB or 
>>>>>>>>>> PB of data without performance degradation and many configuration 
>>>>>>>>>> headaches. For now, we are using HBase to take on this role with 
>>>>>>>>>> Phoenix as a fast way to directly query the data. I can see Kudu as 
>>>>>>>>>> the best way to fill this gap easily, especially being the closest 
>>>>>>>>>> thing to other relational databases out there in familiarity for the 
>>>>>>>>>> many SQL analytics people in our company. The other alternative 
>>>>>>>>>> would be to go with AWS Redshift for the same reasons, but it would 
>>>>>>>>>> come at a cost, of course. If we went with either solution, Kudu or 
>>>>>>>>>> Redshift, it would get rid of the need to extract from HBase into 
>>>>>>>>>> Parquet tables or export to PostgreSQL to support more of the SQL 
>>>>>>>>>> language used by analysts or the reporting software we use.
>>>>>>>>>> 
>>>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off with 
>>>>>>>>>> Kudu. Have you folks tried Kudu with Impala yet with those use cases?
>>>>>>>>>>  
>>>>>>>>>> 
>>>>>>>>>> I hope this helps.
>>>>>>>>>> 
>>>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>>>  
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Ben 
>>>>>>>>>> 
>>>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>>>>>>>>>>> <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Ha first time I'm hearing about SMACK. Inside Cloudera we like to 
>>>>>>>>>>> refer to "Impala + Kudu" as Kimpala, but yeah it's not as sexy. My 
>>>>>>>>>>> colleagues who were also there did say that the hype around Spark 
>>>>>>>>>>> isn't dying down.
>>>>>>>>>>> 
>>>>>>>>>>> There's definitely an overlap in the use cases that Cassandra, 
>>>>>>>>>>> HBase, and Kudu cater to. I wouldn't go as far as saying that C* is 
>>>>>>>>>>> just an interim solution for the use case you describe.
>>>>>>>>>>> 
>>>>>>>>>>> Nothing significant happened in Kudu over the past month, it's a 
>>>>>>>>>>> storage engine so things move slowly *smile*. I'd love to see more 
>>>>>>>>>>> contributions on the Spark front. I know there's code out there 
>>>>>>>>>>> that could be integrated in kudu-spark, it just needs to land in 
>>>>>>>>>>> gerrit. I'm sure folks will happily review it.
>>>>>>>>>>> 
>>>>>>>>>>> Do you have relevant experiences you can share? I'd love to learn 
>>>>>>>>>>> more about the use cases for which you envision using Kudu as a C* 
>>>>>>>>>>> replacement.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> J-D
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuil...@gmail.com 
>>>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>> Hi J-D,
>>>>>>>>>>> 
>>>>>>>>>>> My colleagues recently came back from Strata in San Jose. They told 
>>>>>>>>>>> me that everything was about Spark and there is a big buzz about 
>>>>>>>>>>> the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I still 
>>>>>>>>>>> think that Cassandra is just an interim solution as a low-latency, 
>>>>>>>>>>> easily queried data store. I was wondering if anything significant 
>>>>>>>>>>> happened in regards to Kudu, especially on the Spark front. Plus, 
>>>>>>>>>>> can you come up with your own proposed stack acronym to promote?
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Ben
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans 
>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Ben,
>>>>>>>>>>>> 
>>>>>>>>>>>> AFAIK no one in the dev community committed to any timeline. I 
>>>>>>>>>>>> know of one person on the Kudu Slack who's working on a better 
>>>>>>>>>>>> RDD, but that's about it.
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> 
>>>>>>>>>>>> J-D
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com 
>>>>>>>>>>>> <mailto:b...@amobee.com>> wrote:
>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>> 
>>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target a 
>>>>>>>>>>>> version of Kudu to begin real testing of Spark against it for our 
>>>>>>>>>>>> devs. At least, I can tell them what timeframe to anticipate.
>>>>>>>>>>>> 
>>>>>>>>>>>> Just curious,
>>>>>>>>>>>> Benjamin Kim
>>>>>>>>>>>> Data Solutions Architect
>>>>>>>>>>>> 
>>>>>>>>>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>>>>>>>>> 
>>>>>>>>>>>> Mobile: +1 818 635 2900
>>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |  
>>>>>>>>>>>> www.amobee.com <http://www.amobee.com/>
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed 
>>>>>>>>>>>>> either.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The kuduRDD is just leveraging the MR input format, ideally we'd 
>>>>>>>>>>>>> use scans directly.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of 
>>>>>>>>>>>>> pushdown. It's really basic.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The goal was to provide something for others to contribute to. We 
>>>>>>>>>>>>> have some basic unit tests that others can easily extend. None of 
>>>>>>>>>>>>> us on the team are Spark experts, but we'd be really happy to 
>>>>>>>>>>>>> assist anyone in improving the kudu-spark code.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> J-D
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuil...@gmail.com 
>>>>>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu 
>>>>>>>>>>>>> RDD, kudu DStream) in KUDU-1214. Am I right? Besides shoring up 
>>>>>>>>>>>>> more Spark SQL functionality (Dataframes) and doing the 
>>>>>>>>>>>>> documentation, what more needs to be done? Optimizations?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I believe that it’s a good place to start using Spark with Kudu 
>>>>>>>>>>>>> and to compare it to HBase with Spark (which is not clean).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this in 
>>>>>>>>>>>>>> for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It's a really simple wrapper, and yes you can use SparkSQL on 
>>>>>>>>>>>>>> Kudu, but it will require a lot more work to make it fast/useful.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim 
>>>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>>>>> I see this KUDU-1214 
>>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted for 
>>>>>>>>>>>>>> 0.8.0, but I see no progress on it. When this is complete, will 
>>>>>>>>>>>>>> this mean that Spark will be able to work with Kudu both 
>>>>>>>>>>>>>> programmatically and as a client via Spark SQL? Or is there more 
>>>>>>>>>>>>>> work that needs to be done on the Spark side for it to work?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
