Re: Kudu support in Cloudera Director

2016-07-13 Thread Jean-Daniel Cryans
Hi Amit,

This is a vendor-specific question, there's nothing we can do in Kudu to
enable Cloudera Director support. I'd suggest you contact Cloudera.

Thanks,

J-D

On Wed, Jul 13, 2016 at 2:36 AM, Amit Adhau  wrote:

> Hi Kudu Team,
>
> At present, Kudu roles are not supported by Cloudera Director in the
> configuration file which we use to set up the cluster.
>
> Can you please suggest when we can expect Kudu support in Cloudera
> Director?
>
> --
> Thanks & Regards,
>
> *Amit Adhau* | Data Architect
>
> *GLOBANT* | IND:+91 9821518132
>
>
> The information contained in this e-mail may be confidential. It has been
> sent for the sole use of the intended recipient(s). If the reader of this
> message is not an intended recipient, you are hereby notified that any
> unauthorized review, use, disclosure, dissemination, distribution or
> copying of this communication, or any of its contents,
> is strictly prohibited. If you have received it by mistake please let us
> know by e-mail immediately and delete it from your system. Many thanks.
>
>
>
>
>


Re: Error using newDelete from java API

2016-07-05 Thread Jean-Daniel Cryans
Hi,

This kind of operation is currently not supported. If you want to delete
all the rows whose key starts with "102", you need to first read them with a
scan, then issue a delete for each.
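
For what it's worth, a rough sketch of that scan-then-delete pattern with the 0.9 Java client could look like the following. The master address is a placeholder, the column types are assumed to be longs as in your snippet, and this is untested:

```java
import org.kududb.client.*;

public class DeleteByPartialKey {
  public static void main(String[] args) throws Exception {
    // Placeholder master address; adjust for your cluster.
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      KuduTable table = client.openTable("my_table");
      KuduSession session = client.newSession();

      // Scan only the rows matching the partial key (keyfield1 == 102).
      KuduScanner scanner = client.newScannerBuilder(table)
          .addPredicate(KuduPredicate.newComparisonPredicate(
              table.getSchema().getColumn("keyfield1"),
              KuduPredicate.ComparisonOp.EQUAL, 102L))
          .build();

      while (scanner.hasMoreRows()) {
        RowResultIterator results = scanner.nextRows();
        while (results.hasNext()) {
          RowResult result = results.next();
          // Each Delete must set the full primary key, hence the scan first.
          Delete delete = table.newDelete();
          PartialRow row = delete.getRow();
          row.addLong("keyfield1", result.getLong("keyfield1"));
          row.addLong("keyfield2", result.getLong("keyfield2"));
          session.apply(delete);
        }
      }
      session.flush();
    } finally {
      client.shutdown();
    }
  }
}
```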

J-D

On Tue, Jul 5, 2016 at 1:18 PM, Juan Pablo Briganti <
juan.briga...@globant.com> wrote:

> Hello!
>
> I have a short question. I'm trying to delete from a Kudu table using the
> Java API client (latest version, 0.9) and I'm getting an error.
> I have a table with a composite primary key based on 2 fields (keyfield1
> and keyfield2). When I try to delete from that table using keyfield1 as a
> filter, I receive the following error:
>
> java.lang.IllegalStateException: Primary key column keyfield2 is not set
>
> The code I'm using is:
>
> Delete delete = table.newDelete();
> PartialRow row = delete.getRow();
> row.addLong("keyfield1", 102);
> session.apply(delete);
>
> Is there any other way to delete a set of rows using java API?
> Can I delete using a field that is part of the primary key, or one that is
> not part of it at all?
>
> If you need more info let me know.
> Thanks for the help!
>
> --
>
>
>


Re: Kundera - JPA compliant Object Datastore Mapper for Kudu

2016-07-01 Thread Jean-Daniel Cryans
Sounds good Karthik :)

Yeah, the data model is close to relational, without the relations, and many
features are currently missing, like indexing.

Keep us posted if there's anything we can do on the Kudu side to make your
life easier.

Thanks,

J-D



On Fri, Jul 1, 2016 at 4:20 AM, Karthik Prasad Manchala <
karthikp.manch...@impetus.co.in> wrote:

> Thanks for sharing J-D,
>
>
> Since Kudu's data model is very close to relational databases, using Kudu
> with Object Mappers would be very effective for some use cases.
>
>
> Kundera supports Polyglot persistence
> <https://github.com/impetus-opensource/Kundera/wiki/Polyglot-Persistence>
> out-of-the-box, which allows usage and querying of more than one database
> (e.g., Kudu + Cassandra) with the simple JPA interface.
>
>
> Kundera also supports indexing of column data on Elasticsearch for
> extensive querying, with just a few configuration changes. Using this, one
> can get the advantage of complex querying on any datastore.
>
>
> In addition, it comes with all the other advantages/disadvantages of a
> typical Object Mapper.
>
>
> Best,
>
> Karthik
> --
> *From:* Jean-Daniel Cryans 
> *Sent:* 30 June 2016 20:49
> *To:* user@kudu.incubator.apache.org
> *Cc:* kundera
> *Subject:* Re: Kundera - JPA compliant Object Datastore Mapper for Kudu
>
> Cool, great to hear, Karthik :)
>
> We do need some sort of roadmap on the website/wiki/or just as a list of
> jiras with tags. Right now we've been mostly executing on this:
> http://mail-archives.apache.org/mod_mbox/kudu-dev/201602.mbox/%3ccagptdncmbwwx8p+ygkzhfl2xcmktscu-rhlcqfsns1uvsbr...@mail.gmail.com%3E
>
> We currently don't recommend running Kudu in production, so you won't find
> many instances of it. This will change with 1.0 as described in the link
> above. You can still find some companies using Kudu, for example Xiaomi's
> use case is described from slide #13 in this deck:
> http://www.slideshare.net/cloudera/apache-kudu-incubating-new-hadoop-storage-for-fast-analytics-on-fast-data-by-todd-lipcon-software-engineer-cloudera-kudu-founder
>
> As for insights on using Kudu with Object Mappers, you tell me :) I think
> one thing missing that'll make it even more powerful would be to have
> nested types support in Kudu. There's a jira for it but no one's working on
> it: https://issues.apache.org/jira/browse/KUDU-1261
>
> Cheers,
>
> J-D
>
> On Thu, Jun 30, 2016 at 1:48 AM, Karthik Prasad Manchala <
> karthikp.manch...@impetus.co.in> wrote:
>
>> Hi J-D,
>>
>>
>> I have positive feedback on Kudu's API. I can say it is one of the
>> easiest APIs I have used to date.
>>
>>
>> There are no weird things that I noticed, but I did not come across
>> any roadmap of planned Kudu features. It would be
>> amazing to know.
>>
>>
>> Also, I was wondering if you could share any instances where Kudu
>> is being used in production? And some insights on using Kudu with an Object
>> Mapper tool like Kundera?
>>
>>
>> Thanks,
>>
>> Karthik.
>> --
>> *From:* Jean-Daniel Cryans 
>> *Sent:* 29 June 2016 22:40
>> *To:* user@kudu.incubator.apache.org
>> *Subject:* Re: Kundera - JPA compliant Object Datastore Mapper for Kudu
>>
>> Hi Karthik!
>>
>> Thanks for sharing this.
>>
>> I see that you've written most of the code so I wonder, do you have any
>> feedback on Kudu's APIs? Any weird things you noticed? Any gotchas?
>>
>> We're getting close to 1.0, so we still have some time to make
>> (potentially breaking) changes.
>>
>> Thanks!
>>
>> J-D
>>
>> On Wed, Jun 29, 2016 at 3:48 AM, Karthik Prasad Manchala <
>> karthikp.manch...@impetus.co.in> wrote:
>>
>>> Hi all,
>>>
>>>
>>> Kundera <https://github.com/impetus-opensource/Kundera>, one of
>>> the most popular JPA providers for NoSQL datastores, has added support for
>>> basic CRUD operations and Select queries on Kudu. Please feel free to
>>> explore more using the link below.
>>>
>>>
>>> - https://github.com/impetus-opensource/Kundera/wiki/Kundera-with-Kudu
>>>
>>>
>>> Thanks and regards,
>>>
>>> Team Kundera.
>>>
>>> --

Re: Kundera - JPA compliant Object Datastore Mapper for Kudu

2016-06-30 Thread Jean-Daniel Cryans
Cool, great to hear, Karthik :)

We do need some sort of roadmap on the website/wiki/or just as a list of
jiras with tags. Right now we've been mostly executing on this:
http://mail-archives.apache.org/mod_mbox/kudu-dev/201602.mbox/%3ccagptdncmbwwx8p+ygkzhfl2xcmktscu-rhlcqfsns1uvsbr...@mail.gmail.com%3E

We currently don't recommend running Kudu in production, so you won't find
many instances of it. This will change with 1.0 as described in the link
above. You can still find some companies using Kudu, for example Xiaomi's
use case is described from slide #13 in this deck:
http://www.slideshare.net/cloudera/apache-kudu-incubating-new-hadoop-storage-for-fast-analytics-on-fast-data-by-todd-lipcon-software-engineer-cloudera-kudu-founder

As for insights on using Kudu with Object Mappers, you tell me :) I think
one thing missing that'll make it even more powerful would be to have
nested types support in Kudu. There's a jira for it but no one's working on
it: https://issues.apache.org/jira/browse/KUDU-1261

Cheers,

J-D

On Thu, Jun 30, 2016 at 1:48 AM, Karthik Prasad Manchala <
karthikp.manch...@impetus.co.in> wrote:

> Hi J-D,
>
>
> I have positive feedback on Kudu's API. I can say it is one of the easiest
> APIs I have used to date.
>
>
> There are no weird things that I noticed, but I did not come across
> any roadmap of planned Kudu features. It would be
> amazing to know.
>
>
> Also, I was wondering if you could share any instances where Kudu is being
> used in production? And some insights on using Kudu with an Object
> Mapper tool like Kundera?
>
>
> Thanks,
>
> Karthik.
> --
> *From:* Jean-Daniel Cryans 
> *Sent:* 29 June 2016 22:40
> *To:* user@kudu.incubator.apache.org
> *Subject:* Re: Kundera - JPA compliant Object Datastore Mapper for Kudu
>
> Hi Karthik!
>
> Thanks for sharing this.
>
> I see that you've written most of the code so I wonder, do you have any
> feedback on Kudu's APIs? Any weird things you noticed? Any gotchas?
>
> We're getting close to 1.0, so we still have some time to make
> (potentially breaking) changes.
>
> Thanks!
>
> J-D
>
> On Wed, Jun 29, 2016 at 3:48 AM, Karthik Prasad Manchala <
> karthikp.manch...@impetus.co.in> wrote:
>
>> Hi all,
>>
>>
>> Kundera <https://github.com/impetus-opensource/Kundera>, one of the
>> most popular JPA providers for NoSQL datastores, has added support for basic
>> CRUD operations and Select queries on Kudu. Please feel free to explore
>> more using the link below.
>>
>>
>> - https://github.com/impetus-opensource/Kundera/wiki/Kundera-with-Kudu
>>
>>
>> Thanks and regards,
>>
>> Team Kundera.
>>
>> --
>>
>>
>>
>>
>>
>>
>> NOTE: This message may contain information that is confidential,
>> proprietary, privileged or otherwise protected by law. The message is
>> intended solely for the named addressee. If received in error, please
>> destroy and notify the sender. Any use of this email is prohibited when
>> received in error. Impetus does not represent, warrant and/or guarantee,
>> that the integrity of this communication has been maintained nor that the
>> communication is free of errors, virus, interception or interference.
>>
>
>
> --
>
>
>
>
>
>
>


Re: Kundera - JPA compliant Object Datastore Mapper for Kudu

2016-06-29 Thread Jean-Daniel Cryans
Hi Karthik!

Thanks for sharing this.

I see that you've written most of the code so I wonder, do you have any
feedback on Kudu's APIs? Any weird things you noticed? Any gotchas?

We're getting close to 1.0, so we still have some time to make (potentially
breaking) changes.

Thanks!

J-D

On Wed, Jun 29, 2016 at 3:48 AM, Karthik Prasad Manchala <
karthikp.manch...@impetus.co.in> wrote:

> Hi all,
>
>
> Kundera, one of the
> most popular JPA providers for NoSQL datastores, has added support for basic
> CRUD operations and Select queries on Kudu. Please feel free to explore
> more using the link below.
>
>
> - https://github.com/impetus-opensource/Kundera/wiki/Kundera-with-Kudu
>
>
> Thanks and regards,
>
> Team Kundera.
>
> --
>
>
>
>
>
>
>


Re: Kudu for Debian system

2016-06-17 Thread Jean-Daniel Cryans
Hi Murugan,

The Cloudera convenience packages are only released for RHEL 6, 7 and
Ubuntu 14.04 as per this page:
http://www.cloudera.com/documentation/betas/kudu/latest/topics/kudu_installation.html

If you have other questions regarding those binaries, please post them on
this Cloudera forum:
http://community.cloudera.com/t5/Beta-Releases-Apache-Kudu/bd-p/Beta

BTW, I think you should be able to compile Kudu on Debian 8, but we haven't
tested it, as you can see here: https://gerrit.cloudera.org/#/c/3389/

Cheers,

J-D

On Fri, Jun 17, 2016 at 5:32 AM, Murugan Sundararaj ( Retail Platform -
ASF)  wrote:

> I do not find a wheezy or squeeze distro at
> http://archive.cloudera.com/beta/kudu/debian/wheezy/amd64/kudu/dists/
>
> Does that mean Kudu is not supported on Debian systems?
>
> Thanks,
> Murugan
>


Re: Kudu QuickStart VM 0.9.0?

2016-06-15 Thread Jean-Daniel Cryans
It's up. It's a different filename since I also upgraded from 5.4.9 to
5.7.1, so you'll need to update your kudu-examples repo first.

J-D

On Wed, Jun 15, 2016 at 10:19 AM, Tom White  wrote:

> Thanks J-D.
>
> Tom
>
> On Wed, Jun 15, 2016 at 6:06 PM, Jean-Daniel Cryans 
> wrote:
> > Hey Tom,
> >
> > Yeah it's on me to update it, trying to get that done this week.
> >
> > J-D
> >
> > On Wed, Jun 15, 2016 at 10:04 AM, Tom White  wrote:
> >>
> >> Hi,
> >>
> >> I tried downloading the VM for the new release, but it looks like it's
> >> still on 0.7.0:
> >>
> >>
> >>
> https://github.com/cloudera/kudu-examples/commit/9a22e9f6280094f029c049a7776cce3458150e7f
> >>
> >> Are there plans to update it? I find it very useful for trying out Kudu.
> >>
> >> Thanks!
> >> Tom
> >
> >
>


Re: Kudu QuickStart VM 0.9.0?

2016-06-15 Thread Jean-Daniel Cryans
Hey Tom,

Yeah it's on me to update it, trying to get that done this week.

J-D

On Wed, Jun 15, 2016 at 10:04 AM, Tom White  wrote:

> Hi,
>
> I tried downloading the VM for the new release, but it looks like it's
> still on 0.7.0:
>
>
> https://github.com/cloudera/kudu-examples/commit/9a22e9f6280094f029c049a7776cce3458150e7f
>
> Are there plans to update it? I find it very useful for trying out Kudu.
>
> Thanks!
> Tom
>


Re: Spark on Kudu

2016-06-14 Thread Jean-Daniel Cryans
It's only in Cloudera's maven repo:
https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/

J-D

On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim  wrote:

> Hi J-D,
>
> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for
> spark-shell to use. Can you show me where to find it?
>
> Thanks,
> Ben
>
>
> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans 
> wrote:
>
> What's in this doc is what's gonna get released:
> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>
> J-D
>
> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim  wrote:
>
>> Will this be documented with examples once 0.9.0 comes out?
>>
>> Thanks,
>> Ben
>>
>>
>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans 
>> wrote:
>>
>> It will be in 0.9.0.
>>
>> J-D
>>
>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim  wrote:
>>
>>> Hi Chris,
>>>
>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On May 18, 2016, at 9:01 AM, Chris George 
>>> wrote:
>>>
>>> There is some code in review that needs some more refinement.
>>> It will allow upsert/insert from a dataframe using the datasource api.
>>> It will also allow the creation and deletion of tables from a dataframe
>>> http://gerrit.cloudera.org:8080/#/c/2992/
>>>
>>> Example usages will look something like:
>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>>
>>> -Chris George
>>>
>>>
>>> On 5/18/16, 9:45 AM, "Benjamin Kim"  wrote:
>>>
>>> Can someone tell me what the state is of this Spark work?
>>>
>>> Also, does anyone have any sample code on how to update/insert data in
>>> Kudu using DataFrames?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Apr 13, 2016, at 8:22 AM, Chris George 
>>> wrote:
>>>
>>> SparkSQL cannot support these types of statements, but we may be able to
>>> implement similar functionality through the API.
>>> -Chris
>>>
>>> On 4/12/16, 5:19 PM, "Benjamin Kim"  wrote:
>>>
>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if
>>> it were to be implemented.
>>>
>>> MERGE INTO table_name USING table_reference ON (condition)
>>>  WHEN MATCHED THEN
>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>  WHEN NOT MATCHED THEN
>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>
>>> Cheers,
>>> Ben
>>>
>>> On Apr 11, 2016, at 12:21 PM, Chris George 
>>> wrote:
>>>
>>> I have a wip kuduRDD that I made a few months ago. I pushed it into
>>> gerrit if you want to take a look.
>>> http://gerrit.cloudera.org:8080/#/c/2754/
>>> It does pushdown predicates which the existing input formatter based rdd
>>> does not.
>>>
>>> Within the next two weeks I’m planning to implement a datasource for
>>> spark that will have pushdown predicates and insertion/update functionality
>>> (need to look more at cassandra and the hbase datasource for best way to do
>>> this) I agree that server side upsert would be helpful.
>>> Having a datasource would give us useful data frames and also make spark
>>> sql usable for kudu.
>>>
>>> My reasoning for having a spark datasource and not using Impala is: 1.
>>> We have had trouble getting impala to run fast with high concurrency when
>>> compared to spark 2. We interact with datasources which do not integrate
>>> with impala. 3. We have custom sql query planners for extended sql
>>> functionality.
>>>
>>> -Chris George
>>>
>>>
>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans"  wrote:
>>>
>>> You guys make a convincing point, although on the upsert side we'll need
>>> more support from the servers. Right now all you can do is an INSERT then,
>>> if you get a dup key, do an UPDATE. I guess we could at least add an API on
>>> the client side that would manage it, but it wouldn't be atomic.
>>>
>>> J-D
>>>
>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra 
>>> wrote:
>>>
>>>> It's pretty simple, actually.  I need to support versioned datasets in
>>>

Re: [ANNOUNCE] Apache Kudu (incubating) 0.9.0 released

2016-06-13 Thread Jean-Daniel Cryans
(only replying on user@)

Hi Ben,

For Cloudera-specific question, please use this forum:
http://community.cloudera.com/t5/Beta-Releases-Apache-Kudu/bd-p/Beta

Short answer is that what you're asking for should be available this week.
It always lags the ASF release by a few days.

Cheers,

J-D

On Mon, Jun 13, 2016 at 9:56 AM, Benjamin Kim  wrote:

> Hi J-D,
>
> I would like to get this started especially now that UPSERT and Spark SQL
> DataFrames support. But, how do I use Cloudera Manager to deploy it? Is
> there a parcel available yet? Is there a new CSD file to be downloaded? I
> currently have CM 5.7.0 installed.
>
> Thanks,
> Ben
>
>
>
> On Jun 10, 2016, at 7:39 AM, Jean-Daniel Cryans 
> wrote:
>
> The Apache Kudu (incubating) team is happy to announce the release of Kudu
> 0.9.0!
>
> Kudu is an open source storage engine for structured data which supports
> low-latency random access together with efficient analytical access
> patterns. It is designed within the context of the Apache Hadoop ecosystem
> and supports many integrations with other data analytics projects both
> inside and outside of the Apache Software Foundation.
>
> This latest version adds basic UPSERT functionality and an improved Apache
> Spark Data Source that doesn’t rely on the MapReduce I/O formats. It also
> improves Tablet Server restart time as well as write performance under high
> load. Finally, Kudu now enforces the specification of a partitioning scheme
> for new tables.
>
> Download it here: http://getkudu.io/releases/0.9.0/
>
> Regards,
>
> The Apache Kudu (incubating) team
>
> ===
>
> Apache Kudu (incubating) is an effort undergoing incubation at The Apache
> Software
> Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is
> required of all newly accepted projects until a further review
> indicates that the infrastructure, communications, and decision making
> process have stabilized in a manner consistent with other successful
> ASF projects. While incubation status is not necessarily a reflection
> of the completeness or stability of the code, it does indicate that
> the project has yet to be fully endorsed by the ASF.
>
>
>


[ANNOUNCE] Apache Kudu (incubating) 0.9.0 released

2016-06-10 Thread Jean-Daniel Cryans
The Apache Kudu (incubating) team is happy to announce the release of Kudu
0.9.0!

Kudu is an open source storage engine for structured data which supports
low-latency random access together with efficient analytical access
patterns. It is designed within the context of the Apache Hadoop ecosystem
and supports many integrations with other data analytics projects both
inside and outside of the Apache Software Foundation.

This latest version adds basic UPSERT functionality and an improved Apache
Spark Data Source that doesn’t rely on the MapReduce I/O formats. It also
improves Tablet Server restart time as well as write performance under high
load. Finally, Kudu now enforces the specification of a partitioning scheme
for new tables.
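
As a rough sketch, the new upsert operation in the 0.9 Java client looks much like an insert. The master address, table, and column names below are made up for illustration, and this is untested:

```java
import org.kududb.client.*;

public class UpsertExample {
  public static void main(String[] args) throws Exception {
    // Placeholder master address and table; adjust for your cluster.
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      KuduTable table = client.openTable("metrics");
      KuduSession session = client.newSession();

      // An Upsert inserts the row if the key is absent, or updates it if present.
      Upsert upsert = table.newUpsert();
      PartialRow row = upsert.getRow();
      row.addLong("id", 1L);
      row.addString("value", "hello");
      session.apply(upsert);
      session.flush();
    } finally {
      client.shutdown();
    }
  }
}
```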

Download it here: http://getkudu.io/releases/0.9.0/

Regards,

The Apache Kudu (incubating) team

===

Apache Kudu (incubating) is an effort undergoing incubation at The Apache
Software
Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is
required of all newly accepted projects until a further review
indicates that the infrastructure, communications, and decision making
process have stabilized in a manner consistent with other successful
ASF projects. While incubation status is not necessarily a reflection
of the completeness or stability of the code, it does indicate that
the project has yet to be fully endorsed by the ASF.


Re: CDH, Impala, Impala KUDU Versioning

2016-06-08 Thread Jean-Daniel Cryans
That's actually a question for Cloudera, so moving user@ to bcc and adding
cdh-user@ back.

On Wed, Jun 8, 2016 at 5:13 PM, Ana Krasteva  wrote:

>
> ​Forwarding this to Kudu user group.
>
> ​
>
> On Wed, Jun 8, 2016 at 5:09 PM, Pavan Kulkarni 
> wrote:
>
>> Hi,
>>
>>I am using the latest CDH available 5.7 (
>> https://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.7/RPMS/x86_64/)
>> and also trying to setup
>> Impala-Kudu (
>> http://archive.cloudera.com/beta/impala-kudu/redhat/6/x86_64/impala-kudu/0.8/RPMS/x86_64/
>> )
>> and
>> KUDU (
>> http://archive.cloudera.com/beta/kudu/redhat/6/x86_64/kudu/0/RPMS/x86_64/
>> )
>>
>> but I see that latest Impala-KUDU is built with CDH 5.8 and latest KUDU
>> is built with CDH 5.4
>>
>> *1. Can I know what exactly is the process/reasoning behind building
>>  Impala-KUDU and KUDU with different CDH?*
>>
>
It's not "built" with CDH; it's more like "built using the CDH build tools"
as of a certain version. It's really an implementation detail, and it has
no impact whatsoever on the version of CDH you're actually using, as long
as it's 5.4.3 or later.


> *2. How do I decide on which is the stable CDH to use since different
>> rpm's are built with different CDH ?*
>>
>
See 1, and read
http://www.cloudera.com/documentation/betas/kudu/latest/topics/kudu_installation.html#concept_cmn_ngq_dt


>
>> Any response is really appreciated.
>>
>> -- Thanks
>> Pavan Kulkarni
>>
>> --
>>
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "CDH Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to cdh-user+unsubscr...@cloudera.org.
>> For more options, visit https://groups.google.com/a/cloudera.org/d/optout
>> .
>>
>
>


Re: Spark on Kudu

2016-06-08 Thread Jean-Daniel Cryans
What's in this doc is what's gonna get released:
https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark

J-D

On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim  wrote:

> Will this be documented with examples once 0.9.0 comes out?
>
> Thanks,
> Ben
>
>
> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans 
> wrote:
>
> It will be in 0.9.0.
>
> J-D
>
> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim  wrote:
>
>> Hi Chris,
>>
>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>
>> Thanks,
>> Ben
>>
>>
>> On May 18, 2016, at 9:01 AM, Chris George 
>> wrote:
>>
>> There is some code in review that needs some more refinement.
>> It will allow upsert/insert from a dataframe using the datasource api. It
>> will also allow the creation and deletion of tables from a dataframe
>> http://gerrit.cloudera.org:8080/#/c/2992/
>>
>> Example usages will look something like:
>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>
>> -Chris George
>>
>>
>> On 5/18/16, 9:45 AM, "Benjamin Kim"  wrote:
>>
>> Can someone tell me what the state is of this Spark work?
>>
>> Also, does anyone have any sample code on how to update/insert data in
>> Kudu using DataFrames?
>>
>> Thanks,
>> Ben
>>
>>
>> On Apr 13, 2016, at 8:22 AM, Chris George 
>> wrote:
>>
>> SparkSQL cannot support these types of statements, but we may be able to
>> implement similar functionality through the API.
>> -Chris
>>
>> On 4/12/16, 5:19 PM, "Benjamin Kim"  wrote:
>>
>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it
>> were to be implemented.
>>
>> MERGE INTO table_name USING table_reference ON (condition)
>>  WHEN MATCHED THEN
>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>  WHEN NOT MATCHED THEN
>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>
>> Cheers,
>> Ben
>>
>> On Apr 11, 2016, at 12:21 PM, Chris George 
>> wrote:
>>
>> I have a wip kuduRDD that I made a few months ago. I pushed it into
>> gerrit if you want to take a look.
>> http://gerrit.cloudera.org:8080/#/c/2754/
>> It does pushdown predicates which the existing input formatter based rdd
>> does not.
>>
>> Within the next two weeks I’m planning to implement a datasource for
>> spark that will have pushdown predicates and insertion/update functionality
>> (need to look more at cassandra and the hbase datasource for best way to do
>> this) I agree that server side upsert would be helpful.
>> Having a datasource would give us useful data frames and also make spark
>> sql usable for kudu.
>>
>> My reasoning for having a spark datasource and not using Impala is: 1. We
>> have had trouble getting impala to run fast with high concurrency when
>> compared to spark 2. We interact with datasources which do not integrate
>> with impala. 3. We have custom sql query planners for extended sql
>> functionality.
>>
>> -Chris George
>>
>>
>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans"  wrote:
>>
>> You guys make a convincing point, although on the upsert side we'll need
>> more support from the servers. Right now all you can do is an INSERT then,
>> if you get a dup key, do an UPDATE. I guess we could at least add an API on
>> the client side that would manage it, but it wouldn't be atomic.
>>
>> J-D
>>
>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra 
>> wrote:
>>
>>> It's pretty simple, actually.  I need to support versioned datasets in a
>>> Spark SQL environment.  Instead of a hack on top of a Parquet data store,
>>> I'm hoping (among other reasons) to be able to use Kudu's write and
>>> timestamp-based read operations to support not only appending data, but
>>> also updating existing data, and even some schema migration.  The most
>>> typical use case is a dataset that is updated periodically (e.g., weekly or
>>> monthly) in which the the preliminary data in the previous window (week or
>>> month) is updated with values that are expected to remain unchanged from
>>> then on, and a new set of preliminary values for the current window need to
>>> be added/appended.
>>>
>>> Using Kudu's Java API and developing additional functionality on top of
>>> what Kudu has to offer isn't too much to ask, but the ease of integratio

Re: Kudu installation

2016-06-07 Thread Jean-Daniel Cryans
On Tue, Jun 7, 2016 at 2:14 PM, Roberta Marton 
wrote:

> I copied it from the installation guide:
>
> create TABLE my_first_table
> (
> id BIGINT,
> name STRING
> )
> DISTRIBUTE BY HASH (id) INTO 16 BUCKETS
> TBLPROPERTIES(
> 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
> 'kudu.table_name' = 'my_first_table',
> 'kudu.master_addresses' = 'http://:8051/masters',
>

You need to specify only the address and RPC port of your master, not the
web UI's URL. It should be 'kudu.master_addresses' = ':7051'

Where was it specified with the http URL?


> 'kudu.key_columns' = 'id'
> );
>
>
>Roberta
>
> -Original Message-
> From: Adar Dembo [mailto:a...@cloudera.com]
> Sent: Tuesday, June 7, 2016 2:08 PM
> To: user@kudu.incubator.apache.org
> Subject: Re: Kudu installation
>
> Hmm, could you share your Impala CREATE TABLE statement?
>
>
> On Tue, Jun 7, 2016 at 2:03 PM, Roberta Marton 
> wrote:
> > Thanks!  I will check out the examples.
> >
> > I have installed both Kudu and Impala-kudu and am trying to create a
> > table through Impala.
> > I verified that Impala-kudu is setup by running the select statement
> > suggested on the installation page.
> >
> > When  I try to create a table, it is complaining that it can't find my
> > master.
> >
> > ERROR:
> > ImpalaRuntimeException: Error creating Kudu table CAUSED BY:
> > NonRecoverableException: Couldn't find a valid master in
> > ([http://:8051/masters]), exceptions:
> > [org.kududb.client.NonRecoverableException: Couldn't resolve this
> > master's address [http://:8051/masters]]
> >
> > I can go to http://:8051/masters  and see master details.
> > Am I missing something in the configuration?
> >
> >Roberta
> >
> > -Original Message-
> > From: Adar Dembo [mailto:a...@cloudera.com]
> > Sent: Tuesday, June 7, 2016 1:32 PM
> > To: user@kudu.incubator.apache.org
> > Cc: u...@kudu.apache.org
> > Subject: Re: Kudu installation
> >
> > Hi Roberta,
> >
> > For the foreseeable future you still need to download the special
> > Impala Kudu.
> >
> > As for C++ API examples, check out this:
> > https://git-wip-us.apache.org/repos/asf?p=incubator-kudu.git;a=blob;f=
> > src/kudu/client/samples/sample.cc;h=43678221e30c5b44b06eae3298290192c5
> > ae42e9;hb=refs/heads/master
> >
> > On Tue, Jun 7, 2016 at 1:25 PM, Roberta Marton
> > 
> > wrote:
> >> I am installing apache kudu to try it out.
> >>
> >> The installation instructions state the following:
> >>
> >>
> >>
> >> Apache Kudu has tight integration with Apache Impala (incubating),
> >> allowing you to use Impala to insert, query, update, and delete data
> >> from Kudu tablets using Impala's SQL syntax, as an alternative to
> >> using the Kudu APIs to build a custom Kudu application. In addition,
> >> you can use JDBC or ODBC to connect existing or new applications
> >> written in any language, framework, or business intelligence tool to
> >> your Kudu data, using Impala as the broker.
> >>
> >> This integration relies on features that released versions of Impala
> >> do not have yet, as of Impala 2.3, which is expected to ship in CDH
> >> 5.5. In the interim, you need to install a fork of Impala, which this
> >> document will refer to as Impala_Kudu.
> >>
> >>
> >>
> >> I have CDH 5.7 installed, does it contain the necessary changes for
> >> Impala or do I still need to download impala-kudu.
> >>
> >>
> >>
> >> Also, do you have any example on how to use c++ api?
> >>
> >>
> >>
> >>Thanks
> >>
> >> Roberta
>


New blog post on the removal of default partitioning for new tables in 0.9

2016-06-03 Thread Jean-Daniel Cryans
Hi list,

Dan Burkert wrote a new blog post about an important change in the upcoming
0.9 version (currently voting on RC1), see:
http://getkudu.io/2016/06/02/no-default-partitioning.html

The post contains a link to the discussion that happened on dev@ if you
need more context.

Cheers,

J-D


Re: Spark on Kudu

2016-05-28 Thread Jean-Daniel Cryans
It will be in 0.9.0.

J-D

On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim  wrote:

> Hi Chris,
>
> Will all this effort be rolled into 0.9.0 and be ready for use?
>
> Thanks,
> Ben
>
>
> On May 18, 2016, at 9:01 AM, Chris George 
> wrote:
>
> There is some code in review that needs some more refinement.
> It will allow upsert/insert from a dataframe using the datasource api. It
> will also allow the creation and deletion of tables from a dataframe
> http://gerrit.cloudera.org:8080/#/c/2992/
>
> Example usages will look something like:
> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>
> -Chris George
>
>
> On 5/18/16, 9:45 AM, "Benjamin Kim"  wrote:
>
> Can someone tell me what the state is of this Spark work?
>
> Also, does anyone have any sample code on how to update/insert data in
> Kudu using DataFrames?
>
> Thanks,
> Ben
>
>
> On Apr 13, 2016, at 8:22 AM, Chris George 
> wrote:
>
> SparkSQL cannot support these types of statements, but we may be able to
> implement similar functionality through the api.
> -Chris
>
> On 4/12/16, 5:19 PM, "Benjamin Kim"  wrote:
>
> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it
> were to be implemented.
>
> MERGE INTO table_name USING table_reference ON (condition)
>  WHEN MATCHED THEN
>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>  WHEN NOT MATCHED THEN
>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>
> Cheers,
> Ben
>
> On Apr 11, 2016, at 12:21 PM, Chris George 
> wrote:
>
> I have a wip kuduRDD that I made a few months ago. I pushed it into gerrit
> if you want to take a look. http://gerrit.cloudera.org:8080/#/c/2754/
> It does pushdown predicates which the existing input formatter based rdd
> does not.
>
> Within the next two weeks I’m planning to implement a datasource for spark
> that will have pushdown predicates and insertion/update functionality (need
> to look more at cassandra and the hbase datasource for best way to do this)
> I agree that server side upsert would be helpful.
> Having a datasource would give us useful data frames and also make spark
> sql usable for kudu.
>
> My reasoning for having a Spark datasource and not using Impala is:
> 1. We have had trouble getting Impala to run fast with high concurrency when
> compared to Spark.
> 2. We interact with datasources which do not integrate with Impala.
> 3. We have custom SQL query planners for extended SQL functionality.
>
> -Chris George
>
>
> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans"  wrote:
>
> You guys make a convincing point, although on the upsert side we'll need
> more support from the servers. Right now all you can do is an INSERT then,
> if you get a dup key, do an UPDATE. I guess we could at least add an API on
> the client side that would manage it, but it wouldn't be atomic.
>
> J-D
>
> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra 
> wrote:
>
>> It's pretty simple, actually.  I need to support versioned datasets in a
>> Spark SQL environment.  Instead of a hack on top of a Parquet data store,
>> I'm hoping (among other reasons) to be able to use Kudu's write and
>> timestamp-based read operations to support not only appending data, but
>> also updating existing data, and even some schema migration.  The most
>> typical use case is a dataset that is updated periodically (e.g., weekly or
>> monthly) in which the preliminary data in the previous window (week or
>> month) is updated with values that are expected to remain unchanged from
>> then on, and a new set of preliminary values for the current window need to
>> be added/appended.
>>
>> Using Kudu's Java API and developing additional functionality on top of
>> what Kudu has to offer isn't too much to ask, but the ease of integration
>> with Spark SQL will gate how quickly we would move to using Kudu and how
>> seriously we'd look at alternatives before making that decision.
>>
>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans 
>> wrote:
>>
>>> Mark,
>>>
>>> Thanks for taking some time to reply in this thread, glad it caught the
>>> attention of other folks!
>>>
>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra 
>>> wrote:
>>>
>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>
>>>>
>>>> I care about insert into Kudu with Spark SQL.  I'm currently delaying a
>>>> refactoring of some Spark SQL-oriented insert functionality while trying to
>>>> evalu

Re: CM Parcel Installs

2016-05-24 Thread Jean-Daniel Cryans
Hi Jordan,

Please use this forum:
http://community.cloudera.com/t5/Beta-Releases-Apache-Kudu/bd-p/Beta

Thx,

J-D

On Tue, May 24, 2016 at 1:04 PM, Jordan Birdsell <
jordan.birdsell.k...@statefarm.com> wrote:

> Hey all,
>
> Should questions/issues with the Kudu Parcels/CM be sent through this list
> or is there some alternate forum for those communications?
>
> Thanks,
> Jordan Birdsell
>
>
>


Re: best practices to remove/retire data

2016-05-12 Thread Jean-Daniel Cryans
It should be fully implemented for 1.0, which we're aiming for in August. You
can follow this jira: https://issues.apache.org/jira/browse/KUDU-1306

J-D

On Thu, May 12, 2016 at 10:10 AM, Sand Stone  wrote:

> Thanks J-D.
>
> Any idea when the partition level deletion will be implemented?
>
> On Thu, May 12, 2016 at 8:24 AM, Jean-Daniel Cryans 
> wrote:
>
>> Hi,
>>
>> Right now this use case is more difficult than it needs to be. In your
>> previous thread, "Partition and Split rows", we talked about non-covering
>> range partitions, and this is something that would help your use case a lot.
>> Basically, you could create partitions that cover full days, and everyday
>> you could delete the old partitions while creating the next day's. Deleting
>> a partition is really quick and efficient compared to manually deleting
>> individual rows.
>>
>> Until this is available I'd do this with multiple tables, but it's a mess
>> to handle as you described.
>>
>> Hope this helps,
>>
>> J-D
>>
>> On Thu, May 12, 2016 at 8:16 AM, Sand Stone 
>> wrote:
>>
>>> Hi. Presumably I need to write a program to delete the unwanted rows,
>>> say, remove all data older than 3 days, while the table is still ingesting
>>> new data.
>>>
>>> How well will this perform for large tables? Both deletion and ingestion
>>> wise.
>>>
>>> Or for this specific case that I retire data by day, I should create a
>>> new table per day. However then the users have to be aware of the table
>>> naming scheme somehow. If a retention policy is changed, all the client-side
>>> code might have to change (sure we can have one level of indirection to
>>> minimize the pain).
>>>
>>> Thanks.
>>>
>>
>>
>


Re: best practices to remove/retire data

2016-05-12 Thread Jean-Daniel Cryans
Hi,

Right now this use case is more difficult than it needs to be. In your
previous thread, "Partition and Split rows", we talked about non-covering
range partitions, and this is something that would help your use case a lot.
Basically, you could create partitions that cover full days, and everyday
you could delete the old partitions while creating the next day's. Deleting
a partition is really quick and efficient compared to manually deleting
individual rows.

Until this is available I'd do this with multiple tables, but it's a mess to
handle as you described.

Hope this helps,

J-D
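
To make the partition-per-day suggestion concrete, here is a hypothetical
sketch using the ALTER TABLE syntax that non-covering range partitions
(KUDU-1306) would eventually enable. This syntax did not exist at the time
of this thread, and the table name and dates are placeholders.

```sql
-- Hypothetical day-based retention with non-covering range partitions.
-- Dropping a whole day's partition is far cheaper than deleting the
-- same rows one by one while the table is still ingesting.

-- Create tomorrow's partition ahead of the data arriving:
ALTER TABLE events ADD RANGE PARTITION '2016-05-13' <= VALUES < '2016-05-14';

-- Retire a day that has aged out of the 3-day window:
ALTER TABLE events DROP RANGE PARTITION '2016-05-09' <= VALUES < '2016-05-10';
```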

On Thu, May 12, 2016 at 8:16 AM, Sand Stone  wrote:

> Hi. Presumably I need to write a program to delete the unwanted rows, say,
> remove all data older than 3 days, while the table is still ingesting new
> data.
>
> How well will this perform for large tables? Both deletion and ingestion
> wise.
>
> Or for this specific case that I retire data by day, I should create a new
> table per day. However then the users have to be aware of the table naming
> scheme somehow. If a retention policy is changed, all the client-side code
> might have to change (sure we can have one level of indirection to minimize
> the pain).
>
> Thanks.
>


Re: Partition and Split rows

2016-05-06 Thread Jean-Daniel Cryans
We do have non-covering range partitions coming in the next few months,
here's the design (in review):
http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md

The "Background & Motivation" section should give you a good idea of why
I'm mentioning this.

Meanwhile, if you don't need row locality, using hash partitioning could be
good enough.

J-D
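
As an illustration of the hash-partitioning alternative, the Impala_Kudu DDL
of that era looked roughly like this. This is a sketch: the table, column,
and host names are placeholders, and 16 buckets is an arbitrary choice.

```sql
-- Hash partitioning gives data distribution up front without having to
-- pre-plan range split rows; the trade-off is losing row locality.
CREATE TABLE metrics (
  host STRING,
  ts BIGINT,
  value DOUBLE
)
DISTRIBUTE BY HASH (host, ts) INTO 16 BUCKETS
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'metrics',
  'kudu.key_columns' = 'host,ts',
  'kudu.master_addresses' = 'master-host:7051'
);
```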

On Fri, May 6, 2016 at 3:53 PM, Sand Stone  wrote:

> Makes sense.
>
> Yeah it would be cool if users could specify/control the split rows after
> the table is created. Now, I have to "think ahead" to pre-create the range
> buckets.
>
> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans 
> wrote:
>
>> You will only get 1 tablet and no data distribution, which is bad.
>>
>> That's also how HBase works, but it will split regions as you insert data
>> and eventually you'll get some data distribution even if it doesn't start
>> in an ideal situation. Tablet splitting will come later for Kudu.
>>
>> J-D
>>
>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone 
>> wrote:
>>
>>> One more question, how does the range partition work if I don't specify
>>> the split rows?
>>>
>>> Thanks!
>>>
>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone 
>>> wrote:
>>>
>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>
>>>> I was just reading the Java API, CreateTableOptions.java; it's unclear
>>>> how the range partition column names are associated with the partial-row
>>>> params in the addSplitRow API.
>>>>
>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>> mstanleyjo...@cloudera.com> wrote:
>>>>
>>>>> Hi Sand,
>>>>>
>>>>> Please have a look at
>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>> and see if it is helpful to you.
>>>>>
>>>>> Thanks,
>>>>> Misty
>>>>>
>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone 
>>>>> wrote:
>>>>>
>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know from
>>>>>> some docs, this is currently for pre-creating the table. I am researching
>>>>>> how to partition (hash+range) some time series test data.
>>>>>>
>>>>>> Is there an example, or notes somewhere I could read up on?
>>>>>>
>>>>>> Thanks much.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Partition and Split rows

2016-05-06 Thread Jean-Daniel Cryans
You will only get 1 tablet and no data distribution, which is bad.

That's also how HBase works, but it will split regions as you insert data
and eventually you'll get some data distribution even if it doesn't start
in an ideal situation. Tablet splitting will come later for Kudu.

J-D

On Fri, May 6, 2016 at 3:42 PM, Sand Stone  wrote:

> One more question, how does the range partition work if I don't specify
> the split rows?
>
> Thanks!
>
> On Fri, May 6, 2016 at 3:37 PM, Sand Stone  wrote:
>
>> Thanks, Misty. The "advanced" impala example helped.
>>
>> I was just reading the Java API, CreateTableOptions.java; it's unclear how
>> the range partition column names are associated with the partial-row params
>> in the addSplitRow API.
>>
>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>> mstanleyjo...@cloudera.com> wrote:
>>
>>> Hi Sand,
>>>
>>> Please have a look at
>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>> and see if it is helpful to you.
>>>
>>> Thanks,
>>> Misty
>>>
>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone 
>>> wrote:
>>>
 Hi, I am new to Kudu. I wonder how the split rows work. I know from
 some docs, this is currently for pre-creating the table. I am researching
 how to partition (hash+range) some time series test data.

 Is there an example, or notes somewhere I could read up on?

 Thanks much.

>>>
>>>
>>
>


Re: Weekly update 4/25

2016-04-26 Thread Jean-Daniel Cryans
Oh I see, so this is in order to comply with asks such as "making sure that
data for some user/customer is 100% deleted"? We'll still have the problem
where we don't want to rewrite all the base data files (GBs/TBs) to clean
up KBs of data, although since a single row is always only part of one row
set, it means it's at most 64MB that you'd be rewriting.

BTW is it ok if the data isn't immediately deleted? How long is it
acceptable to wait before it happens?

J-D

On Tue, Apr 26, 2016 at 8:04 AM, Jordan Birdsell <
jordan.birdsell.k...@statefarm.com> wrote:

> Correct.  As for the “latest version”, if a row is deleted in the latest
> version then removing the old versions where it existed is exactly what
> we’re looking to do.  Basically, we need a way to physically get rid of
> select rows (or data within a column for that matter) and all versions of
> that row or column data.
>
>
>
> *From:* Jean-Daniel Cryans [mailto:jdcry...@apache.org]
> *Sent:* Tuesday, April 26, 2016 10:56 AM
> *To:* user@kudu.incubator.apache.org
> *Subject:* Re: Weekly update 4/25
>
>
>
> Hi Jordan,
>
>
>
> In other words, you'd like to tag specific rows to be excluded from the
> default data history retention?
>
>
>
> Also, keep in mind that this improvement is about removing old versions of
> the data, it will not delete the latest version. If you are used to HBase,
> it's like specifying some TTL plus MIN_VERSIONS=1 so it doesn't completely
> age out a row.
>
>
>
> Hope this helps,
>
>
>
> J-D
>
>
>
> On Tue, Apr 26, 2016 at 4:29 AM, Jordan Birdsell <
> jordan.birdsell.k...@statefarm.com> wrote:
>
> Hi,
>
>
>
> Regarding row GC,  I see in the design document that the tablet history
> max age will be set at the table level, would it be possible to make this
> something that can be overridden for specific transactions?  We have some
> use cases that would require accelerated removal of data from disk and
> other use cases that would not have the same requirement. Unfortunately,
> these different use cases apply, often times, to the same tables.
>
>
>
> Thanks,
>
> Jordan Birdsell
>
>
>
> *From:* Todd Lipcon [mailto:t...@apache.org]
> *Sent:* Monday, April 25, 2016 1:54 PM
> *To:* d...@kudu.incubator.apache.org; user@kudu.incubator.apache.org
> *Subject:* Weekly update 4/25
>
>
>
> Hey Kudu-ers,
>
>
>
> For the last month and a half, I've been posting weekly summaries of
> community development activity on the Kudu blog. In case you aren't on
> twitter or slack you might not have seen the posts, so I'm going to start
> emailing them to the list as well.
>
>
>
> Here's this week's update:
>
> http://getkudu.io/2016/04/25/weekly-update.html
>
>
>
> Feel free to reply to this mail if you have any questions or would like to
> get involved in development.
>
>
>
> -Todd
>
>
>


Re: Weekly update 4/25

2016-04-26 Thread Jean-Daniel Cryans
Hi Jordan,

In other words, you'd like to tag specific rows to be excluded from the
default data history retention?

Also, keep in mind that this improvement is about removing old versions of
the data, it will not delete the latest version. If you are used to HBase,
it's like specifying some TTL plus MIN_VERSIONS=1 so it doesn't completely
age out a row.

Hope this helps,

J-D

On Tue, Apr 26, 2016 at 4:29 AM, Jordan Birdsell <
jordan.birdsell.k...@statefarm.com> wrote:

> Hi,
>
>
>
> Regarding row GC,  I see in the design document that the tablet history
> max age will be set at the table level, would it be possible to make this
> something that can be overridden for specific transactions?  We have some
> use cases that would require accelerated removal of data from disk and
> other use cases that would not have the same requirement. Unfortunately,
> these different use cases apply, often times, to the same tables.
>
>
>
> Thanks,
>
> Jordan Birdsell
>
>
>
> *From:* Todd Lipcon [mailto:t...@apache.org]
> *Sent:* Monday, April 25, 2016 1:54 PM
> *To:* d...@kudu.incubator.apache.org; user@kudu.incubator.apache.org
> *Subject:* Weekly update 4/25
>
>
>
> Hey Kudu-ers,
>
>
>
> For the last month and a half, I've been posting weekly summaries of
> community development activity on the Kudu blog. In case you aren't on
> twitter or slack you might not have seen the posts, so I'm going to start
> emailing them to the list as well.
>
>
>
> Here's this week's update:
>
> http://getkudu.io/2016/04/25/weekly-update.html
>
>
>
> Feel free to reply to this mail if you have any questions or would like to
> get involved in development.
>
>
>
> -Todd
>


Re: Exception at inserting big amount of data

2016-04-26 Thread Jean-Daniel Cryans
Hi Juan Pablo,

The error basically means that the client didn't hear from the server after
sending the data, even after retrying a few times, and reached the default
10-second timeout. Can you run your insert again and then capture the
output of this command?

curl -s http://10.0.6.157:8050/metrics | gzip - > metrics.gz

Then post that file somewhere we can download it. If you have more than one
tablet server, it might be a different node; basically I want the one that
ends up listed in this exception on the right:

Caused by: org.kududb.client.ConnectionResetException: [Peer
f7e2936b040d4c58b52d90ae50ad6d5a] Connection reset on [id: 0x323019c2, /
10.0.6.6:58930 :> /10.0.6.157:7050]

Also, can we see the logs from that node around 10AM on 16/04/26?

Finally, I'm surprised you're even able to create your table if you only
have one tablet server and a replication of 2 (unless you meant to say that
your master node has both a master and a tablet server).

J-D

On Tue, Apr 26, 2016 at 7:06 AM, Juan Pablo Briganti <
juan.briga...@globant.com> wrote:

> Hi!!
>
>   We are facing some errors trying to insert multiple records into the
> database through the Java API.
>   We have a simple cluster composed of 1 master node and 1 slave node. 1
> table with 1 bigint primary key and 1 string. The String length is more or
> less 30 characters. The table has a replication of 2.
>   We are using 0.8 on both the cluster and java API and performing Manual
> Flush mode with a maximum of 10.000 rows per flush (it is only a maximum,
> not all flush operations insert the same amount of data).
>   After 2 or 3 successful flushes (more or less 6000 records) we are
> receiving the error attached to this email.
>   We started receiving this error a few weeks ago, when we were using the
> 0.7.1 version.
>   Did this happen to other people? any ideas on what can be wrong? Any
> help would be appreciated.
>   If you need more info let me know.
>
> Thanks, Juan Pablo.
>
> --
> *Juan Pablo Briganti* | Data Architect
> *GLOBANT* | AR: +54 11 4109 1700 ext. 19508 | US: +1 877 215 5230 ext.
> 19508 |
>
>
>


Re: impala got duplicate fields after compute stats on kudu table

2016-04-18 Thread Jean-Daniel Cryans
Hi Darren,

Is this with Impala Kudu 0.8.0 or a previous version?

Thx,

J-D

On Mon, Apr 18, 2016 at 7:29 AM, Darren Hoo  wrote:

> 1. Create a Kudu table to test:
>
>
> create table t2 (
>   id  INT,
>   cid INT
> )
> TBLPROPERTIES(
>   'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
>   'kudu.table_name' = 't2',
>   'kudu.key_columns' = 'id',
>   'kudu.master_addresses' = 'master:7051'
> );
>
>
>
> 2. Each time `compute stats` is run, the fields get doubled:
>
>
> compute table stats t2;
>
>
>
> desc t2;
>
> Query: describe t2
>
> +------+------+---------+
> | name | type | comment |
> +------+------+---------+
> | id   | int  |         |
> | cid  | int  |         |
> | id   | int  |         |
> | cid  | int  |         |
> +------+------+---------+
>
>
>
> the workaround is to invalidate the metadata:
>
>
> invalidate metadata t2;
>
>
>
> this is kudu 0.8.0 on cdh5.7
>


Re: Spark-kudu: java.lang.IllegalArgumentException: Got out-of-order primary key column

2016-04-18 Thread Jean-Daniel Cryans
Hi Darren,

That particular error means that the schema was created with key columns
specified after non-key columns, a current limitation in Kudu.  It seems
like internally Spark is creating a schema with that kind of setup?

J-D
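
To illustrate the constraint behind this error: Kudu requires all primary key
columns to be the leading columns of the schema, declared in key order. A
sketch of the same idea in Impala DDL follows (table, column, and host names
are hypothetical, not from the reported job):

```sql
-- Would trigger "out-of-order primary key column": the key column cid
-- is declared after a non-key column.
--   CREATE TABLE bad (val STRING, cid BIGINT)
--     ... 'kudu.key_columns' = 'cid' ...

-- Correct: key columns come first, in key order.
CREATE TABLE good (
  cid BIGINT,
  val STRING
)
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'good',
  'kudu.key_columns' = 'cid',
  'kudu.master_addresses' = 'master-host:7051'
);
```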

On Mon, Apr 18, 2016 at 7:20 AM, Darren Hoo  wrote:

> what does this Exception mean?
>
> I just do an inner join of two Kudu tables, and I got this:
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent
> failure: Lost task 0.3 in stage 2.0 (TID 22, slave12):
> java.lang.IllegalArgumentException: Got out-of-order primary key column:
> Column name: cid, type: int64
>
> at org.kududb.Schema.(Schema.java:110)
>
> at org.kududb.Schema.(Schema.java:74)
>
> at
> org.kududb.client.AsyncKuduScanner.(AsyncKuduScanner.java:313)
>
> at
> org.kududb.client.KuduScanner$KuduScannerBuilder.build(KuduScanner.java:131)
>
> at
> org.kududb.mapreduce.KuduTableInputFormat$TableRecordReader.initialize(KuduTableInputFormat.java:386)
>
> at
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:158)
>
> at
> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
>
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:64)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
> the same SQL runs OK in Impala, and the same code runs OK on Spark
> 1.5 (CDH 5.5), but fails with Spark 1.6 (CDH 5.7).
>
> What could I be doing wrong?
>
>
>


Cloudera-packaged binaries for 0.8.0 now available, plus a new Impala Kudu

2016-04-15 Thread Jean-Daniel Cryans
(putting my Cloudera hat on for this email)

Hi,

We just released parcels and packages for Kudu 0.8.0, they can be found in
their usual places or by following the installation instructions:
http://www.cloudera.com/documentation/betas/kudu/latest/topics/kudu_installation.html

We also released a new "Impala Kudu" version based on a recent snapshot of
Impala's trunk. The previous releases were based on a fork from last summer
with a lot of code added for Kudu integration, so this new version includes
months' worth of improvements and bug fixes.

If you have questions or issues that are specific to the Cloudera packages,
please use the following forum:
community.cloudera.com/t5/Beta-Releases-Apache-Kudu/bd-p/Beta

Cheers,

J-D


Re: Spark on Kudu

2016-04-11 Thread Jean-Daniel Cryans
You guys make a convincing point, although on the upsert side we'll need
more support from the servers. Right now all you can do is an INSERT then,
if you get a dup key, do an UPDATE. I guess we could at least add an API on
the client side that would manage it, but it wouldn't be atomic.

J-D
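
For comparison, the atomic server-side operation discussed above would let a
single statement replace the insert-then-update dance. A hypothetical sketch
of what such an upsert could look like in SQL (no such statement existed at
the time of this thread; table and column names are placeholders):

```sql
-- Hypothetical: insert the row, or replace the existing row if the
-- primary key is already present, as one atomic server-side operation.
UPSERT INTO my_table (id, name) VALUES (42, 'foo');
```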

On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra 
wrote:

> It's pretty simple, actually.  I need to support versioned datasets in a
> Spark SQL environment.  Instead of a hack on top of a Parquet data store,
> I'm hoping (among other reasons) to be able to use Kudu's write and
> timestamp-based read operations to support not only appending data, but
> also updating existing data, and even some schema migration.  The most
> typical use case is a dataset that is updated periodically (e.g., weekly or
> monthly) in which the preliminary data in the previous window (week or
> month) is updated with values that are expected to remain unchanged from
> then on, and a new set of preliminary values for the current window need to
> be added/appended.
>
> Using Kudu's Java API and developing additional functionality on top of
> what Kudu has to offer isn't too much to ask, but the ease of integration
> with Spark SQL will gate how quickly we would move to using Kudu and how
> seriously we'd look at alternatives before making that decision.
>
> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans 
> wrote:
>
>> Mark,
>>
>> Thanks for taking some time to reply in this thread, glad it caught the
>> attention of other folks!
>>
>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra 
>> wrote:
>>
>>> Do they care about being able to insert into Kudu with SparkSQL
>>>
>>>
>>> I care about insert into Kudu with Spark SQL.  I'm currently delaying a
>>> refactoring of some Spark SQL-oriented insert functionality while trying to
>>> evaluate what to expect from Kudu.  Whether Kudu does a good job supporting
>>> inserts with Spark SQL will be a key consideration as to whether we adopt
>>> Kudu.
>>>
>>
>> I'd like to know more about why SparkSQL inserts are necessary for you. Is
>> it just that you currently do it that way into some database or parquet so
>> with minimal refactoring you'd be able to use Kudu? Would re-writing those
>> SQL lines into Scala and directly use the Java API's KuduSession be too
>> much work?
>>
>> Additionally, what do you expect to gain from using Kudu VS your current
>> solution? If it's not completely clear, I'd love to help you think through
>> it.
>>
>>
>>>
>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <
>>> jdcry...@apache.org> wrote:
>>>
>>>> Yup, starting to get a good idea.
>>>>
>>>> What are your DS folks looking for in terms of functionality related to
>>>> Spark? A SparkSQL integration that's as fully featured as Impala's? Do they
>>>> care about being able to insert into Kudu with SparkSQL or just being able to
>>>> query real fast? Anything more specific to Spark that I'm missing?
>>>>
>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera
>>>> all our resources are committed to making things happen in time, and a more
>>>> fully featured Spark integration isn't in our plans during that period. I'm
>>>> really hoping someone in the community will help with Spark, the same way
>>>> we got a big contribution for the Flume sink.
>>>>
>>>> J-D
>>>>
>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim 
>>>> wrote:
>>>>
>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But,
>>>>> since it’s not “production-ready”, upper management doesn’t want to fully
>>>>> deploy it yet. They just want to keep an eye on it though. Kudu was so 
>>>>> much
>>>>> simpler and easier to use in every aspect compared to HBase. Impala was
>>>>> great for the report writers and analysts to experiment with for the short
>>>>> time it was up. But, once again, the only blocker was the lack of Spark
>>>>> support for our Data Developers/Scientists. So, production-level data
>>>>> population won’t happen until then.
>>>>>
>>>>> I hope this helps you get an idea where I am coming from…
>>>>>
>>>>> Cheers,
>>>>> Ben
>>>>>
>>>>>
>>>>> On Apr 10, 2016, 

[ANNOUNCE] Apache Kudu (incubating) 0.8.0 released

2016-04-11 Thread Jean-Daniel Cryans
The Apache Kudu (incubating) team is happy to announce the release of Kudu
0.8.0!

Kudu is an open source storage engine for structured data which supports
low-latency random access together with efficient analytical access
patterns. It is designed within the context of the Apache Hadoop ecosystem
and supports many integrations with other data analytics projects both
inside and outside of the Apache Software Foundation.

This latest version adds a sink for Apache Flume, partition pruning in the
C++ client and related improvements on the server-side, better
error-handling in Java client, plus many other improvements and bug fixes.

Download it here: http://getkudu.io/releases/0.8.0/

Regards,

The Apache Kudu (incubating) team

===

Apache Kudu (incubating) is an effort undergoing incubation at The Apache
Software
Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is
required of all newly accepted projects until a further review
indicates that the infrastructure, communications, and decision making
process have stabilized in a manner consistent with other successful
ASF projects. While incubation status is not necessarily a reflection
of the completeness or stability of the code, it does indicate that
the project has yet to be fully endorsed by the ASF.


Re: Spark on Kudu

2016-04-11 Thread Jean-Daniel Cryans
Ben,

Thanks for the additional information. You know, I was expecting that
querying would be the most important part and writing into Kudu was
secondary since it can easily be done with the Java API, but you guys are
proving me wrong.

I'm starting to think we should host a Spark + Kudu hackathon here in the
Bay Area. Bringing experts together from both sides might unlock some
potential. We did that with Drill and it was successful:
https://issues.apache.org/jira/browse/DRILL-4241

J-D

On Sun, Apr 10, 2016 at 1:03 PM, Benjamin Kim  wrote:

> J-D,
>
> Priority is data population of tables using DataFrames. That’s what I heard
> most. It is the same with HBase. But I bet once this is taken care of,
> the fast querying part would follow because the data is now in Kudu. If
> SparkSQL integration is there, that would simplify things even more. That
> wouldn’t be bad to have.
>
> Cheers,
> Ben
>
>
> On Apr 10, 2016, at 12:23 PM, Jean-Daniel Cryans 
> wrote:
>
> Yup, starting to get a good idea.
>
> What are your DS folks looking for in terms of functionality related to
> Spark? A SparkSQL integration that's as fully featured as Impala's? Do they
> care about being able to insert into Kudu with SparkSQL or just being able to
> query real fast? Anything more specific to Spark that I'm missing?
>
> FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera all
> our resources are committed to making things happen in time, and a more
> fully featured Spark integration isn't in our plans during that period. I'm
> really hoping someone in the community will help with Spark, the same way
> we got a big contribution for the Flume sink.
>
> J-D
>
> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim  wrote:
>
>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But, since
>> it’s not “production-ready”, upper management doesn’t want to fully deploy
>> it yet. They just want to keep an eye on it though. Kudu was so much
>> simpler and easier to use in every aspect compared to HBase. Impala was
>> great for the report writers and analysts to experiment with for the short
>> time it was up. But, once again, the only blocker was the lack of Spark
>> support for our Data Developers/Scientists. So, production-level data
>> population won’t happen until then.
>>
>> I hope this helps you get an idea where I am coming from…
>>
>> Cheers,
>> Ben
>>
>>
>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans 
>> wrote:
>>
>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim 
>> wrote:
>>
>>> J-D,
>>>
>>> The main thing I hear is that Cassandra is being used as an updatable hot
>>> data store to ensure that duplicates are taken care of and idempotency is
>>> maintained. Whether data was directly retrieved from Cassandra for
>>> analytics, reports, or searches, it was not clear what its main use was.
>>> Some also just used it for a staging area to populate downstream
>>> tables in parquet format. The last thing I heard was that CQL was terrible,
>>> so that rules out much use of direct queries against it.
>>>
>>
>> I'm no C* expert, but I don't think CQL is meant for real analytics, just
>> ease of use instead of plainly using the APIs. Even then, Kudu should beat
>> it easily on big scans. Same for HBase. We've done benchmarks against the
>> latter, not the former.
>>
>>
>>>
>>> As for our company, we have been looking for an updatable data store for
>>> a long time that can be quickly queried directly either using Spark SQL or
>>> Impala or some other SQL engine and still handle TB or PB of data without
>>> performance degradation and many configuration headaches. For now, we are
>>> using HBase to take on this role with Phoenix as a fast way to directly
>>> query the data. I can see Kudu as the best way to fill this gap easily,
>>> especially being the closest thing to other relational databases out there
>>> in familiarity for the many SQL analytics people in our company. The other
>>> alternative would be to go with AWS Redshift for the same reasons, but it
>>> would come at a cost, of course. If we went with either solutions, Kudu or
>>> Redshift, it would get rid of the need to extract from HBase to parquet
>>> tables or export to PostgreSQL to support more of the SQL language used by
>>> analysts or the reporting software we use.
>>>
>>
>> Ok, the usual then *smile*. Looks like we're not too far off with Kudu.
>> Have you folks tried Kudu with Impala yet with those use cases?

Re: Spark on Kudu

2016-04-11 Thread Jean-Daniel Cryans
Mark,

Thanks for taking some time to reply in this thread, glad it caught the
attention of other folks!

On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra 
wrote:

> Do they care about being able to insert into Kudu with SparkSQL
>
>
> I care about inserting into Kudu with Spark SQL.  I'm currently delaying a
> refactoring of some Spark SQL-oriented insert functionality while trying to
> evaluate what to expect from Kudu.  Whether Kudu does a good job supporting
> inserts with Spark SQL will be a key consideration as to whether we adopt
> Kudu.
>

I'd like to know more about why SparkSQL inserts are necessary for you. Is
it just that you currently do it that way into some database or Parquet, so
that with minimal refactoring you'd be able to use Kudu? Would re-writing
those SQL lines into Scala and directly using the Java API's KuduSession be
too much work?
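For what it's worth, a direct KuduSession insert through the Java client is
quite compact. A minimal sketch — the master address, table name, and column
names below are illustrative assumptions, not anything from this thread:

```java
import org.kududb.client.Insert;
import org.kududb.client.KuduClient;
import org.kududb.client.KuduSession;
import org.kududb.client.KuduTable;
import org.kududb.client.PartialRow;

public class KuduInsertSketch {
  public static void main(String[] args) throws Exception {
    // Master address, table, and column names are hypothetical.
    KuduClient client =
        new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      KuduTable table = client.openTable("web_logs");
      KuduSession session = client.newSession();

      Insert insert = table.newInsert();
      PartialRow row = insert.getRow();
      row.addInt("id", 1);
      row.addString("url", "http://example.com");
      // Default flush mode is AUTO_FLUSH_SYNC, so apply() sends the row
      // immediately; switch to MANUAL_FLUSH to batch and flush() yourself.
      session.apply(insert);

      session.close();
    } finally {
      client.shutdown();
    }
  }
}
```

Porting an existing SparkSQL INSERT path would mostly mean mapping each row of
the DataFrame/RDD onto a `table.newInsert()` like the one above.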

Additionally, what do you expect to gain from using Kudu VS your current
solution? If it's not completely clear, I'd love to help you think through
it.


>
> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans 
> wrote:
>
>> Yup, starting to get a good idea.
>>
>> What are your DS folks looking for in terms of functionality related to
>> Spark? A SparkSQL integration that's as fully featured as Impala's? Do they
>> care about being able to insert into Kudu with SparkSQL or just being able to
>> query real fast? Anything more specific to Spark that I'm missing?
>>
>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera all
>> our resources are committed to making things happen in time, and a more
>> fully featured Spark integration isn't in our plans during that period. I'm
>> really hoping someone in the community will help with Spark, the same way
>> we got a big contribution for the Flume sink.
>>
>> J-D
>>
>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim 
>> wrote:
>>
>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But, since
>>> it’s not “production-ready”, upper management doesn’t want to fully deploy
>>> it yet. They just want to keep an eye on it though. Kudu was so much
>>> simpler and easier to use in every aspect compared to HBase. Impala was
>>> great for the report writers and analysts to experiment with for the short
>>> time it was up. But, once again, the only blocker was the lack of Spark
>>> support for our Data Developers/Scientists. So, production-level data
>>> population won’t happen until then.
>>>
>>> I hope this helps you get an idea where I am coming from…
>>>
>>> Cheers,
>>> Ben
>>>
>>>
>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans 
>>> wrote:
>>>
>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim 
>>> wrote:
>>>
>>>> J-D,
>>>>
>>>> The main thing I hear is that Cassandra is being used as an updatable hot
>>>> data store to ensure that duplicates are taken care of and idempotency is
>>>> maintained. Whether data was directly retrieved from Cassandra for
>>>> analytics, reports, or searches, it was not clear what its main use was.
>>>> Some also just used it for a staging area to populate downstream
>>>> tables in parquet format. The last thing I heard was that CQL was terrible,
>>>> so that rules out much use of direct queries against it.
>>>>
>>>
>>> I'm no C* expert, but I don't think CQL is meant for real analytics,
>>> just ease of use instead of plainly using the APIs. Even then, Kudu should
>>> beat it easily on big scans. Same for HBase. We've done benchmarks against
>>> the latter, not the former.
>>>
>>>
>>>>
>>>> As for our company, we have been looking for an updatable data store
>>>> for a long time that can be quickly queried directly either using Spark SQL
>>>> or Impala or some other SQL engine and still handle TB or PB of data
>>>> without performance degradation and many configuration headaches. For now,
>>>> we are using HBase to take on this role with Phoenix as a fast way to
>>>> directly query the data. I can see Kudu as the best way to fill this gap
>>>> easily, especially being the closest thing to other relational databases
>>>> out there in familiarity for the many SQL analytics people in our company.
>>>> The other alternative would be to go with AWS Redshift for the same
>>>> reasons, but it would come at a cost, of course. If we went with either
>>>> solutions, Kudu or Redshift, it would get rid of the need to extract from
>>>> HBase to parquet tables or export to PostgreSQL to support more of the SQL
>>>> language used by analysts or the reporting software we use.

Re: Spark on Kudu

2016-04-10 Thread Jean-Daniel Cryans
Yup, starting to get a good idea.

What are your DS folks looking for in terms of functionality related to
Spark? A SparkSQL integration that's as fully featured as Impala's? Do they
care about being able to insert into Kudu with SparkSQL or just being able to
query real fast? Anything more specific to Spark that I'm missing?

FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera all
our resources are committed to making things happen in time, and a more
fully featured Spark integration isn't in our plans during that period. I'm
really hoping someone in the community will help with Spark, the same way
we got a big contribution for the Flume sink.

J-D

On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim  wrote:

> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But, since
> it’s not “production-ready”, upper management doesn’t want to fully deploy
> it yet. They just want to keep an eye on it though. Kudu was so much
> simpler and easier to use in every aspect compared to HBase. Impala was
> great for the report writers and analysts to experiment with for the short
> time it was up. But, once again, the only blocker was the lack of Spark
> support for our Data Developers/Scientists. So, production-level data
> population won’t happen until then.
>
> I hope this helps you get an idea where I am coming from…
>
> Cheers,
> Ben
>
>
> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans 
> wrote:
>
> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim  wrote:
>
>> J-D,
>>
>> The main thing I hear is that Cassandra is being used as an updatable hot
>> data store to ensure that duplicates are taken care of and idempotency is
>> maintained. Whether data was directly retrieved from Cassandra for
>> analytics, reports, or searches, it was not clear what its main use was.
>> Some also just used it for a staging area to populate downstream
>> tables in parquet format. The last thing I heard was that CQL was terrible,
>> so that rules out much use of direct queries against it.
>>
>
> I'm no C* expert, but I don't think CQL is meant for real analytics, just
> ease of use instead of plainly using the APIs. Even then, Kudu should beat
> it easily on big scans. Same for HBase. We've done benchmarks against the
> latter, not the former.
>
>
>>
>> As for our company, we have been looking for an updatable data store for
>> a long time that can be quickly queried directly either using Spark SQL or
>> Impala or some other SQL engine and still handle TB or PB of data without
>> performance degradation and many configuration headaches. For now, we are
>> using HBase to take on this role with Phoenix as a fast way to directly
>> query the data. I can see Kudu as the best way to fill this gap easily,
>> especially being the closest thing to other relational databases out there
>> in familiarity for the many SQL analytics people in our company. The other
>> alternative would be to go with AWS Redshift for the same reasons, but it
>> would come at a cost, of course. If we went with either solutions, Kudu or
>> Redshift, it would get rid of the need to extract from HBase to parquet
>> tables or export to PostgreSQL to support more of the SQL language used by
>> analysts or the reporting software we use.
>>
>
> Ok, the usual then *smile*. Looks like we're not too far off with Kudu.
> Have you folks tried Kudu with Impala yet with those use cases?
>
>
>>
>> I hope this helps.
>>
>
> It does, thanks for the nice reply.
>
>
>>
>> Cheers,
>> Ben
>>
>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans 
>> wrote:
>>
>> Ha first time I'm hearing about SMACK. Inside Cloudera we like to refer
>> to "Impala + Kudu" as Kimpala, but yeah it's not as sexy. My colleagues who
>> were also there did say that the hype around Spark isn't dying down.
>>
>> There's definitely an overlap in the use cases that Cassandra, HBase, and
>> Kudu cater to. I wouldn't go as far as saying that C* is just an interim
>> solution for the use case you describe.
>>
>> Nothing significant happened in Kudu over the past month, it's a storage
>> engine so things move slowly *smile*. I'd love to see more contributions on
>> the Spark front. I know there's code out there that could be integrated in
>> kudu-spark, it just needs to land in gerrit. I'm sure folks will happily
>> review it.
>>
>> Do you have relevant experiences you can share? I'd love to learn more
>> about the use cases for which you envision using Kudu as a C* replacement.
>>
>> Thanks,
>>
>> J-D

Re: Spark on Kudu

2016-04-10 Thread Jean-Daniel Cryans
On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim  wrote:

> J-D,
>
> The main thing I hear is that Cassandra is being used as an updatable hot
> data store to ensure that duplicates are taken care of and idempotency is
> maintained. Whether data was directly retrieved from Cassandra for
> analytics, reports, or searches, it was not clear what its main use was.
> Some also just used it for a staging area to populate downstream
> tables in parquet format. The last thing I heard was that CQL was terrible,
> so that rules out much use of direct queries against it.
>

I'm no C* expert, but I don't think CQL is meant for real analytics, just
ease of use instead of plainly using the APIs. Even then, Kudu should beat
it easily on big scans. Same for HBase. We've done benchmarks against the
latter, not the former.


>
> As for our company, we have been looking for an updatable data store for a
> long time that can be quickly queried directly either using Spark SQL or
> Impala or some other SQL engine and still handle TB or PB of data without
> performance degradation and many configuration headaches. For now, we are
> using HBase to take on this role with Phoenix as a fast way to directly
> query the data. I can see Kudu as the best way to fill this gap easily,
> especially being the closest thing to other relational databases out there
> in familiarity for the many SQL analytics people in our company. The other
> alternative would be to go with AWS Redshift for the same reasons, but it
> would come at a cost, of course. If we went with either solutions, Kudu or
> Redshift, it would get rid of the need to extract from HBase to parquet
> tables or export to PostgreSQL to support more of the SQL language used by
> analysts or the reporting software we use.
>

Ok, the usual then *smile*. Looks like we're not too far off with Kudu.
Have you folks tried Kudu with Impala yet with those use cases?


>
> I hope this helps.
>

It does, thanks for the nice reply.


>
> Cheers,
> Ben
>
> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans 
> wrote:
>
> Ha first time I'm hearing about SMACK. Inside Cloudera we like to refer to
> "Impala + Kudu" as Kimpala, but yeah it's not as sexy. My colleagues who
> were also there did say that the hype around Spark isn't dying down.
>
> There's definitely an overlap in the use cases that Cassandra, HBase, and
> Kudu cater to. I wouldn't go as far as saying that C* is just an interim
> solution for the use case you describe.
>
> Nothing significant happened in Kudu over the past month, it's a storage
> engine so things move slowly *smile*. I'd love to see more contributions on
> the Spark front. I know there's code out there that could be integrated in
> kudu-spark, it just needs to land in gerrit. I'm sure folks will happily
> review it.
>
> Do you have relevant experiences you can share? I'd love to learn more
> about the use cases for which you envision using Kudu as a C* replacement.
>
> Thanks,
>
> J-D
>
> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim  wrote:
>
>> Hi J-D,
>>
>> My colleagues recently came back from Strata in San Jose. They told me
>> that everything was about Spark and there is a big buzz about the SMACK
>> stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that Cassandra
>> is just an interim solution as a low-latency, easily queried data store. I
>> was wondering if anything significant happened in regards to Kudu,
>> especially on the Spark front. Plus, can you come up with your own proposed
>> stack acronym to promote?
>>
>> Cheers,
>> Ben
>>
>>
>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans 
>> wrote:
>>
>> Hi Ben,
>>
>> AFAIK no one in the dev community committed to any timeline. I know of
>> one person on the Kudu Slack who's working on a better RDD, but that's
>> about it.
>>
>> Regards,
>>
>> J-D
>>
>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim  wrote:
>>
>>> Hi J-D,
>>>
>>> Quick question… Is there an ETA for KUDU-1214? I want to target a
>>> version of Kudu to begin real testing of Spark against it for our devs. At
>>> least, I can tell them what timeframe to anticipate.
>>>
>>> Just curious,
>>> *Benjamin Kim*
>>> *Data Solutions Architect*
>>>
>>> [a•mo•bee] *(n.)* the company defining digital marketing.
>>>
>>> *Mobile: +1 818 635 2900*
>>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |
>>> www.amobee.com
>>>
>>>

Re: Spark on Kudu

2016-04-09 Thread Jean-Daniel Cryans
Ha first time I'm hearing about SMACK. Inside Cloudera we like to refer to
"Impala + Kudu" as Kimpala, but yeah it's not as sexy. My colleagues who
were also there did say that the hype around Spark isn't dying down.

There's definitely an overlap in the use cases that Cassandra, HBase, and
Kudu cater to. I wouldn't go as far as saying that C* is just an interim
solution for the use case you describe.

Nothing significant happened in Kudu over the past month, it's a storage
engine so things move slowly *smile*. I'd love to see more contributions on
the Spark front. I know there's code out there that could be integrated in
kudu-spark, it just needs to land in gerrit. I'm sure folks will happily
review it.

Do you have relevant experiences you can share? I'd love to learn more
about the use cases for which you envision using Kudu as a C* replacement.

Thanks,

J-D

On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim  wrote:

> Hi J-D,
>
> My colleagues recently came back from Strata in San Jose. They told me
> that everything was about Spark and there is a big buzz about the SMACK
> stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that Cassandra
> is just an interim solution as a low-latency, easily queried data store. I
> was wondering if anything significant happened in regards to Kudu,
> especially on the Spark front. Plus, can you come up with your own proposed
> stack acronym to promote?
>
> Cheers,
> Ben
>
>
> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans 
> wrote:
>
> Hi Ben,
>
> AFAIK no one in the dev community committed to any timeline. I know of one
> person on the Kudu Slack who's working on a better RDD, but that's about it.
>
> Regards,
>
> J-D
>
> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim  wrote:
>
>> Hi J-D,
>>
>> Quick question… Is there an ETA for KUDU-1214? I want to target a version
>> of Kudu to begin real testing of Spark against it for our devs. At least, I
>> can tell them what timeframe to anticipate.
>>
>> Just curious,
>> *Benjamin Kim*
>> *Data Solutions Architect*
>>
>> [a•mo•bee] *(n.)* the company defining digital marketing.
>>
>> *Mobile: +1 818 635 2900*
>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |
>> www.amobee.com
>>
>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans 
>> wrote:
>>
>> The DStream stuff isn't there at all. I'm not sure if it's needed either.
>>
>> The kuduRDD is just leveraging the MR input format, ideally we'd use
>> scans directly.
>>
>> The SparkSQL stuff is there but it doesn't do any sort of pushdown. It's
>> really basic.
>>
>> The goal was to provide something for others to contribute to. We have
>> some basic unit tests that others can easily extend. None of us on the team
>> are Spark experts, but we'd be really happy to help anyone improve the
>> kudu-spark code.
>>
>> J-D
>>
>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim  wrote:
>>
>>> J-D,
>>>
>>> It looks like it fulfills most of the basic requirements (kudu RDD, kudu
>>> DStream) in KUDU-1214. Am I right? Besides shoring up more Spark SQL
>>> functionality (Dataframes) and doing the documentation, what more needs to
>>> be done? Optimizations?
>>>
>>> I believe that it’s a good place to start using Spark with Kudu and
>>> compare it to HBase with Spark (not clean).
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans 
>>> wrote:
>>>
>>> AFAIK no one is working on it, but we did manage to get this in for
>>> 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>
>>> It's a really simple wrapper, and yes you can use SparkSQL on Kudu, but
>>> it will require a lot more work to make it fast/useful.
>>>
>>> Hope this helps,
>>>
>>> J-D
>>>
>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim 
>>> wrote:
>>>
>>>> I see this KUDU-1214 <https://issues.cloudera.org/browse/KUDU-1214> 
>>>> targeted
>>>> for 0.8.0, but I see no progress on it. When this is complete, will this
>>>> mean that Spark will be able to work with Kudu both programmatically and as
>>>> a client via Spark SQL? Or is there more work that needs to be done on the
>>>> Spark side for it to work?
>>>>
>>>> Just curious.
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>>
>>>
>>>
>>
>>
>
>


Re: Please welcome Binglin Chang as a Kudu committer and PPMC member

2016-04-05 Thread Jean-Daniel Cryans
Welcome to the team, Binglin!

On Mon, Apr 4, 2016 at 9:11 PM, Todd Lipcon  wrote:

> Hi Kudu community,
>
> On behalf of the Apache Kudu PPMC, I am pleased to announce that Binglin
> Chang has been elected as our newest PPMC member and committer. Binglin has
> been contributing to Kudu steadily over the last year in many different
> ways -- from speaking at conferences, to finding and fixing bugs, to adding
> new features, Binglin has been a great asset to the project.
>
> Please join me in congratulating Binglin! Thank you for your work so far
> and we hope to see your involvement continue and grow over the coming
> years.
>
> -Todd and the rest of the PPMC
>


Re: How to enable per-column compression when create table from impala?

2016-03-31 Thread Jean-Daniel Cryans
Hi Darren,

It's currently not supported in Impala, but you can do it via the Java or
C++ clients.
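For the Java route, compression is declared per column while building the
schema. A hedged sketch — the master address, table and column names, and
the choice of LZ4 are all illustrative, not from this thread:

```java
import java.util.ArrayList;
import java.util.List;

import org.kududb.ColumnSchema;
import org.kududb.Schema;
import org.kududb.Type;
import org.kududb.client.CreateTableOptions;
import org.kududb.client.KuduClient;

public class CompressedColumnSketch {
  public static void main(String[] args) throws Exception {
    List<ColumnSchema> columns = new ArrayList<ColumnSchema>();
    columns.add(new ColumnSchema.ColumnSchemaBuilder("id", Type.INT32)
        .key(true)
        .build());
    // Per-column compression: LZ4 on the payload column only.
    columns.add(new ColumnSchema.ColumnSchemaBuilder("payload", Type.STRING)
        .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.LZ4)
        .build());

    // Master address and table name are hypothetical.
    KuduClient client =
        new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      client.createTable("compressed_table", new Schema(columns),
          new CreateTableOptions());
    } finally {
      client.shutdown();
    }
  }
}
```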

J-D

On Thu, Mar 31, 2016 at 1:21 AM, Darren Hoo  wrote:

> From the documentation
>
> http://getkudu.io/docs/schema_design.html#compression
>
> So how can I specify one column to be compressed when creating the table?
>
> I've looked through the documentation but could not figure out the exact
> SQL syntax.
>
> Any hints?
>


[ANNOUNCE] Apache Kudu (incubating) 0.7.1 released

2016-03-10 Thread Jean-Daniel Cryans
The Apache Kudu (incubating) team is happy to announce the release of Kudu
0.7.1!

Kudu is an open source storage engine for structured data which supports
low-latency random access together with efficient analytical access
patterns. It is designed within the context of the Apache Hadoop ecosystem
and supports many integrations with other data analytics projects both
inside and outside of the Apache Software Foundation.

This latest version fixes several bugs found during and after the release
of 0.7.0.

Download it here: http://getkudu.io/releases/0.7.1/

Regards,

The Apache Kudu (incubating) team



Kudu 0.7.0 binary packages now available via Cloudera

2016-03-02 Thread Jean-Daniel Cryans
Hello Kudu community,

(Putting my Cloudera hat on for this email)

We released the binary packages for Kudu; they are available where you'd
expect them. We also released a refresh of Impala Kudu; it's still based on
the pre-2.3.0 fork but contains some important bug fixes.

Please see the announcement on the cdh-user mailing list:
http://qnalist.com/questions/6309362/kudu-0-7-0-downstream-release-today

Cheers,

J-D


Re: Spark kudu: scanner not found

2016-03-02 Thread Jean-Daniel Cryans
This looks like: https://issues.apache.org/jira/browse/KUDU-1343

We're rolling out a 0.7.1 which will have the fix. It's also really easy to
patch if you're already building your own java client.

Thanks,

J-D

On Wed, Mar 2, 2016 at 9:35 AM, Darren Hoo  wrote:

> When accessing Kudu from Spark SQL, the task throws this error:
>
>
> 16/03/03 01:26:52 WARN client.AsyncKuduScanner:
> 5d3871ed20a642c28da2f711e3af712f pretends to not know
> KuduScanner(table=contents, tablet=5d3871ed20a642c28da2f711e3af712f,
> scannerId="2dca9145edf2469789ff851a2db2542a",
> scanRequestTimeout=1)
> org.kududb.client.TabletServerErrorException:
> Server[8c09eaddf6994d3583b4073447475f8d] NOT_FOUND[code 1]: Scanner
> not found
> at
> org.kududb.client.TabletClient.dispatchTSErrorOrReturnException(TabletClient.java:461)
> at org.kududb.client.TabletClient.decode(TabletClient.java:412)
> at org.kududb.client.TabletClient.decode(TabletClient.java:82)
> at
> org.kududb.client.shaded.org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500)
> at
> org.kududb.client.shaded.org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
> at
> org.kududb.client.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> at org.kududb.client.TabletClient.handleUpstream(TabletClient.java:592)
> at
> org.kududb.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
> at
> org.kududb.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
> at
> org.kududb.client.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
> at
> org.kududb.client.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
> at
> org.kududb.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
> at
> org.kududb.client.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
> at
> org.kududb.client.AsyncKuduClient$TabletClientPipeline.sendUpstream(AsyncKuduClient.java:1647)
> at
> org.kududb.client.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
> at
> org.kududb.client.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
> at
> org.kududb.client.shaded.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
> at
> org.kududb.client.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
> at
> org.kududb.client.shaded.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
> at
> org.kududb.client.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
> at
> org.kududb.client.shaded.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
> at
> org.kududb.client.shaded.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> at
> org.kududb.client.shaded.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 16/03/03 01:26:52 ERROR executor.Executor: Exception in task 1.0 in
> stage 3.0 (TID 38)
> java.io.IOException: Couldn't get scan data
> at
> org.kududb.mapreduce.KuduTableInputFormat$TableRecordReader.tryRefreshIterator(KuduTableInputFormat.java:422)
> at
> org.kududb.mapreduce.KuduTableInputFormat$TableRecordReader.nextKeyValue(KuduTableInputFormat.java:401)
> at
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:163)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
> at
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at or

Re: Kudu 0.7.0

2016-03-01 Thread Jean-Daniel Cryans
Yup, hold on! :)

On Tue, Mar 1, 2016 at 1:21 PM, Benjamin Kim  wrote:

> Is there a special version of Impala coming out too?
>
> Thanks,
> Ben
>
>
> On Mar 1, 2016, at 9:51 AM, Jean-Daniel Cryans 
> wrote:
>
> It will be available very soon. The thing is we (Cloudera) can't start the
> binaries release process until the source release has been voted on.
>
> J-D
>
> On Tue, Mar 1, 2016 at 9:42 AM, Benjamin Kim  wrote:
>
>> Is the CSD for Cloudera Manager available for Kudu 0.7.0 or can we just
>> add the URL to the parcel list?
>>
>> Thanks,
>> Ben
>>
>>
>
>


Re: Spark on Kudu

2016-03-01 Thread Jean-Daniel Cryans
Hi Ben,

AFAIK no one in the dev community committed to any timeline. I know of one
person on the Kudu Slack who's working on a better RDD, but that's about it.

Regards,

J-D

On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim  wrote:

> Hi J-D,
>
> Quick question… Is there an ETA for KUDU-1214? I want to target a version
> of Kudu to begin real testing of Spark against it for our devs. At least, I
> can tell them what timeframe to anticipate.
>
> Just curious,
> *Benjamin Kim*
> *Data Solutions Architect*
>
> [a•mo•bee] *(n.)* the company defining digital marketing.
>
> *Mobile: +1 818 635 2900*
> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |
> www.amobee.com
>
> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans 
> wrote:
>
> The DStream stuff isn't there at all. I'm not sure if it's needed either.
>
> The kuduRDD is just leveraging the MR input format, ideally we'd use scans
> directly.
>
> The SparkSQL stuff is there but it doesn't do any sort of pushdown. It's
> really basic.
>
> The goal was to provide something for others to contribute to. We have
> some basic unit tests that others can easily extend. None of us on the team
> are Spark experts, but we'd be really happy to help anyone improve the
> kudu-spark code.
>
> J-D
>
> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim  wrote:
>
>> J-D,
>>
>> It looks like it fulfills most of the basic requirements (kudu RDD, kudu
>> DStream) in KUDU-1214. Am I right? Besides shoring up more Spark SQL
>> functionality (Dataframes) and doing the documentation, what more needs to
>> be done? Optimizations?
>>
>> I believe that it’s a good place to start using Spark with Kudu and
>> compare it to HBase with Spark (not clean).
>>
>> Thanks,
>> Ben
>>
>>
>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans 
>> wrote:
>>
>> AFAIK no one is working on it, but we did manage to get this in for
>> 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>
>> It's a really simple wrapper, and yes you can use SparkSQL on Kudu, but
>> it will require a lot more work to make it fast/useful.
>>
>> Hope this helps,
>>
>> J-D
>>
>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim  wrote:
>>
>>> I see this KUDU-1214 <https://issues.cloudera.org/browse/KUDU-1214> targeted
>>> for 0.8.0, but I see no progress on it. When this is complete, will this
>>> mean that Spark will be able to work with Kudu both programmatically and as
>>> a client via Spark SQL? Or is there more work that needs to be done on the
>>> Spark side for it to work?
>>>
>>> Just curious.
>>>
>>> Cheers,
>>> Ben
>>>
>>>
>>
>>
>
>


Re: Spark SQL on Kudu cannot contain nullable columns?

2016-03-01 Thread Jean-Daniel Cryans
Yeah, I didn't think about that. Are you volunteering, Todd? :P I can do it today.

J-D

On Tue, Mar 1, 2016 at 9:57 AM, Todd Lipcon  wrote:

> Perhaps we should target this for 0.7.1 as well, if we're going to do that
> follow-up release? Seems like it should be an easy fix (and client-side
> only)
>
> -Todd
>
> On Tue, Mar 1, 2016 at 9:29 AM, Jean-Daniel Cryans 
> wrote:
>
>> Ha yeah that's a good one. I opened this jira:
>> https://issues.apache.org/jira/browse/KUDU-1360
>>
>> Basically we forgot to check for nulls :)
>>
>> J-D
>>
>> On Tue, Mar 1, 2016 at 9:18 AM, Darren Hoo  wrote:
>>
>>> Spark SQL on Kudu cannot contain nullable columns?
>>>
>>> I've created a table in Kudu (0.6.0) which has nullable columns,
>>> when I try to use spark sql (using kudu java client 0.7.0) like this:
>>>
>>> sqlContext.load("org.kududb.spark",Map("kudu.table" -> "contents",
>>> "kudu.master" -> "master1:7051")).registerTempTable("contents")
>>> sqlContext.sql("SELECT * FROM contents limit 10").collectAsList()
>>>
>>> I got this error:
>>>
>>> 16/03/02 00:45:42 INFO DAGScheduler: Job 4 failed: collect at
>>> <console>:20, took 11.813423 s
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>>> 7.0 (TID 62, slave29): java.lang.IllegalArgumentException: The requested
>>> column (4)  is null
>>> at org.kududb.client.RowResult.checkNull(RowResult.java:475)
>>> at org.kududb.client.RowResult.getString(RowResult.java:321)
>>> at org.kududb.client.RowResult.getString(RowResult.java:308)
>>> at org.kududb.spark.KuduRelation.org
>>> $kududb$spark$KuduRelation$$getKuduValue(DefaultSource.scala:144)
>>> at
>>> org.kududb.spark.KuduRelation$$anonfun$buildScan$1$$anonfun$apply$1.apply(DefaultSource.scala:126)
>>> at
>>> org.kududb.spark.KuduRelation$$anonfun$buildScan$1$$anonfun$apply$1.apply(DefaultSource.scala:126)
>>> at
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at
>>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>>> at
>>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>> at
>>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>> at
>>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>> at
>>> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>>> at
>>> org.kududb.spark.KuduRelation$$anonfun$buildScan$1.apply(DefaultSource.scala:126)
>>> at
>>> org.kududb.spark.KuduRelation$$anonfun$buildScan$1.apply(DefaultSource.scala:124)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>> at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>> at
>>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>> at
>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>> at
>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>> at scala.collection.TraversableOnce$class.to
>>> (TraversableOnce.scala:273)
>>> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>>> at
>>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>>> at
>>> scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>>> at
>>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>>> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>>> at
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>>> at
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>>> at
>>> org

Re: Kudu 0.7.0

2016-03-01 Thread Jean-Daniel Cryans
It will be available very soon. The thing is we (Cloudera) can't start the
binaries release process until the source release has been voted on.

J-D

On Tue, Mar 1, 2016 at 9:42 AM, Benjamin Kim  wrote:

> Is the CSD for Cloudera Manager available for Kudu 0.7.0 or can we just
> add the URL to the parcel list?
>
> Thanks,
> Ben
>
>


Re: Spark SQL on Kudu cannot contain nullable columns?

2016-03-01 Thread Jean-Daniel Cryans
Ha yeah that's a good one. I opened this jira:
https://issues.apache.org/jira/browse/KUDU-1360

Basically we forgot to check for nulls :)

J-D

On Tue, Mar 1, 2016 at 9:18 AM, Darren Hoo  wrote:

> Spark SQL on Kudu cannot contain nullable columns?
>
> I've created a table in Kudu (0.6.0) which has nullable columns,
> when I try to use spark sql (using kudu java client 0.7.0) like this:
>
> sqlContext.load("org.kududb.spark",Map("kudu.table" -> "contents",
> "kudu.master" -> "master1:7051")).registerTempTable("contents")
> sqlContext.sql("SELECT * FROM contents limit 10").collectAsList()
>
> I got this error:
>
> 16/03/02 00:45:42 INFO DAGScheduler: Job 4 failed: collect at
> <console>:20, took 11.813423 s
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 7.0 (TID 62, slave29): java.lang.IllegalArgumentException: The requested
> column (4)  is null
> at org.kududb.client.RowResult.checkNull(RowResult.java:475)
> at org.kududb.client.RowResult.getString(RowResult.java:321)
> at org.kududb.client.RowResult.getString(RowResult.java:308)
> at org.kududb.spark.KuduRelation.org
> $kududb$spark$KuduRelation$$getKuduValue(DefaultSource.scala:144)
> at
> org.kududb.spark.KuduRelation$$anonfun$buildScan$1$$anonfun$apply$1.apply(DefaultSource.scala:126)
> at
> org.kududb.spark.KuduRelation$$anonfun$buildScan$1$$anonfun$apply$1.apply(DefaultSource.scala:126)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at
> org.kududb.spark.KuduRelation$$anonfun$buildScan$1.apply(DefaultSource.scala:126)
> at
> org.kududb.spark.KuduRelation$$anonfun$buildScan$1.apply(DefaultSource.scala:124)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to
> (TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Is this due to a version incompatibility between my Kudu server (0.6.0) and
> java client (0.7.0)?
>
>
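Until the KUDU-1360 fix lands in the connector, callers can guard against this themselves: check `isNull` before invoking a typed getter. The sketch below uses a tiny stand-in `Row` class (not the real `org.kududb.client.RowResult`, whose typed getters throw `IllegalArgumentException` on null cells, as the trace above shows) purely to illustrate the null-guard pattern:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for RowResult: like the real client, the typed getter throws
// when the cell is null, so callers must check isNull() first.
class Row {
    private final Map<Integer, String> cells = new HashMap<>();

    void put(int col, String v) { cells.put(col, v); }

    boolean isNull(int col) { return cells.get(col) == null; }

    String getString(int col) {
        if (isNull(col)) {
            throw new IllegalArgumentException(
                "The requested column (" + col + ") is null");
        }
        return cells.get(col);
    }
}

public class NullGuard {
    // The defensive pattern: map a null cell to a Java null instead of
    // letting the typed getter throw.
    static String safeString(Row r, int col) {
        return r.isNull(col) ? null : r.getString(col);
    }

    public static void main(String[] args) {
        Row r = new Row();
        r.put(0, "hello");
        System.out.println(safeString(r, 0)); // prints: hello
        System.out.println(safeString(r, 4)); // prints: null
    }
}
```

The same check is what the Spark data source's value-extraction code needs before converting each Kudu cell to a Spark SQL value.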


Re: help on building Kudu Java Client 0.7.0

2016-03-01 Thread Jean-Daniel Cryans
See step 2: http://getkudu.io/docs/installation.html#rhel_from_source

Patching the client might just be easier than building all of the thirdparty dependencies.

J-D

On Tue, Mar 1, 2016 at 9:10 AM, Darren Hoo  wrote:

> Thanks to Mike and Jean for your tip, I have built the java client now.
>
> But I still have some difficulty building the third parties, which I bypassed
> before building the Java client, especially build_llvm:
>
> -- Looking for __atomic_fetch_add_4 in atomic
> -- Looking for __atomic_fetch_add_4 in atomic - not found
> CMake Error at cmake/modules/CheckAtomic.cmake:36 (message):
>   Host compiler appears to require libatomic, but cannot find it.
> Call Stack (most recent call first):
>   cmake/config-ix.cmake:291 (include)
>   CMakeLists.txt:360 (include)
>
> -- Configuring incomplete, errors occurred!
>
> I am using the 0.7.0 kudu release source tarball, and my platform is
> CentOS 6.6
>
> $  gcc -v
> Using built-in specs.
> Target: x86_64-redhat-linux
> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
> --infodir=/usr/share/info --with-bugurl=
> http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared
> --enable-threads=posix --enable-checking=release --with-system-zlib
> --enable-__cxa_atexit --disable-libunwind-exceptions
> --enable-gnu-unique-object
> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada
> --enable-java-awt=gtk --disable-dssi
> --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre
> --enable-libgcj-multifile --enable-java-maintainer-mode
> --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib
> --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686
> --build=x86_64-redhat-linux
> Thread model: posix
> gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC)
>
> $ ls /usr/lib64/libatomic* -hla
>
> lrwxrwxrwx 1 root root  18 Mar  1 21:21 /usr/lib64/libatomic.so.1 ->
> libatomic.so.1.1.0
>
> -rwxr-xr-x 1 root root 24K Jul 24  2015 /usr/lib64/libatomic.so.1.1.0
>
>
>
> I googled around for a while but did not find any solutions. Any ideas?
>
>


Re: help on building Kudu Java Client 0.7.0

2016-03-01 Thread Jean-Daniel Cryans
Hi Darren,

This was fixed in
https://github.com/cloudera/kudu/commit/7a0244c8c539dd800b7269c32a6826d2fdad43d9

If you can't apply the patch, the workaround is to build the third parties.

J-D

On Tue, Mar 1, 2016 at 1:30 AM, Darren Hoo  wrote:

> I have installed the exact version 2.6.1 of protoc
>
> $ which protoc
> /usr/local/bin/protoc
>
> $ protoc --version
> libprotoc 2.6.1
>
> but when I try to build the Java Client, I got this error:
>
>
> [INFO] Kudu ... SUCCESS [
> 1.375 s]
> [INFO] Kudu Annotations ... SUCCESS [
> 0.511 s]
> [INFO] Kudu Java Client ... FAILURE [
> 0.407 s]
> [INFO] Kudu's MapReduce bindings .. SKIPPED
> [INFO] Collection of tools that interact directly with Kudu SKIPPED
> [INFO] Kudu Spark Bindings  SKIPPED
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 2.520 s
> [INFO] Finished at: 2016-03-01T17:24:18+08:00
> [INFO] Final Memory: 35M/1932M
> [INFO]
> 
> [ERROR] Failed to execute goal
> com.google.protobuf.tools:maven-protoc-plugin:0.1.10:compile (default)
> on project kudu-client: protoc failed to execute because: null:
> IllegalArgumentException -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to
> execute goal com.google.protobuf.tools:maven-protoc-plugin:0.1.10:compile
> (default) on project kudu-client: protoc failed to execute because:
> null
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
> at
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> at
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: org.apache.maven.plugin.MojoFailureException: protoc failed
> to execute because: null
> at
> com.google.protobuf.maven.AbstractProtocMojo.execute(AbstractProtocMojo.java:175)
> at
> com.google.protobuf.maven.ProtocCompileMojo.execute(ProtocCompileMojo.java:21)
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
> ... 20 more
> Caused by: java.lang.IllegalArgumentException
> at
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:70)
> at
> com.google.protobuf.maven.Protoc$Builder.addProtoPathElement(Protoc.java:191)
> at
> com.google.protobuf.maven.Protoc$Builder.addProtoPathElements(Protoc.java:201)
> at
> com.google.protobuf.maven.AbstractProtocMojo.execute(AbstractProtocMojo.java:157)
> ... 23 more
> [ERROR]
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
>
>
> what's the problem?
>


[ANNOUNCE] Apache Kudu (incubating) 0.7.0 released

2016-02-26 Thread Jean-Daniel Cryans
The Apache Kudu (incubating) team is happy to announce its first release as
part of the ASF Incubator, version 0.7.0!

Kudu is an open source storage engine for structured data which supports
low-latency random access together with efficient analytical access
patterns. It is designed within the context of the Apache Hadoop ecosystem
and supports many integrations with other data analytics projects both
inside and outside of the Apache Software Foundation.

This latest version adds limited support for Apache Spark, makes it
possible to build Kudu on more platforms, has a completely revamped Python
client, has improvements to both C++ and Java client libraries, and fixes
many bugs.

Download it here: http://getkudu.io/releases/0.7.0/

Regards,

The Apache Kudu (incubating) team

===

Apache Kudu (incubating) is an effort undergoing incubation at The Apache
Software
Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is
required of all newly accepted projects until a further review
indicates that the infrastructure, communications, and decision making
process have stabilized in a manner consistent with other successful
ASF projects. While incubation status is not necessarily a reflection
of the completeness or stability of the code, it does indicate that
the project has yet to be fully endorsed by the ASF.


Re: Spark on Kudu

2016-02-24 Thread Jean-Daniel Cryans
The DStream stuff isn't there at all. I'm not sure if it's needed either.

The kuduRDD is just leveraging the MR input format, ideally we'd use scans
directly.

The SparkSQL stuff is there but it doesn't do any sort of pushdown. It's
really basic.

The goal was to provide something for others to contribute to. We have some
basic unit tests that others can easily extend. None of us on the team are
Spark experts, but we'd be really happy to help anyone improve the
kudu-spark code.

J-D

On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim  wrote:

> J-D,
>
> It looks like it fulfills most of the basic requirements (kudu RDD, kudu
> DStream) in KUDU-1214. Am I right? Besides shoring up more Spark SQL
> functionality (Dataframes) and doing the documentation, what more needs to
> be done? Optimizations?
>
> I believe that it’s a good place to start using Spark with Kudu and
> compare it to HBase with Spark (not clean).
>
> Thanks,
> Ben
>
>
> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans 
> wrote:
>
> AFAIK no one is working on it, but we did manage to get this in for 0.7.0:
> https://issues.cloudera.org/browse/KUDU-1321
>
> It's a really simple wrapper, and yes you can use SparkSQL on Kudu, but it
> will require a lot more work to make it fast/useful.
>
> Hope this helps,
>
> J-D
>
> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim  wrote:
>
>> I see this KUDU-1214 <https://issues.cloudera.org/browse/KUDU-1214> targeted
>> for 0.8.0, but I see no progress on it. When this is complete, will this
>> mean that Spark will be able to work with Kudu both programmatically and as
>> a client via Spark SQL? Or is there more work that needs to be done on the
>> Spark side for it to work?
>>
>> Just curious.
>>
>> Cheers,
>> Ben
>>
>>
>
>


Re: Spark on Kudu

2016-02-24 Thread Jean-Daniel Cryans
AFAIK no one is working on it, but we did manage to get this in for 0.7.0:
https://issues.cloudera.org/browse/KUDU-1321

It's a really simple wrapper, and yes you can use SparkSQL on Kudu, but it
will require a lot more work to make it fast/useful.

Hope this helps,

J-D

On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim  wrote:

> I see this KUDU-1214  targeted
> for 0.8.0, but I see no progress on it. When this is complete, will this
> mean that Spark will be able to work with Kudu both programmatically and as
> a client via Spark SQL? Or is there more work that needs to be done on the
> Spark side for it to work?
>
> Just curious.
>
> Cheers,
> Ben
>
>
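For reference, the basic wrapper that did ship in 0.7.0 is driven through Spark SQL's generic data source API. A minimal spark-shell sketch, mirroring the invocation that appears elsewhere in this archive and assuming a running Kudu master at `master1:7051` and an existing table named `contents` (Spark 1.x API):

```scala
// "org.kududb.spark" is the data source package in kudu-spark 0.7.0;
// kudu.master and kudu.table are the options it reads.
val df = sqlContext.load(
  "org.kududb.spark",
  Map("kudu.table" -> "contents", "kudu.master" -> "master1:7051"))

df.registerTempTable("contents")
sqlContext.sql("SELECT * FROM contents LIMIT 10").show()
```

Note this requires a live cluster, and as discussed above the scan goes through the MR input format with no predicate pushdown yet.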


Re: Kudu Release

2016-02-23 Thread Jean-Daniel Cryans
Hi Jordan, Alejandro,

This is not a mistake: security is not on my proposed roadmap for 1.0. The
reasoning for 1.0 is that "enough people" would want to deploy it in
production. Obviously some are already doing it, so the flip side is that
some folks wouldn't be able to deploy it yet. Speculating a little, I could
see security as a 2.0 feature.
But, as I said in my email, if anyone shows up with different priorities
(and more importantly patches), it could shape a different 1.0.

Hope this helps,

J-D

On Tue, Feb 23, 2016 at 11:52 AM, Alejandro de la Vina <
a.delav...@globant.com> wrote:

> My employer and I share Jordan's feeling on this.
> I am most likely missing a particular issue/request on this matter since I
> still have a late Fall/early Winter expectation.
> Cheers.
>
>
> On Tue, Feb 23, 2016 at 4:45 PM, Jordan Birdsell <
> jordan.birdsell.k...@statefarm.com> wrote:
>
>> Is there any intention to add some form of security with 1.0?  I did not
>> see any mention of this in the thread.  I think this would be a must have
>> for most adopters, including my company.
>>
>>
>>
>> *From:* Jean-Daniel Cryans [mailto:jdcry...@apache.org]
>> *Sent:* Tuesday, February 23, 2016 12:38 PM
>> *To:* user@kudu.incubator.apache.org
>> *Subject:* Re: Kudu Release
>>
>>
>>
>> Thanks Ben :)
>>
>>
>>
>> Hopefully we can release 0.7.0 this week, the vote is currently happening
>> on the incubator general mailing list and closes tomorrow.
>>
>>
>>
>> J-D
>>
>>
>>
>> On Tue, Feb 23, 2016 at 9:27 AM, Benjamin Kim  wrote:
>>
>> Jean,
>>
>>
>>
>> Very organized outline. Looking forward to the 0.7 release. I am hoping
>> that most of your points are addressed and completed by 1.0 release this
>> fall.
>>
>>
>>
>> Thanks,
>>
>> Ben
>>
>>
>>
>>
>>
>> On Feb 23, 2016, at 8:31 AM, Jean-Daniel Cryans 
>> wrote:
>>
>>
>>
>> Hi Ben,
>>
>>
>>
>> Please see this thread on the dev list:
>> http://mail-archives.apache.org/mod_mbox/incubator-kudu-dev/201602.mbox/%3CCAGpTDNcMBWwX8p%2ByGKzHfL2xcmKTScU-rhLcQFSns1UVSbrXhw%40mail.gmail.com%3E
>>
>>
>>
>> Thanks,
>>
>>
>>
>> J-D
>>
>>
>>
>> On Tue, Feb 23, 2016 at 8:23 AM, Benjamin Kim  wrote:
>>
>> Any word as to the release roadmap?
>>
>> Thanks,
>> Ben
>>
>>
>>
>>
>>
>>
>>
>
>
>
>
> The information contained in this e-mail may be confidential. It has been
> sent for the sole use of the intended recipient(s). If the reader of this
> message is not an intended recipient, you are hereby notified that any
> unauthorized review, use, disclosure, dissemination, distribution or
> copying of this communication, or any of its contents,
> is strictly prohibited. If you have received it by mistake please let us
> know by e-mail immediately and delete it from your system. Many thanks.
>
>
>
> La información contenida en este mensaje puede ser confidencial. Ha sido
> enviada para el uso exclusivo del destinatario(s) previsto. Si el lector de
> este mensaje no fuera el destinatario previsto, por el presente queda Ud.
> notificado que cualquier lectura, uso, publicación, diseminación,
> distribución o copiado de esta comunicación o su contenido está
> estrictamente prohibido. En caso de que Ud. hubiera recibido este mensaje
> por error le agradeceremos notificarnos por e-mail inmediatamente y
> eliminarlo de su sistema. Muchas gracias.
>
>


Re: Kudu Release

2016-02-23 Thread Jean-Daniel Cryans
Thanks Ben :)

Hopefully we can release 0.7.0 this week, the vote is currently happening
on the incubator general mailing list and closes tomorrow.

J-D

On Tue, Feb 23, 2016 at 9:27 AM, Benjamin Kim  wrote:

> Jean,
>
> Very organized outline. Looking forward to the 0.7 release. I am hoping
> that most of your points are addressed and completed by 1.0 release this
> fall.
>
> Thanks,
> Ben
>
>
> On Feb 23, 2016, at 8:31 AM, Jean-Daniel Cryans 
> wrote:
>
> Hi Ben,
>
> Please see this thread on the dev list:
> http://mail-archives.apache.org/mod_mbox/incubator-kudu-dev/201602.mbox/%3CCAGpTDNcMBWwX8p%2ByGKzHfL2xcmKTScU-rhLcQFSns1UVSbrXhw%40mail.gmail.com%3E
>
> Thanks,
>
> J-D
>
> On Tue, Feb 23, 2016 at 8:23 AM, Benjamin Kim  wrote:
>
>> Any word as to the release roadmap?
>>
>> Thanks,
>> Ben
>>
>
>
>


Re: Kudu Release

2016-02-23 Thread Jean-Daniel Cryans
Hi Ben,

Please see this thread on the dev list:
http://mail-archives.apache.org/mod_mbox/incubator-kudu-dev/201602.mbox/%3CCAGpTDNcMBWwX8p%2ByGKzHfL2xcmKTScU-rhLcQFSns1UVSbrXhw%40mail.gmail.com%3E

Thanks,

J-D

On Tue, Feb 23, 2016 at 8:23 AM, Benjamin Kim  wrote:

> Any word as to the release roadmap?
>
> Thanks,
> Ben
>


Re: Version of Protobuf in the Java client

2016-02-11 Thread Jean-Daniel Cryans
Exactly: it's an internal dependency, an implementation detail if you are just
consuming the jars. We've been careful not to even expose PB objects in the
public APIs.

J-D

On Thu, Feb 11, 2016 at 8:04 AM, Andrea Ferretti 
wrote:

> Thank you! So, if I understand correctly, version 2.6.1 is only needed
> for building, while consumers are free to mix other versions, right?
>
> 2016-02-11 17:01 GMT+01:00 Jean-Daniel Cryans :
> > My memory is a little fuzzy on why we require 2.6.1 specifically, the "it
> > needs to be the exact version" language came with this commit from Julien
> > without comments:
> >
> https://github.com/cloudera/kudu/commit/88a99036dda648f1ddbe7e17098de523994c0631
> >
> > The move up to that version happened in:
> >
> https://github.com/cloudera/kudu/commit/d92077ae93f095ff686d0dc7977712f4b55da0a0
> >
> > The latter commit explains how we use shading so that having other
> protobuf
> > versions on the classpath won't break the Java client.
> >
> > Hope this helps,
> >
> > J-D
> >
> > On Thu, Feb 11, 2016 at 7:15 AM, Andrea Ferretti <
> ferrettiand...@gmail.com>
> > wrote:
> >>
> >> I see that the instructions for the Java client mention "protobuf
> >> 2.6.1 (it needs to be the exact version)".
> >>
> >> Why is the exact version needed? Is there any chance that it will work
> >> at runtime having different versions of protobuf in the classpath,
> >> such as 2.5.0? What about newer versions such as protobuf 3?
> >
> >
>


Re: Version of Protobuf in the Java client

2016-02-11 Thread Jean-Daniel Cryans
My memory is a little fuzzy on why we require 2.6.1 specifically, the "it
needs to be the exact version" language came with this commit from Julien
without comments:
https://github.com/cloudera/kudu/commit/88a99036dda648f1ddbe7e17098de523994c0631

The move up to that version happened in:
https://github.com/cloudera/kudu/commit/d92077ae93f095ff686d0dc7977712f4b55da0a0

The latter commit explains how we use shading so that having other protobuf
versions on the classpath won't break the Java client.

Hope this helps,

J-D

On Thu, Feb 11, 2016 at 7:15 AM, Andrea Ferretti 
wrote:

> I see that the instructions for the Java client mention "protobuf
> 2.6.1 (it needs to be the exact version)".
>
> Why is the exact version needed? Is there any chance that it will work
> at runtime having different versions of protobuf in the classpath,
> such as 2.5.0? What about newer versions such as protobuf 3?
>
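The shading technique J-D refers to can be seen in the commit linked above. As a rough illustration only (the `shadedPattern` package name below is made up, not the one kudu-client actually uses), a maven-shade-plugin relocation that keeps a bundled protobuf from clashing with a different version on the application classpath looks like:

```xml
<!-- Illustrative sketch: relocate the bundled protobuf classes into a
     private package so the application can put protobuf 2.5.0 or 3.x on
     its own classpath without conflict. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>com.google.protobuf</pattern>
        <!-- hypothetical target package, for illustration -->
        <shadedPattern>org.kududb.shaded.com.google.protobuf</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

This is why the 2.6.1 requirement only matters when building the client from source, not when consuming the published jars.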


Re: Install kudu using cloudera manager parcels

2016-01-06 Thread Jean-Daniel Cryans
Hi Sun,

I replied to your question on the Cloudera Community forum:
http://community.cloudera.com/t5/Beta-Releases-Apache-Kudu/Kudu-install-using-cloudera-manager-reports-parcel-not-available/m-p/35870#M102

But yeah Cloudera doesn't provide RHEL 7 packages at the moment. AFAIK
building from source works on that platform.

J-D

On Tue, Jan 5, 2016 at 10:48 PM, Fulin Sun  wrote:

> Hi,  experts
> I am asking a question about using Cloudera Manager to install Kudu,
> especially using parcels.
> I followed the instruction hints here:
> http://www.cloudera.com/content/www/en-us/documentation/betas/kudu/latest/topics/kudu_installation.html#concept_vrz_wbq_dt_unique_2
>
>
>
> However, when I load the CSD file for Kudu and restart the Cloudera Manager
> server, then go to Hosts -> Parcels, I find that the error reported is:
>
> Error for parcel KUDU-0.6.0-1.kudu0.6.0.p0.334-el7 : Parcel not available for 
> OS Distribution RHEL7.
>
> Does this mean that the Kudu parcel is not supported on RHEL 7?
> My environment: Cloudera Manager 5.5.1, CDH 5.5.1, CentOS 7.1
>
> Please help me on this and show some alternative workaround.
> Best,
> Sun.
>
> --
>
> CertusNet
>
>
>