Re: Spark on Kudu Roadmap

2017-03-27 Thread Benjamin Kim
Hi Mike,

I believe what we are looking for is the following. It is an often-requested
use case.

Does anyone know whether the Spark package will ever allow creating tables in
Spark SQL?

Such as:
   CREATE EXTERNAL TABLE 
   USING org.apache.kudu.spark.kudu
   OPTIONS (Map("kudu.master" -> "", "kudu.table" -> "table-name"));

This way, plain SQL could be used for DDL and DML statements, whether in Spark 
SQL code or over JDBC against the Spark SQL Thrift server.

Thanks,
Ben
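
For illustration only, here is a minimal Python sketch of something close to
the request that may already be possible: Spark 2.x has a generic
CREATE TEMPORARY VIEW ... USING ... OPTIONS statement for data sources, and
the sketch assumes the kudu-spark data source accepts its "kudu.master" and
"kudu.table" options through it. That assumption is untested against a Kudu
cluster. The helper name kudu_temp_view_sql, the master address, and the
table and view names are hypothetical. Note that this only registers an
existing Kudu table for queries; it does not create a Kudu table.

```python
def kudu_temp_view_sql(view_name, kudu_master, kudu_table):
    """Build a Spark SQL statement that registers a Kudu table as a temp view.

    Assumption: the kudu-spark data source can be reached through Spark's
    generic CREATE TEMPORARY VIEW ... USING ... OPTIONS syntax (Spark 2.x).
    """
    def quote(value):
        # Escape backslashes and single quotes for a SQL string literal.
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

    # Keep the view name to letters, digits, and underscores.
    if not view_name.replace("_", "").isalnum():
        raise ValueError("view name must contain only letters, digits, and _")

    return (
        f"CREATE TEMPORARY VIEW {view_name}\n"
        "USING org.apache.kudu.spark.kudu\n"
        f"OPTIONS ('kudu.master' {quote(kudu_master)}, "
        f"'kudu.table' {quote(kudu_table)})"
    )


# Hypothetical master address and table name, for illustration only.
statement = kudu_temp_view_sql(
    "kudu_events", "kudu-master.example.com:7051", "events"
)
print(statement)
```

The resulting string could then be passed to spark.sql(...) in a Spark
session, or sent over JDBC to the Thrift server, after which the view can be
queried with plain SELECT statements.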


> On Mar 27, 2017, at 11:01 AM, Mike Percy wrote:
> 
> Hi Ben,
> Is there anything in particular you are looking for?
> 
> Thanks,
> Mike
> 
> On Mon, Mar 27, 2017 at 9:48 AM, Benjamin Kim wrote:
> Hi,
> 
> Are there any plans for deeper integration with Spark, especially Spark SQL? 
> Is there a roadmap to look at, so I can know what to expect in the future?
> 
> Cheers,
> Ben
> 



Re: How to calculate the optimal value of `maintenance_manager_num_threads`

2017-03-27 Thread Todd Lipcon
Hi Jason,

On Fri, Mar 24, 2017 at 1:39 AM, Jason Heo wrote:

> Hi,
>
> I'm using Apache Kudu 1.2 on CDH 5.10.
>
> Recently, after reading "Bulk write performance improvements for Kudu
> 1.4", I noticed that `maintenance_manager_num_threads` is set to 4 for
> the 5 spinning disks.
>
>
Yes, but I wouldn't take that as necessarily optimal. I'm now doing some
tests with 8 threads as a comparison point.


> In my cluster, each node has 10 SATA disks in RAID 1+0 (the WAL and data
> directories are located on the same partition). As Todd suggested, bulk
> loading is done in PK-sorted order. CPU usage and system load on my
> cluster are not high at the moment, so I think the thread count could be
> increased a little more.
>
> Would someone please suggest a value for my environment?
>

Increasing the number of maintenance threads may help if you are falling
behind on compaction and flushes. For compaction, you can tell if you are
falling behind by looking at the "bloom_lookups_per_op" metric. For
flushes, you may be falling behind if you see a lot of "memory pressure
rejections". One area for improvement in our tooling is adding some more
scripts and tools to make these types of diagnosis easier.
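
As a rough illustration of the two checks described above, here is a minimal
Python sketch that scans a tablet server metrics dump for both signals. It
rests on several assumptions: that the dump is a JSON list of entities, each
with a "type", an "id", and a "metrics" list; that the
"bloom_lookups_per_op" histogram carries a "mean" field; and that rejection
counters have names ending in "memory_pressure_rejections" with a "value"
field. The function name and the threshold of 2.0 lookups per op are
hypothetical, not recommended values.

```python
import json


def summarize_maintenance_backlog(metrics_json, bloom_mean_threshold=2.0):
    """Return per-tablet hints that compactions or flushes are falling behind.

    The JSON layout and the threshold are assumptions made for this sketch.
    """
    report = {}
    for entity in json.loads(metrics_json):
        # Only tablet-level entities are inspected here.
        if entity.get("type") != "tablet":
            continue
        bloom_mean = None
        rejections = 0
        for metric in entity.get("metrics", []):
            name = metric.get("name", "")
            if name == "bloom_lookups_per_op":
                bloom_mean = metric.get("mean")
            elif name.endswith("memory_pressure_rejections"):
                rejections += metric.get("value", 0)
        report[entity["id"]] = {
            "bloom_lookups_per_op_mean": bloom_mean,
            # Many bloom lookups per op suggests compaction is behind.
            "compaction_behind": (
                bloom_mean is not None and bloom_mean > bloom_mean_threshold
            ),
            "memory_pressure_rejections": rejections,
            # Any memory pressure rejection suggests flushes are behind.
            "flush_behind": rejections > 0,
        }
    return report


# Hypothetical sample data, shaped like the assumed metrics dump.
sample = json.dumps([
    {"type": "tablet", "id": "tablet-a", "metrics": [
        {"name": "bloom_lookups_per_op", "mean": 6.5},
        {"name": "leader_memory_pressure_rejections", "value": 12},
    ]},
    {"type": "tablet", "id": "tablet-b", "metrics": [
        {"name": "bloom_lookups_per_op", "mean": 1.0},
        {"name": "leader_memory_pressure_rejections", "value": 0},
    ]},
    {"type": "server", "id": "kudu.tabletserver", "metrics": []},
])
report = summarize_maintenance_backlog(sample)
for tablet_id, hints in sorted(report.items()):
    print(tablet_id, hints)
```

Comparing such a report before and after changing
`maintenance_manager_num_threads` would give a first impression of whether
the extra threads are reducing the backlog.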

In general, it's a tradeoff: more MM threads means more resource
consumption, but possibly better performance. The tradeoff may be
non-linear, though (i.e., doubling MM threads won't double performance!).

As Kudu is still a young project, we're still gathering operational
experience from users around topics like this. It would be great if you can
share back any results you find with the community.

Thanks

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Kudu on top of Alluxio

2017-03-27 Thread Mike Percy
+1, thanks for adding that, Todd.

Mike


On Mon, Mar 27, 2017 at 9:55 AM, Todd Lipcon wrote:

> On Sat, Mar 25, 2017 at 2:54 PM, Mike Percy wrote:
>
>> Kudu currently relies on local storage on a POSIX file system. Right now
>> there is no support for S3, which would be interesting but is non-trivial
>> in certain ways (particularly if we wanted to rely on S3's replication and
>> disable Kudu's app-level replication).
>>
>> I would suggest using only the EXT4 or XFS file systems for production
>> deployments as of Kudu 1.3, in a JBOD configuration, with one SSD per
>> machine for the WAL and with the data disks on either SATA or SSD drives
>> depending on the workload. Anything else is untested AFAIK.
>>
>
> I would amend this and say that SSD for the WAL is nice to have, but not a
> requirement. We do lots of testing on non-SSD test clusters and I'm aware
> of many production clusters which also do not have SSD.
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>


Re: Spark on Kudu Roadmap

2017-03-27 Thread Mike Percy
Hi Ben,
Is there anything in particular you are looking for?

Thanks,
Mike

On Mon, Mar 27, 2017 at 9:48 AM, Benjamin Kim wrote:

> Hi,
>
> Are there any plans for deeper integration with Spark, especially Spark
> SQL? Is there a roadmap to look at, so I can know what to expect in the
> future?
>
> Cheers,
> Ben


Re: Kudu on top of Alluxio

2017-03-27 Thread Todd Lipcon
On Sat, Mar 25, 2017 at 2:54 PM, Mike Percy wrote:

> Kudu currently relies on local storage on a POSIX file system. Right now
> there is no support for S3, which would be interesting but is non-trivial
> in certain ways (particularly if we wanted to rely on S3's replication and
> disable Kudu's app-level replication).
>
> I would suggest using only the EXT4 or XFS file systems for production
> deployments as of Kudu 1.3, in a JBOD configuration, with one SSD per
> machine for the WAL and with the data disks on either SATA or SSD drives
> depending on the workload. Anything else is untested AFAIK.
>

I would amend this and say that SSD for the WAL is nice to have, but not a
requirement. We do lots of testing on non-SSD test clusters and I'm aware
of many production clusters which also do not have SSD.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Spark on Kudu Roadmap

2017-03-27 Thread Benjamin Kim
Hi,

Are there any plans for deeper integration with Spark, especially Spark SQL? Is 
there a roadmap to look at, so I can know what to expect in the future?

Cheers,
Ben