Re: [ANNOUNCE] Welcoming Abhishek Chennaka as Kudu committer and PMC member

2023-02-22 Thread Todd Lipcon
Congrats on the recognition of your work, Abhishek!

Todd

On Wed, Feb 22, 2023, 6:15 PM 邓科  wrote:

> Congrats Abhishek!!!
>
> > On Thu, Feb 23, 2023 at 08:05, Yingchun Lai wrote:
>
> > Congrats!
> >
> > > On Thu, Feb 23, 2023 at 03:07, Mahesh Reddy wrote:
> >
> > > Congrats Abhishek!!! Great work and well deserved!
> > >
> > > On Wed, Feb 22, 2023 at 9:12 AM Andrew Wong  wrote:
> > >
> > > > Hi Kudu community,
> > > >
> > > > I'm happy to announce that the Kudu PMC has voted to add Abhishek
> > > Chennaka
> > > > as a
> > > > new committer and PMC member.
> > > >
> > > > Some of Abhishek's contributions include:
> > > > - Improving the usability of ksck and backup/restore tooling
> > > > - Introducing bootstrapping metrics and webserver pages
> > > > - Adding alter-table support for the dynamic hash ranges project
> > > > - Driving the on-going effort for partition-local incrementing
> columns
> > > >
> > > > Abhishek has also been a helpful participant in the project Slack
> and a
> > > > steady
> > > > reviewer on Gerrit.
> > > >
> > > > Please join me in congratulating Abhishek!
> > > >
> > >
> > --
> > Best regards,
> > Yingchun Lai
> >
>


Re: Implications/downside of increasing rpc_service_queue_length

2020-04-20 Thread Todd Lipcon
Hi Mauricio,

Sorry for the late reply on this one. Hope "better late than never" is the
case here :)

As you implied in your email, the main issue with increasing the queue length
to deal with queue overflows is that it only helps with momentary spikes.
According to queueing theory (and intuition), if the rate of arrival of
entries into a queue is faster than the rate of processing items in that
queue, then the queue length will grow. If this is a transient phenomenon
(eg a quick burst of requests) then having a larger queue capacity will
prevent overflows, but if this is a persistent phenomenon, then there is no
queue length that is sufficient to prevent overflows. The one exception is
when the number of potential concurrent queue entries is itself bounded (eg
because there is a bounded number of clients).
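
In symbols (a rough sketch; \lambda = arrival rate, \mu = service rate):

  \rho = \lambda / \mu
  \rho \ge 1 \implies \text{queue length grows without bound, so any finite capacity eventually overflows}
  \rho < 1 \implies E[\text{requests in system}] = \rho / (1 - \rho) \text{ (textbook M/M/1), i.e. bounded}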

According to the above theory, the philosophy behind the default short
queue is that longer queues aren't a real solution if the cluster is
overloaded. That said, if you think that the issues are just transient
spikes rather than a capacity overload, it's possible that bumping the
queue length (eg to 100) can help here.

In terms of things to be aware of: having a longer queue means that the
amount of memory taken by entries in the queue increases proportionally.
Currently, that memory is not tracked as part of Kudu's MemTracker
infrastructure, but it does get accounted for in the global heap and can
push the server into "memory pressure" mode where requests will start
getting rejected, rowsets will get flushed, etc. I would recommend that if
you increase your queues you make sure that you have a relatively larger
memory limit allocated to your tablet servers, and watch out for log
messages and metrics indicating persistent memory pressure (particularly in
the 80%+ range, where things start getting dropped a lot).

Long queues are also potentially an issue for low-latency requests. The
longer the queue (in terms of items), the longer the latency of elements
waiting in that queue. If you have latency SLAs, you should monitor them
closely as you change the queue length configuration.
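
Little's law makes that concrete (a rough sketch; L = average number of
requests sitting in the queue, \lambda = arrival rate, W = average wait):

  L = \lambda W  \implies  W = L / \lambda

so at a fixed arrival rate, a queue that sits ten times as full adds roughly
ten times the queueing delay.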

Hope that helps

-Todd


Re: WAL size estimation

2019-06-26 Thread Todd Lipcon
Hey Pavel,

I went back and looked at the source here. It appears that 24MB is the
expected size for an index file -- each entry is 24 bytes and the index
file should keep 1M entries.

That said, for a "cold tablet" (in which you'd have only a small number of
actual WAL files) I would expect only a single index file. The example you
gave where you have 12 index files but only one WAL segment seems quite
fishy to me. Having 12 index files indicates you have 12M separate WAL
entries, but given you have only 8MB of WAL, that indicates each entry is
less than one byte large, which doesn't make much sense at all.
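
For reference, the arithmetic behind both statements (a rough check, assuming
the ~1M-entry index chunks mentioned above):

  24 \text{ bytes/entry} \times 1{,}000{,}000 \text{ entries} \approx 24 \text{ MB per full index file}

  \frac{8 \text{ MB of WAL}}{12 \times 10^6 \text{ entries}} < 1 \text{ byte/entry}

and a real WAL entry is far larger than one byte.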

If you go back and look at that same tablet now, did it eventually GC those
log index files?

-Todd



On Wed, Jun 19, 2019 at 1:53 AM Pavel Martynov  wrote:

> > Try adding the '-p' flag here? That should show preallocated extents.
> Would be interesting to run it on some index file which is larger than 1MB,
> for example.
>
> # du -h --apparent-size index.00108
> 23M index.00108
>
> # du -h index.00108
> 23M index.00108
>
> # xfs_bmap -v -p index.00108
> index.00108:
>  EXT: FILE-OFFSET      BLOCK-RANGE              AG AG-OFFSET          TOTAL FLAGS
>0: [0..2719]:   1175815920..1175818639  2 (3704560..3707279)  2720
> 0
>1: [2720..5111]:1175828904..1175831295  2 (3717544..3719935)  2392
> 0
>2: [5112..7767]:1175835592..1175838247  2 (3724232..3726887)  2656
> 0
>3: [7768..10567]:   1175849896..1175852695  2 (3738536..3741335)  2800
> 0
>4: [10568..15751]:  1175877808..1175882991  2 (3766448..3771631)  5184
> 0
>5: [15752..18207]:  1175898864..1175901319  2 (3787504..3789959)  2456
> 0
>6: [18208..20759]:  1175909192..1175911743  2 (3797832..3800383)  2552
> 0
>7: [20760..23591]:  1175921616..1175924447  2 (3810256..3813087)  2832
> 0
>8: [23592..26207]:  1175974872..1175977487  2 (3863512..3866127)  2616
> 0
>9: [26208..28799]:  1175989496..1175992087  2 (3878136..3880727)  2592
> 0
>   10: [28800..31199]:  1175998552..1176000951  2 (3887192..3889591)  2400
> 0
>   11: [31200..33895]:  1176008336..1176011031  2 (3896976..3899671)  2696
> 0
>   12: [33896..36591]:  1176031696..1176034391  2 (3920336..3923031)  2696
> 0
>   13: [36592..39191]:  1176037440..1176040039  2 (3926080..3928679)  2600
> 0
>   14: [39192..41839]:  1176072008..1176074655  2 (3960648..3963295)  2648
> 0
>   15: [41840..44423]:  1176097752..1176100335  2 (3986392..3988975)  2584
> 0
>   16: [44424..46879]:  1176132144..1176134599  2 (4020784..4023239)  2456
> 0
>
>
>
>
>
> On Wed, Jun 19, 2019 at 10:56, Todd Lipcon wrote:
>
>>
>>
>> On Wed, Jun 19, 2019 at 12:49 AM Pavel Martynov 
>> wrote:
>>
>>> Hi Todd, thanks for the answer!
>>>
>>> > Any chance you've done something like copy the files away and back
>>> that might cause them to lose their sparseness?
>>>
>>> No, I don't think so. Recently we experienced some stability problems
>>> with Kudu and ran the rebalancer a couple of times, if this is related.
>>> But we never used fs commands like cp/mv against Kudu dirs.
>>>
>>> I ran du on all-WALs dir:
>>> # du -sh /mnt/data01/kudu-tserver-wal/
>>> 12G /mnt/data01/kudu-tserver-wal/
>>>
>>> # du -sh --apparent-size /mnt/data01/kudu-tserver-wal/
>>> 25G /mnt/data01/kudu-tserver-wal/
>>>
>>> And on a WAL dir with many index files:
>>> # du -sh --apparent-size
>>> /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f
>>> 306M
>>>  /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f
>>>
>>> # du -sh
>>> /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f
>>> 296M
>>>  /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f
>>>
>>>
>>> > Also, any chance you're using XFS here?
>>>
>>> Yes, exactly XFS. We use CentOS 7.6.
>>>
>>> What is interesting is that there are not many holes in the index files in
>>> /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f (the WAL
>>> dir I mentioned before). Only a single hole in a single index file (of 13 files):
>>> # xfs_bmap -v index.00120
>>>
>>
>> Try adding the '-p' flag here? That should show preallocated extents.
>> Would be interesting to run it on some index file which is larger than 1MB,
>> for example.
>>
>>
>>> index.00120:
>>>  EXT: FILE-OFFSET  BLOCK-RANGEAG AG-OFFSET  

Re: Need information about internals of InList predicate

2019-06-26 Thread Todd Lipcon
Hi Sergey,

The optimization you're looking for is essentially to realize that IN-list
on a primary key prefix can be converted as follows:

scan(PK in (1,2,3)) ->
scan(PK = 1 OR PK = 2 OR PK = 3) ->
scan(PK = 1) union all scan(pk = 2) union all scan(PK = 3)

Currently, the tserver doesn't support conversion of a single user-facing
scan into multiple internal scan ranges in the general case. Doing so would
require a bit of surgery on the tablet server to understand the concept
that a scan has a set of disjoint PK ranges rather than a single range
associated. I filed a JIRA to support this here:
https://issues.apache.org/jira/browse/KUDU-2875
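
In the meantime, a client can approximate that plan itself by issuing one
equality scan per value. Here's a rough, untested sketch with the Java client
(the master address, table name, and "pk" INT64 column are placeholders):

import java.util.Arrays;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduPredicate;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResultIterator;

public class InListVsPointScans {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("master-host:7051").build();
    KuduTable table = client.openTable("my_table");
    ColumnSchema pk = table.getSchema().getColumn("pk");

    // What the tserver sees today: one scan carrying a single IN-list predicate.
    KuduScanner inListScan = client.newScannerBuilder(table)
        .addPredicate(KuduPredicate.newInListPredicate(pk, Arrays.asList(1L, 2L, 3L)))
        .build();
    drain(inListScan);

    // Manual "union of point scans": one equality scan per value, which is
    // conceptually what the optimization described above would do internally.
    for (long v : new long[] {1L, 2L, 3L}) {
      KuduScanner eqScan = client.newScannerBuilder(table)
          .addPredicate(KuduPredicate.newComparisonPredicate(
              pk, KuduPredicate.ComparisonOp.EQUAL, v))
          .build();
      drain(eqScan);
    }
    client.close();
  }

  private static void drain(KuduScanner scanner) throws Exception {
    while (scanner.hasMoreRows()) {
      RowResultIterator rows = scanner.nextRows();
      // ... process rows ...
    }
    scanner.close();
  }
}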

That said, there's a separate optimization which is simpler to implement,
which is to notice within a given DiskRowSet (small chunk of rows) that
only a single value in the IN-list can be present. In that case the IN-list
can convert, locally, to an equality predicate which may be satisfied by a
range scan or skipped entirely. I added this note to
https://issues.apache.org/jira/browse/KUDU-1644

Thanks
Todd

On Tue, Jun 25, 2019 at 9:24 PM Sergey Olontsev  wrote:

> Could anyone help me find out more about how InList predicates work?
>
> I have a bunch of values (sometimes just a couple or a few, but
> potentially tens or hundreds), and I have a table whose primary key is the
> column I'm searching on, with hash partitioning.
>
> And I've noticed that several separate searches by primary key with a
> Comparison predicate usually work faster than one search with an InList
> predicate. I'm looking at the Scanners information in the GUI and see that
> with a Comparison predicate my app reads only 1 block and it takes
> milliseconds, but with an InList predicate it reads ~1.6 blocks several
> times (scanning with a batch of 1 million rows) and each scanner takes
> about 1-1.5 seconds to complete.
>
> So I really need more information about how exactly InList predicates are
> implemented and behave. Could anyone provide any links? Unfortunately, I
> was unable to find much information, only a few JIRA tasks, but those
> didn't help.
>
> https://issues.apache.org/jira/browse/KUDU-2853
> https://issues.apache.org/jira/browse/KUDU-1644
>
> Best regards, Sergey.
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: WAL size estimation

2019-06-19 Thread Todd Lipcon
On Wed, Jun 19, 2019 at 12:49 AM Pavel Martynov  wrote:

> Hi Todd, thanks for the answer!
>
> > Any chance you've done something like copy the files away and back that
> might cause them to lose their sparseness?
>
> No, I don't think so. Recently we experienced some stability problems with
> Kudu and ran the rebalancer a couple of times, if this is related. But we
> never used fs commands like cp/mv against Kudu dirs.
>
> I ran du on all-WALs dir:
> # du -sh /mnt/data01/kudu-tserver-wal/
> 12G /mnt/data01/kudu-tserver-wal/
>
> # du -sh --apparent-size /mnt/data01/kudu-tserver-wal/
> 25G /mnt/data01/kudu-tserver-wal/
>
> And on a WAL dir with many index files:
> # du -sh --apparent-size
> /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f
> 306M/mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f
>
> # du -sh /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f
> 296M/mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f
>
>
> > Also, any chance you're using XFS here?
>
> Yes, exactly XFS. We use CentOS 7.6.
>
> What is interesting is that there are not many holes in the index files in
> /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f (the WAL
> dir I mentioned before). Only a single hole in a single index file (of 13 files):
> # xfs_bmap -v index.00120
>

Try adding the '-p' flag here? That should show preallocated extents. Would
be interesting to run it on some index file which is larger than 1MB, for
example.


> index.00120:
>  EXT: FILE-OFFSET      BLOCK-RANGE              AG AG-OFFSET          TOTAL
>0: [0..4231]:   1176541248..1176545479  2 (4429888..4434119)  4232
>1: [4232..9815]:1176546592..1176552175  2 (4435232..4440815)  5584
>2: [9816..11583]:   1176552832..1176554599  2 (4441472..4443239)  1768
>3: [11584..13319]:  1176558672..1176560407  2 (4447312..4449047)  1736
>4: [13320..15239]:  1176565336..1176567255  2 (4453976..4455895)  1920
>5: [15240..17183]:  1176570776..1176572719  2 (4459416..4461359)  1944
>6: [17184..18999]:  1176575856..1176577671  2 (4464496..4466311)  1816
>7: [19000..20927]:  1176593552..1176595479  2 (4482192..4484119)  1928
>8: [20928..22703]:  1176599128..1176600903  2 (4487768..4489543)  1776
>9: [22704..24575]:  1176602704..1176604575  2 (4491344..4493215)  1872
>   10: [24576..26495]:  1176611936..1176613855  2 (4500576..4502495)  1920
>   11: [26496..26655]:  1176615040..1176615199  2 (4503680..4503839)   160
>   12: [26656..46879]:  hole 20224
>
> But in some other WAL I see like this:
> # xfs_bmap -v
> /mnt/data01/kudu-tserver-wal/wals/508ecdfa8904bdb97a02078a91822af/index.0
>
> /mnt/data01/kudu-tserver-wal/wals/508ecdfa89054bdb97a02078a91822af/index.0:
>  EXT: FILE-OFFSET      BLOCK-RANGE              AG AG-OFFSET        TOTAL
>0: [0..7]:  1758753776..1758753783  3 (586736..586743) 8
>1: [8..46879]:  hole   46872
>
> Looks like only 8 blocks are actually used and all the other blocks are in
> the hole.
>
>
> So it looks like I can use the formulas with confidence.
> Worst case: 8 MB/segment * 80 max segments * 2000 tablets = 1,280,000 MB
> = ~1.3 TB (+ some minor index overhead)
> Best case: 8 MB/segment * 1 segment * 2000 tablets = 16,000 MB = ~16
> GB (+ some minor index overhead)
>
> Right?
>
>
> On Wed, Jun 19, 2019 at 09:35, Todd Lipcon wrote:
>
>> Hi Pavel,
>>
>> That's not quite expected. For example, on one of our test clusters here,
>> we have about 65GB of WALs and about 1GB of index files. If I recall
>> correctly, the index files store 8 bytes per WAL entry, so typically a
>> couple orders of magnitude smaller than the WALs themselves.
>>
>> One thing is that the index files are sparse. Any chance you've done
>> something like copy the files away and back that might cause them to lose
>> their sparseness? If I use du --apparent-size on mine, it's total of about
>> 180GB vs the 1GB of actual size.
>>
>> Also, any chance you're using XFS here? XFS sometimes likes to
>> preallocate large amounts of data into files while they're open, and only
>> frees it up if disk space is contended. I think you can use 'xfs_bmap' on
>> an index file to see the allocation status, which might be interesting.
>>
>> -Todd
>>
>> On Tue, Jun 18, 2019 at 11:12 PM Pavel Martynov 
>> wrote:
>>
>>> Hi guys!
>>>
>>> We want to buy SSDs for TServers WALs for our cluster. I'm working on
>>> capacity estimation for this SSDs using "Getting Started with Kudu" book,
&

Re: Kudu CLI tool JSON format

2019-06-11 Thread Todd Lipcon
I guess the issue is that we use rapidjson's 'String' support to write out
C++ strings, which are binary data, not valid UTF8. That's somewhat
incorrect of us, and we should be base64-encoding such binary data.

Fixing this is a little bit incompatible, but for something like partition
keys I think we probably should do it anyway and release note it,
considering partition keys are quite likely to be invalid UTF8.
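
For reference, the change on the emitting side amounts to wrapping the raw
bytes in base64 before they go into a JSON string. A minimal illustration in
Java (the field name and byte values here are made up):

import java.util.Arrays;
import java.util.Base64;

public class Base64JsonField {
  public static void main(String[] args) {
    // Raw partition-key bytes are frequently not valid UTF-8 (note the 0x80,
    // as in the decode error Pavel hit); these example bytes are made up.
    byte[] partitionKeyStart = new byte[] {(byte) 0x80, 0x00, 0x00, 0x01};

    // Base64 output is plain ASCII, so it survives any JSON/UTF-8 decoder.
    String encoded = Base64.getEncoder().encodeToString(partitionKeyStart);
    System.out.println("\"partition_key_start\": \"" + encoded + "\"");

    // Consumers get the original bytes back by decoding.
    byte[] decoded = Base64.getDecoder().decode(encoded);
    System.out.println(Arrays.equals(decoded, partitionKeyStart));  // true
  }
}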

-Todd

On Tue, Jun 11, 2019 at 6:08 AM Pavel Martynov  wrote:

> Hi, guys!
>
> We are trying to use the output of "kudu cluster ksck master -ksck_format
> json_compact" for integration with our monitoring system and hit something a
> little strange. Some parts of the output can't be read as UTF-8 with Python 3:
> $ kudu cluster ksck master -ksck_format json_compact > kudu.json
> $ python
> with open('kudu.json', mode='rb') as file:
>   bs = file.read()
>   bs.decode('utf-8')
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position
> 705196: invalid start byte
>
> Here is how SublimeText shows this block of text:
> https://yadi.sk/i/4zpWKZ37iP8OEA
> As you can see, the kudu tool encodes zero bytes as \u0000, but doesn't encode
> some other non-text bytes.
>
> What do you think about it?
>
> --
> with best regards, Pavel Martynov
>


-- 
Todd Lipcon
Software Engineer, Cloudera


[ANNOUNCE] Welcoming Yingchun Lai as a Kudu committer and PMC member

2019-06-05 Thread Todd Lipcon
Hi Kudu community,

I'm happy to announce that the Kudu PMC has voted to add Yingchun Lai as a
new committer and PMC member.

Yingchun has been contributing to Kudu for the last 6-7 months and
contributed a number of bug fixes, improvements, and features, including:
- new CLI tools (eg 'kudu table scan', 'kudu table copy')
- fixes for compilation warnings, code cleanup, and usability improvements
on the web UI
- support for prioritization of tables for maintenance manager tasks
- CLI support for config files to make it easier to connect to multi-master
clusters

Yingchun has also been contributing by helping new users on Slack, and
helps operate 6 production clusters at Xiaomi, one of our larger
installations in China.

Please join me in congratulating Yingchun!

-Todd


Re: problems with impala+kudu

2019-05-17 Thread Todd Lipcon
This sounds like an error coming from the Impala layer -- Kudu doesn't use
Thrift for communication internally. At first glance it looks like one of
your Impala daemons may have crashed while trying to execute the query, so
I'd look around for impalad logs indicating that. If you need further
assistance I think the impala user mailing list may be able to help more.

Thanks
Todd

On Fri, May 17, 2019 at 8:51 AM 林锦明  wrote:

> Dear friends,
>
>   When I try to insert data into a Kudu table via Impala SQL, I get this
> exception: TSocket read 0 bytes (code THRIFTTRANSPORT):
> TTransportException('TSocket read 0 bytes',).
> Could you please tell me how to deal with this problem? By the way, Kudu
> was installed from an RPM; the related URL is:
> https://github.com/MartinWeindel/kudu-rpm.
>
>
> Best wishes.
> yours truly,
> Jack Lin
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: "broadcast" tablet replication for kudu?

2019-04-24 Thread Todd Lipcon
Hey Boris,

Sorry to say that the situation is still the same.

-Todd

On Wed, Apr 24, 2019 at 9:02 AM Boris Tyukin  wrote:

> sorry to revive the old thread but curious if there is a better solution a
> year later... We have a few small tables (under 300k rows) which are used
> with practically every single query and, to make things worse, joined more
> than once in the same query.
>
> Is there a way to replicate this table on every node to improve
> performance and avoid broadcasting this table every time?
>
> On Mon, Jul 23, 2018 at 10:52 AM Todd Lipcon  wrote:
>
>>
>>
>> On Mon, Jul 23, 2018, 7:21 AM Boris Tyukin  wrote:
>>
>>> Hi Todd,
>>>
>>> Are you saying that your earlier comment below is not longer valid with
>>> Impala 2.11 and if I replicate a table to all our Kudu nodes Impala can
>>> benefit from this?
>>>
>>
>> No, the earlier comment is still valid. Just saying that in some cases
>> exchange can be faster in the new Impala version.
>>
>>
>>> "
>>> *It's worth noting that, even if your table is replicated, Impala's
>>> planner is unaware of this fact and it will give the same plan regardless.
>>> That is to say, rather than every node scanning its local copy, instead a
>>> single node will perform the whole scan (assuming it's a small table) and
>>> broadcast it from there within the scope of a single query. So, I don't
>>> think you'll see any performance improvements on Impala queries by
>>> attempting something like an extremely high replication count.*
>>>
>>> *I could see bumping the replication count to 5 for these tables since
>>> the extra storage cost is low and it will ensure higher availability of the
>>> important central tables, but I'd be surprised if there is any measurable
>>> perf impact.*
>>> "
>>>
>>> On Mon, Jul 23, 2018 at 9:46 AM Todd Lipcon  wrote:
>>>
>>>> Are you on the latest release of Impala? It switched from using Thrift
>>>> for RPC to a new implementation (actually borrowed from kudu) which might
>>>> help broadcast performance a bit.
>>>>
>>>> Todd
>>>>
>>>> On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin 
>>>> wrote:
>>>>
>>>>> sorry to revive the old thread but I am curious if there is a good way
>>>>> to speed up requests to frequently used tables in Kudu.
>>>>>
>>>>> On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin 
>>>>> wrote:
>>>>>
>>>>>> bummer..After reading your guys conversation, I wish there was an
>>>>>> easier way...we will have the same issue as we have a few dozens of 
>>>>>> tables
>>>>>> which are used very frequently in joins and I was hoping there was an 
>>>>>> easy
>>>>>> way to replicate them on most of the nodes to avoid broadcasts every time
>>>>>>
>>>>>> On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick <
>>>>>> cresn...@mediamath.com> wrote:
>>>>>>
>>>>>>> The table in our case is 12x hashed and ranged by month, so the
>>>>>>> broadcasts were often to all (12) nodes.
>>>>>>>
>>>>>>> On Apr 12, 2018 12:58 AM, Mauricio Aristizabal 
>>>>>>> wrote:
>>>>>>> Sorry I left that out Cliff, FWIW it does seem to have been
>>>>>>> broadcast..
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Not sure though how a shuffle would be much different from a
>>>>>>> broadcast if entire table is 1 file/block in 1 node.
>>>>>>>
>>>>>>> On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> From the screenshot it does not look like there was a broadcast of
>>>>>>>> the dimension table(s), so it could be the case here that the multiple
>>>>>>>> smaller sends helps. Our dim tables are generally in the single-digit
>>>>>>>> millions and Impala chooses to broadcast them. Since the fact result
>>>>>>>> cardinality is always much smaller, we've found that forcing a 
>>>>>>>> [shuffle]
>>>>>>>> dimension join is actually faster since it only sends dims once rather 
>>>>>>>> th

Re: trying to install kudu from source

2018-12-10 Thread Todd Lipcon
make/kuduClientConfig.cmake
> -- Munging kudu client targets in
> /usr/share/kuduClient/cmake/kuduClientTargets-release.cmake
> -- Munging kudu client targets in
> /usr/share/kuduClient/cmake/kuduClientTargets.cmake
> .
>
>
> this is my install/docker script so far:
> FROM debian:latest
> # Install dependencies
> RUN apt-get update && export DEBIAN_FRONTEND=noninteractive && \
> apt-get -y install apt-utils \
> aptitude \
> autoconf \
> automake \
> curl \
> dstat \
> emacs24-nox \
> flex \
> g++ \
> gcc \
> gdb \
> git \
> krb5-admin-server \
> krb5-kdc \
> krb5-user \
> libkrb5-dev \
> libsasl2-dev \
> libsasl2-modules \
> libsasl2-modules-gssapi-mit \
> libssl-dev \
> libtool \
> lsb-release \
> make \
> ntp \
> net-tools \
> openjdk-8-jdk \
> openssl \
> patch \
> python-dev \
> python-pip \
> python3-dev \
> python3 \
> python3-pip \
> pkg-config \
> python \
> rsync \
> unzip \
> vim-common \
> wget
>
> #Install Kudu
> #RUN git clone https://github.com/apache/kudu
> WORKDIR /
> RUN wget
> https://www-us.apache.org/dist/kudu/1.8.0/apache-kudu-1.8.0.tar.gz
> RUN mkdir -p /kudu && tar -xzf apache-kudu-1.8.0.tar.gz  -C /kudu
> --strip-components=1
> RUN ls /
>
> RUN cd /kudu \
> && thirdparty/build-if-necessary.sh
> RUN cd /kudu &&  mkdir -p build/release \
> && cd /kudu/build/release \
> && ../../thirdparty/installed/common/bin/cmake -DCMAKE_BUILD_TYPE=release
> -DCMAKE_INSTALL_PREFIX:PATH=/usr ../.. \
> && make -j4
>
> RUN cd /kudu/build/release \
> && make install
>
>
>
>
>
>
>

-- 
Todd Lipcon
Software Engineer, Cloudera


Re: strange behavior of getPendingErrors

2018-11-17 Thread Todd Lipcon
Hey Alexey,

I think your explanation makes sense from an implementation perspective.
But, I think we should treat this behavior as a bug. From the user
perspective, such an error is a per-row data issue and should only affect
the row with the problem, not some arbitrary subset of rows in the batch
which happened to share a partition.

Does anyone disagree?
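
For anyone following along, here's roughly how the per-batch errors surface
in the Java client today; a rough sketch only (the master address is a
placeholder, and the table/column names follow Boris's example):

import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.RowError;
import org.apache.kudu.client.RowErrorsAndOverflowStatus;
import org.apache.kudu.client.SessionConfiguration;

public class PendingErrorsExample {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("master-host:7051").build();
    KuduTable table = client.openTable("kudu_groovy_example");
    KuduSession session = client.newSession();
    session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);

    for (int key = 1; key <= 10; key++) {
      Insert insert = table.newInsert();
      PartialRow row = insert.getRow();
      row.addInt("key", key);
      // Leave the non-nullable dt_tm column unset for key == 2 to provoke an error.
      if (key != 2) {
        row.addLong("dt_tm", System.currentTimeMillis() * 1000);  // micros
      }
      session.apply(insert);
    }

    // flush() waits for the background batches to complete.
    session.flush();

    // With the batch-level rejection described above, rows that shared a batch
    // (and tablet) with the bad row can show up here too, not just key == 2.
    RowErrorsAndOverflowStatus pending = session.getPendingErrors();
    if (pending.isOverflowed()) {
      System.err.println("error collector overflowed; some errors were dropped");
    }
    for (RowError error : pending.getRowErrors()) {
      System.err.println("Failed row: " + error.getOperation().getRow()
          + " -> " + error.getErrorStatus());
    }
    session.close();
    client.close();
  }
}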

Todd

On Fri, Nov 16, 2018, 9:28 PM Alexey Serbin wrote:

> Hi Boris,
>
> Kudu clients (both Java and C++ ones) send write operations to
> corresponding tablet servers in batches when using the AUTO_FLUSH_BACKGROUND
> and MANUAL_FLUSH modes.  When a tablet server receives a Write RPC
> (WriteRequestPB is the corresponding type of the parameter), it decodes the
> operations from the batch:
> https://github.com/apache/kudu/blob/master/src/kudu/tablet/local_tablet_writer.h#L97
>
> While decoding operations from a batch, various constraints are being
> checked.  One of those is checking for nulls in non-nullable columns.  If
> there is a row in the batch that violates the non-nullable constraint, the
> whole batch is rejected.
>
> That's exactly what happened in your example: a batch to one tablet
> consisted of 3 rows one of which had a row with violation of the
> non-nullable constraint for the dt_tm column, so the whole batch of 3
> operations was rejected.  You can play with different partition schemes:
> e.g., in case of 10 hashed partitions it might happen that only 2
> operations would be rejected, in case of 30 partitions -- just the single
> key==2 row could be rejected.
>
> BTW, that might also happen if using the MANUAL_FLUSH mode.  However, with
> the AUTO_FLUSH_SYNC mode, the client sends operations in batches of size 1.
>
>
> Kind regards,
>
> Alexey
>
> On Fri, Nov 16, 2018 at 7:24 PM Boris Tyukin 
> wrote:
>
>> Hi Todd,
>>
>> We are on Kudu 1.5 still and I used Kudu client 1.7
>>
>> Thanks,
>> Boris
>>
>> On Fri, Nov 16, 2018, 17:07 Todd Lipcon wrote:
>>
>>> Hi Boris,
>>>
>>> This is interesting. Just so we're looking at the same code, what
>>> version of the kudu-client dependency have you specified, and what version
>>> of the server?
>>>
>>> -Todd
>>>
>>> On Fri, Nov 16, 2018 at 1:12 PM Boris Tyukin 
>>> wrote:
>>>
>>>> Hey guys,
>>>>
>>>> I am playing with Kudu Java client (wow it is fast), using mostly code
>>>> from Kudu Java example.
>>>>
>>>> While learning about exceptions during rows inserts, I stumbled upon
>>>> something I could not explain.
>>>>
>>>> If I insert 10 rows into a brand new Kudu table (AUTO_FLUSH_BACKGROUND
>>>> mode) and I make one row to be "bad" intentionally (one column cannot be
>>>> NULL), I actually get 3 rows that cannot be inserted into Kudu, not 1 as I
>>>> expected.
>>>>
>>>> But if I do session.flush() after every single insert, I get only one
>>>> error row (but this ruins the purpose of AUTO_FLUSH_BACKGROUND mode).
>>>>
>>>> Any ideas on this one? We cannot afford to lose data and need to track all
>>>> rows which cannot be inserted.
>>>>
>>>> AUTO_FLUSH mode works much better and I do not have an issue like
>>>> above, but then it is way slower than AUTO_FLUSH_BACKGROUND.
>>>>
>>>> My code is below. It is in Groovy, but I think you will get an idea :)
>>>> https://gist.github.com/boristyukin/8703d2c6ec55d6787843aa133920bf01
>>>>
>>>> Here is output from my test code that hopefully illustrates my
>>>> confusion - out of 10 rows inserted, 9 should be good and 1 bad, but it
>>>> turns out Kudu flagged 3 as bad:
>>>>
>>>> Created table kudu_groovy_example
>>>> Inserting 10 rows in AUTO_FLUSH_BACKGROUND flush mode ...
>>>> (int32 key=1, string value="value 1", unixtime_micros
>>>> dt_tm=2018-11-16T20:57:03.469000Z)
>>>> (int32 key=2, string value=NULL)  BAD ROW
>>>> (int32 key=3, string value="value 3", unixtime_micros
>>>> dt_tm=2018-11-16T20:57:03.595000Z)
>>>> (int32 key=4, string value=NULL, unixtime_micros
>>>> dt_tm=2018-11-16T20:57:03.596000Z)
>>>> (int32 key=5, string value="value 5", unixtime_micros
>>>> dt_tm=2018-11-16T20:57:03.597000Z)
>>>> (int32 key=6, string value=NULL, unixtime_micros
>>>> dt_tm=2018-11-16T20:57:03.597000Z)
>>>> (int32 key=7, string value="value 7", unixtime_micros
>>>> dt_tm=2018-11-1

Re: strange behavior of getPendingErrors

2018-11-16 Thread Todd Lipcon
Hi Boris,

This is interesting. Just so we're looking at the same code, what version
of the kudu-client dependency have you specified, and what version of the
server?

-Todd

On Fri, Nov 16, 2018 at 1:12 PM Boris Tyukin  wrote:

> Hey guys,
>
> I am playing with Kudu Java client (wow it is fast), using mostly code
> from Kudu Java example.
>
> While learning about exceptions during rows inserts, I stumbled upon
> something I could not explain.
>
> If I insert 10 rows into a brand new Kudu table (AUTO_FLUSH_BACKGROUND
> mode) and I make one row to be "bad" intentionally (one column cannot be
> NULL), I actually get 3 rows that cannot be inserted into Kudu, not 1 as I
> expected.
>
> But if I do session.flush() after every single insert, I get only one
> error row (but this ruins the purpose of AUTO_FLUSH_BACKGROUND mode).
>
> Any ideas on this one? We cannot afford to lose data and need to track all
> rows which cannot be inserted.
>
> AUTO_FLUSH mode works much better and I do not have an issue like above,
> but then it is way slower than AUTO_FLUSH_BACKGROUND.
>
> My code is below. It is in Groovy, but I think you will get an idea :)
> https://gist.github.com/boristyukin/8703d2c6ec55d6787843aa133920bf01
>
> Here is output from my test code that hopefully illustrates my confusion -
> out of 10 rows inserted, 9 should be good and 1 bad, but it turns out Kudu
> flagged 3 as bad:
>
> Created table kudu_groovy_example
> Inserting 10 rows in AUTO_FLUSH_BACKGROUND flush mode ...
> (int32 key=1, string value="value 1", unixtime_micros
> dt_tm=2018-11-16T20:57:03.469000Z)
> (int32 key=2, string value=NULL)  BAD ROW
> (int32 key=3, string value="value 3", unixtime_micros
> dt_tm=2018-11-16T20:57:03.595000Z)
> (int32 key=4, string value=NULL, unixtime_micros
> dt_tm=2018-11-16T20:57:03.596000Z)
> (int32 key=5, string value="value 5", unixtime_micros
> dt_tm=2018-11-16T20:57:03.597000Z)
> (int32 key=6, string value=NULL, unixtime_micros
> dt_tm=2018-11-16T20:57:03.597000Z)
> (int32 key=7, string value="value 7", unixtime_micros
> dt_tm=2018-11-16T20:57:03.598000Z)
> (int32 key=8, string value=NULL, unixtime_micros
> dt_tm=2018-11-16T20:57:03.602000Z)
> (int32 key=9, string value="value 9", unixtime_micros
> dt_tm=2018-11-16T20:57:03.603000Z)
> (int32 key=10, string value=NULL, unixtime_micros
> dt_tm=2018-11-16T20:57:03.603000Z)
> 3 errors inserting rows - why 3 only 1 expected to be bad...
> there were errors inserting rows to Kudu
> the first few errors follow:
> ??? key 1 and 6 supposed to be fine!
> Row error for primary key=[-128, 0, 0, 1], tablet=null, server=null,
> status=Invalid argument: No value provided for required column:
> dt_tm[unixtime_micros NOT NULL] (error 0)
> Row error for primary key=[-128, 0, 0, 2], tablet=null, server=null,
> status=Invalid argument: No value provided for required column:
> dt_tm[unixtime_micros NOT NULL] (error 0)
> Row error for primary key=[-128, 0, 0, 6], tablet=null, server=null,
> status=Invalid argument: No value provided for required column:
> dt_tm[unixtime_micros NOT NULL] (error 0)
> Rows counted in 485 ms
> Table has 7 rows - ??? supposed to be 9!
> INT32 key=4, STRING value=NULL, UNIXTIME_MICROS
> dt_tm=2018-11-16T20:57:03.596000Z
> INT32 key=8, STRING value=NULL, UNIXTIME_MICROS
> dt_tm=2018-11-16T20:57:03.602000Z
> INT32 key=9, STRING value=value 9, UNIXTIME_MICROS
> dt_tm=2018-11-16T20:57:03.603000Z
> INT32 key=3, STRING value=value 3, UNIXTIME_MICROS
> dt_tm=2018-11-16T20:57:03.595000Z
> INT32 key=10, STRING value=NULL, UNIXTIME_MICROS
> dt_tm=2018-11-16T20:57:03.603000Z
> INT32 key=5, STRING value=value 5, UNIXTIME_MICROS
> dt_tm=2018-11-16T20:57:03.597000Z
> INT32 key=7, STRING value=value 7, UNIXTIME_MICROS
> dt_tm=2018-11-16T20:57:03.598000Z
>
>
>
>
>

-- 
Todd Lipcon
Software Engineer, Cloudera


Re: cannot import kudu.client

2018-08-31 Thread Todd Lipcon
Do you happen to have a directory called 'kudu' in your working directory?
Sometimes python gets confused and imports something you didn't expect. The
output of 'kudu.__file__' might give you a clue.

-Todd

On Fri, Aug 31, 2018 at 3:27 PM, veto  wrote:

> I installed and compiled Kudu successfully on jessie and stretch, and used
> Docker on CentOS and Ubuntu.
>
> On all of them I installed Python 2.7 and pip, and installed
> kudu-python==1.7.1 and 1.2.0 successfully.
>
> I could successfully import kudu, but it fails to import kudu.client
>
> here is the log:
>
>
> (env) root@boot2docker:~/kudu# python
> Python 2.7.12 (default, Dec  4 2017, 14:50:18)
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import kudu
> >>> import kudu.client
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ImportError: No module named client
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Dictionary encoding

2018-08-06 Thread Todd Lipcon
Hi Saeid,

It's not based on the number of distinct values, but rather on the combined
size of the values. I believe the default is 256kb, so assuming your
strings are pretty short, a few thousand are likely to be able to be
dict-encoded. Note that dictionaries are calculated per-rowset (small chunk
of data) so even if your overall cardinality is much larger, if you have
some spatial locality such that rows with nearby primary keys have fewer
distinct values, then you're likely to get benefit here.
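
If you want to be explicit about it, the Java client lets you request
dictionary encoding per column when building the schema; a small sketch (the
column name is made up). Within a rowset, if the dictionary grows past the
size limit, Kudu falls back to plain encoding on its own:

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Type;

public class DictEncodedColumn {
  public static void main(String[] args) {
    // Request dictionary encoding for a string column at table-creation time.
    ColumnSchema city = new ColumnSchema.ColumnSchemaBuilder("city", Type.STRING)
        .encoding(ColumnSchema.Encoding.DICT_ENCODING)
        .build();
    System.out.println(city);
  }
}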

-Todd

On Sat, Aug 4, 2018 at 8:10 AM, Saeid Sattari 
wrote:

> Hi Kudu community,
>
> Does anybody know the maximum number of distinct values of a String column
> that Kudu considers in order to set its encoding to Dictionary? Many thanks
> :)
>
> br,
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Re: Recommended maximum amount of stored data per tablet server

2018-08-02 Thread Todd Lipcon
On Thu, Aug 2, 2018 at 4:54 PM, Quanlong Huang 
wrote:

> Thank Adar and Todd! We'd like to contribute when we could.
>
> Are there any concerns if we share the machines with HDFS DataNodes and
> Yarn NodeManagers? The network bandwidth is 10Gbps. I think it's ok if they
> don't share the same disks, e.g. 4 disks for kudu and the other 11 disks
> for DataNode and NodeManager, and leave enough CPU & mem for kudu. Is that
> right?
>

That should be fine. Typically we actually recommend sharing all the disks
for all of the services.
(exclusive access to a smaller number of disks) vs dynamic sharing
(potential contention but more available resources). Unless your workload
is very latency sensitive I usually think it's better to have the bigger
pool of resources available even if it needs to share with other systems.

One recommendation, though, is to consider using a dedicated disk for the
Kudu WAL and metadata, which can help performance, since the WAL can be
sensitive to other heavy workloads monopolizing bandwidth on the same
spindle.

-Todd

>
> At 2018-08-03 02:26:37, "Todd Lipcon"  wrote:
>
> +1 to what Adar said.
>
> One tension we have currently for scaling is that we don't want to scale
> individual tablets too large, because of problems like the superblock that
> Adar mentioned. However, the solution of just having more tablets is also
> not a great one, since many of our startup time problems are primarily
> affected by the number of tablets more than their size (see KUDU-38 as the
> prime, ancient, example). Additionally, having lots of tablets increases
> raft heartbeat traffic and may need to dial back those heartbeat intervals
> to keep things stable.
>
> All of these things can be addressed in time and with some work. If you
> are interested in working on these areas to improve density that would be a
> great contribution.
>
> -Todd
>
>
>
> On Thu, Aug 2, 2018 at 11:17 AM, Adar Lieber-Dembo 
> wrote:
>
>> The 8TB limit isn't a hard one, it's just a reflection of the scale
>> that Kudu developers commonly test. Beyond 8TB we can't vouch for
>> Kudu's stability and performance. For example, we know that as the
>> amount of on-disk data grows, node restart times get longer and longer
>> (see KUDU-2014 for some ideas on how to improve that). Furthermore, as
>> tablets accrue more data blocks, their superblocks become larger,
>> raising the minimum amount of I/O for any operation that rewrites a
>> superblock (such as a flush or compaction). Lastly, the tablet copy
>> protocol used in rereplication tries to copy the entire superblock in
>> one RPC message; if the superblock is too large, it'll run up against
>> the default 50 MB RPC transfer size (see src/kudu/rpc/transfer.cc).
>>
>> These examples are just off the top of my head; there may be others
>> lurking. So this goes back to what I led with: beyond the recommended
>> limit we aren't quite sure how Kudu's performance and stability are
>> affected.
>>
>> All that said, you're welcome to try it out and report back with your
>> findings.
>>
>>
>> On Thu, Aug 2, 2018 at 7:23 AM Quanlong Huang 
>> wrote:
>> >
>> > Hi all,
>> >
>> > In the document of "Known Issues and Limitations", it's recommended
>> that "maximum amount of stored data, post-replication and post-compression,
>> per tablet server is 8TB". How is the 8TB calculated?
>> >
>> > We have some machines each with 15 * 4TB spinning disk drives and 256GB
>> RAM, 48 cpu cores. Does it mean the other 52(= 15 * 4 - 8) TB space is
>> recommended to leave for other systems? We prefer to make the machine
>> dedicated to Kudu. Can tablet server leverage the whole space efficiently?
>> >
>> > Thanks,
>> > Quanlong
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Recommended maximum amount of stored data per tablet server

2018-08-02 Thread Todd Lipcon
+1 to what Adar said.

One tension we have currently for scaling is that we don't want to scale
individual tablets too large, because of problems like the superblock that
Adar mentioned. However, the solution of just having more tablets is also
not a great one, since many of our startup time problems are primarily
affected by the number of tablets more than their size (see KUDU-38 as the
prime, ancient, example). Additionally, having lots of tablets increases
raft heartbeat traffic and may need to dial back those heartbeat intervals
to keep things stable.

All of these things can be addressed in time and with some work. If you are
interested in working on these areas to improve density that would be a
great contribution.

-Todd



On Thu, Aug 2, 2018 at 11:17 AM, Adar Lieber-Dembo 
wrote:

> The 8TB limit isn't a hard one, it's just a reflection of the scale
> that Kudu developers commonly test. Beyond 8TB we can't vouch for
> Kudu's stability and performance. For example, we know that as the
> amount of on-disk data grows, node restart times get longer and longer
> (see KUDU-2014 for some ideas on how to improve that). Furthermore, as
> tablets accrue more data blocks, their superblocks become larger,
> raising the minimum amount of I/O for any operation that rewrites a
> superblock (such as a flush or compaction). Lastly, the tablet copy
> protocol used in rereplication tries to copy the entire superblock in
> one RPC message; if the superblock is too large, it'll run up against
> the default 50 MB RPC transfer size (see src/kudu/rpc/transfer.cc).
>
> These examples are just off the top of my head; there may be others
> lurking. So this goes back to what I led with: beyond the recommended
> limit we aren't quite sure how Kudu's performance and stability are
> affected.
>
> All that said, you're welcome to try it out and report back with your
> findings.
>
>
> On Thu, Aug 2, 2018 at 7:23 AM Quanlong Huang 
> wrote:
> >
> > Hi all,
> >
> > In the document of "Known Issues and Limitations", it's recommended that
> "maximum amount of stored data, post-replication and post-compression, per
> tablet server is 8TB". How is the 8TB calculated?
> >
> > We have some machines each with 15 * 4TB spinning disk drives and 256GB
> RAM, 48 cpu cores. Does it mean the other 52(= 15 * 4 - 8) TB space is
> recommended to leave for other systems? We prefer to make the machine
> dedicated to Kudu. Can tablet server leverage the whole space efficiently?
> >
> > Thanks,
> > Quanlong
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Re: Re: Why RowSet size is much smaller than flush_threshold_mb

2018-08-01 Thread Todd Lipcon
On Wed, Aug 1, 2018 at 4:52 PM, Quanlong Huang 
wrote:

> In my experience, when I find the performance is below my expectation, I
> like to tune the flags listed in https://kudu.apache.org/
> docs/configuration_reference.html , which needs a clear understanding of
> kudu internals. Maybe we can add the link there?
>
>
Any particular flags that you found you had to tune? I almost never advise
tuning anything other than the number of maintenance threads. If you have
some good guidance on how tuning those flags can improve performance, maybe
we can consider changing the defaults or giving some more prescriptive
advice?

I'm a little nervous that saying "here are all the internals, and here are
100 config flags to study" will scare users more than help them :)

-Todd


>
> At 2018-08-02 01:06:40,"Todd Lipcon"  wrote:
>
> On Wed, Aug 1, 2018 at 6:28 AM, Quanlong Huang 
> wrote:
>
>> Hi Todd and William,
>>
>> I really appreciate your help and am sorry for my late reply. I was going
>> to reply with some follow-up questions but was assigned to focus on some
>> other work... Now I'm back to this work.
>>
>> The design docs are really helpful. Now I understand the flush and
>> compaction. I think we can add a link to these design docs in the kudu
>> documentation page, so users who want to dig deeper can know more about
>> kudu internals.
>>
>
> Personally, since starting the project, I have had the philosophy that the
> user-facing documentation should remain simple and not discuss internals
> too much. I found in some other open source projects that there isn't a
> clear difference between user documentation and developer documentation,
> and users can easily get confused by all of the internal details. Or, users
> may start to believe that Kudu is very complex and they need to understand
> knapsack problem approximation algorithms in order to operate it. So,
> normally we try to avoid exposing too much of the details.
>
> That said, I think it is a good idea to add a small note in the
> documentation somewhere that links to the design docs, maybe with some
> sentence explaining that understanding internals is not necessary to
> operate Kudu, but that expert users may find the internal design useful as
> a reference? I would be curious to hear what other users think about how
> best to make this trade-off.
>
> -Todd
>
>
>> At 2018-06-15 23:41:17, "Todd Lipcon"  wrote:
>>
>> Also, keep in mind that when the MRS flushes, it flushes into a bunch of
>> separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by
>> default). This is set by --budgeted_compaction_target_rowset_size
>>
>> However, increasing this size isn't likely to decrease the number of
>> compactions, because each of these 32MB rowsets is non-overlapping. In
>> other words, if your MRS contains rows A-Z, the output RowSets will include
>> [A-C], [D-G], [H-P], [Q-Z]. Since these ranges do not overlap, they will
>> never need to be compacted with each other. The net result, here, is that
>> compaction becomes more fine-grained and only needs to operate on
>> sub-ranges of the tablet where there is a lot of overlap.
>>
>> You can read more about this in docs/design-docs/compaction-policy.md,
>> in particular the section "Limiting RowSet Sizes"
>>
>> Hope that helps
>> -Todd
>>
>> On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley 
>> wrote:
>>
>>> The op seen in the logs is a rowset compaction, which takes existing
>>> diskrowsets and rewrites them. It's not a flush, which writes data in
>>> memory to disk, so I don't think the flush_threshold_mb is relevant. Rowset
>>> compaction is done to reduce the amount of overlap of rowsets in primary
>>> key space, i.e. reduce the number of rowsets that might need to be checked
>>> to enforce the primary key constraint or find a row. Having lots of rowset
>>> compaction indicates that rows are being written in a somewhat random order
>>> w.r.t the primary key order. Kudu will perform much better as writes scale
>>> when rows are inserted roughly in increasing order per tablet.
>>>
>>> Also, because you are using the log block manager (the default and only
>>> one suitable for production deployments), there isn't a 1-1 relationship
>>> between cfiles or diskrowsets and files on the filesystem. Many cfiles and
>>> diskrowsets will be put together in a container file.
>>>
>>> Config parameters that might be relevant here:
>>> --maintenance_manager_num_threads
>>> --fs_data_dirs (how many)
>&

Re: Re: Why RowSet size is much smaller than flush_threshold_mb

2018-08-01 Thread Todd Lipcon
On Wed, Aug 1, 2018 at 6:28 AM, Quanlong Huang 
wrote:

> Hi Todd and William,
>
> I really appreciate your help and am sorry for my late reply. I was going
> to reply with some follow-up questions but was assigned to focus on some
> other work... Now I'm back to this work.
>
> The design docs are really helpful. Now I understand the flush and
> compaction. I think we can add a link to these design docs in the kudu
> documentation page, so users who want to dig deeper can know more about
> kudu internals.
>

Personally, since starting the project, I have had the philosophy that the
user-facing documentation should remain simple and not discuss internals
too much. I found in some other open source projects that there isn't a
clear difference between user documentation and developer documentation,
and users can easily get confused by all of the internal details. Or, users
may start to believe that Kudu is very complex and they need to understand
knapsack problem approximation algorithms in order to operate it. So,
normally we try to avoid exposing too much of the details.

That said, I think it is a good idea to add a small note in the
documentation somewhere that links to the design docs, maybe with some
sentence explaining that understanding internals is not necessary to
operate Kudu, but that expert users may find the internal design useful as
a reference? I would be curious to hear what other users think about how
best to make this trade-off.

-Todd


> At 2018-06-15 23:41:17, "Todd Lipcon"  wrote:
>
> Also, keep in mind that when the MRS flushes, it flushes into a bunch of
> separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by
> default). This is set by --budgeted_compaction_target_rowset_size
>
> However, increasing this size isn't likely to decrease the number of
> compactions, because each of these 32MB rowsets is non-overlapping. In
> other words, if your MRS contains rows A-Z, the output RowSets will include
> [A-C], [D-G], [H-P], [Q-Z]. Since these ranges do not overlap, they will
> never need to be compacted with each other. The net result, here, is that
> compaction becomes more fine-grained and only needs to operate on
> sub-ranges of the tablet where there is a lot of overlap.
>
> You can read more about this in docs/design-docs/compaction-policy.md, in
> particular the section "Limiting RowSet Sizes"
>
> Hope that helps
> -Todd
>
> On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley 
> wrote:
>
>> The op seen in the logs is a rowset compaction, which takes existing
>> diskrowsets and rewrites them. It's not a flush, which writes data in
>> memory to disk, so I don't think the flush_threshold_mb is relevant. Rowset
>> compaction is done to reduce the amount of overlap of rowsets in primary
>> key space, i.e. reduce the number of rowsets that might need to be checked
>> to enforce the primary key constraint or find a row. Having lots of rowset
>> compaction indicates that rows are being written in a somewhat random order
>> w.r.t the primary key order. Kudu will perform much better as writes scale
>> when rows are inserted roughly in increasing order per tablet.
>>
>> Also, because you are using the log block manager (the default and only
>> one suitable for production deployments), there isn't a 1-1 relationship
>> between cfiles or diskrowsets and files on the filesystem. Many cfiles and
>> diskrowsets will be put together in a container file.
>>
>> Config parameters that might be relevant here:
>> --maintenance_manager_num_threads
>> --fs_data_dirs (how many)
>> --fs_wal_dir (is it shared on a device with the data dir?)
>>
>> The metrics from the compact row sets op indicates the time is spent in
>> fdatasync and in reading (likely reading the original rowsets). The overall
>> compaction time is kinda long but not crazy long. What's the performance
>> you are seeing and what is the performance you would like to see?
>>
>> -Will
>>
>> On Fri, Jun 15, 2018 at 7:52 AM, Quanlong Huang 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm running kudu 1.6.0-cdh5.14.2. When looking into the logs of tablet
>>> server, I find most of the compactions are compacting small files (~40MB
>>> for each). For example:
>>>
>>> I0615 07:22:42.637351 30614 tablet.cc:1661] T
>>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>>> Compaction: stage 1 complete, picked 4 rowsets to compact
>>> I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to
>>> compact:
>>> I0615 07:22:42.637393 30614 compaction.cc:906] RowSet(343)(current size
>>> on dis

Re: "broadcast" tablet replication for kudu?

2018-07-23 Thread Todd Lipcon
On Mon, Jul 23, 2018, 7:21 AM Boris Tyukin  wrote:

> Hi Todd,
>
> Are you saying that your earlier comment below is not longer valid with
> Impala 2.11 and if I replicate a table to all our Kudu nodes Impala can
> benefit from this?
>

No, the earlier comment is still valid. Just saying that in some cases
exchange can be faster in the new Impala version.


> "
> *It's worth noting that, even if your table is replicated, Impala's
> planner is unaware of this fact and it will give the same plan regardless.
> That is to say, rather than every node scanning its local copy, instead a
> single node will perform the whole scan (assuming it's a small table) and
> broadcast it from there within the scope of a single query. So, I don't
> think you'll see any performance improvements on Impala queries by
> attempting something like an extremely high replication count.*
>
> *I could see bumping the replication count to 5 for these tables since the
> extra storage cost is low and it will ensure higher availability of the
> important central tables, but I'd be surprised if there is any measurable
> perf impact.*
> "
>
> On Mon, Jul 23, 2018 at 9:46 AM Todd Lipcon  wrote:
>
>> Are you on the latest release of Impala? It switched from using Thrift
>> for RPC to a new implementation (actually borrowed from kudu) which might
>> help broadcast performance a bit.
>>
>> Todd
>>
>> On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin  wrote:
>>
>>> sorry to revive the old thread but I am curious if there is a good way
>>> to speed up requests to frequently used tables in Kudu.
>>>
>>> On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin 
>>> wrote:
>>>
>>>> bummer..After reading your guys conversation, I wish there was an
>>>> easier way...we will have the same issue as we have a few dozens of tables
>>>> which are used very frequently in joins and I was hoping there was an easy
>>>> way to replicate them on most of the nodes to avoid broadcasts every time
>>>>
>>>> On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick <
>>>> cresn...@mediamath.com> wrote:
>>>>
>>>>> The table in our case is 12x hashed and ranged by month, so the
>>>>> broadcasts were often to all (12) nodes.
>>>>>
>>>>> On Apr 12, 2018 12:58 AM, Mauricio Aristizabal 
>>>>> wrote:
>>>>> Sorry I left that out Cliff, FWIW it does seem to have been
>>>>> broadcast..
>>>>>
>>>>>
>>>>>
>>>>> Not sure though how a shuffle would be much different from a broadcast
>>>>> if entire table is 1 file/block in 1 node.
>>>>>
>>>>> On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick 
>>>>> wrote:
>>>>>
>>>>>> From the screenshot it does not look like there was a broadcast of
>>>>>> the dimension table(s), so it could be the case here that the multiple
>>>>>> smaller sends helps. Our dim tables are generally in the single-digit
>>>>>> millions and Impala chooses to broadcast them. Since the fact result
>>>>>> cardinality is always much smaller, we've found that forcing a [shuffle]
>>>>>> dimension join is actually faster since it only sends dims once rather 
>>>>>> than
>>>>>> all to all nodes. The degenerative performance of broadcast is especially
>>>>>> obvious when the query returns zero results. I don't have much experience
>>>>>> here, but it does seem that Kudu's efficient predicate scans can 
>>>>>> sometimes
>>>>>> "break" Impala's query plan.
>>>>>>
>>>>>> -Cliff
>>>>>>
>>>>>> On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal <
>>>>>> mauri...@impact.com> wrote:
>>>>>>
>>>>>>> @Todd not to belabor the point, but when I suggested breaking up
>>>>>>> small dim tables into multiple parquet files (and in this thread's 
>>>>>>> context
>>>>>>> perhaps partition kudu table, even if small, into multiple tablets), it 
>>>>>>> was
>>>>>>> to speed up joins/exchanges, not to parallelize the scan.
>>>>>>>
>>>>>>> For example recently we ran into this slow query where the 14M
>>>>>>> record dimension fit into a single file & block, so it got scanned on a
>>&

Re: "broadcast" tablet replication for kudu?

2018-07-23 Thread Todd Lipcon
Impala 2.12. The external RPC protocol is still Thrift.

Todd

On Mon, Jul 23, 2018, 7:02 AM Clifford Resnick 
wrote:

> Is this impala 3.0? I’m concerned about breaking changes and our RPC to
> Impala is thrift-based.
>
> From: Todd Lipcon 
> Reply-To: "user@kudu.apache.org" 
> Date: Monday, July 23, 2018 at 9:46 AM
> To: "user@kudu.apache.org" 
> Subject: Re: "broadcast" tablet replication for kudu?
>
> Are you on the latest release of Impala? It switched from using Thrift for
> RPC to a new implementation (actually borrowed from kudu) which might help
> broadcast performance a bit.
>
> Todd
>
> On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin  wrote:
>
>> sorry to revive the old thread but I am curious if there is a good way to
>> speed up requests to frequently used tables in Kudu.
>>
>> On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin 
>> wrote:
>>
>>> bummer..After reading your guys conversation, I wish there was an easier
>>> way...we will have the same issue as we have a few dozens of tables which
>>> are used very frequently in joins and I was hoping there was an easy way to
>>> replicate them on most of the nodes to avoid broadcasts every time
>>>
>>> On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick <
>>> cresn...@mediamath.com> wrote:
>>>
>>>> The table in our case is 12x hashed and ranged by month, so the
>>>> broadcasts were often to all (12) nodes.
>>>>
>>>> On Apr 12, 2018 12:58 AM, Mauricio Aristizabal 
>>>> wrote:
>>>> Sorry I left that out Cliff, FWIW it does seem to have been broadcast..
>>>>
>>>>
>>>>
>>>> Not sure though how a shuffle would be much different from a broadcast
>>>> if entire table is 1 file/block in 1 node.
>>>>
>>>> On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick 
>>>> wrote:
>>>>
>>>>> From the screenshot it does not look like there was a broadcast of the
>>>>> dimension table(s), so it could be the case here that the multiple smaller
>>>>> sends helps. Our dim tables are generally in the single-digit millions and
>>>>> Impala chooses to broadcast them. Since the fact result cardinality is
>>>>> always much smaller, we've found that forcing a [shuffle] dimension join 
>>>>> is
>>>>> actually faster since it only sends dims once rather than all to all 
>>>>> nodes.
>>>>> The degenerative performance of broadcast is especially obvious when the
>>>>> query returns zero results. I don't have much experience here, but it does
>>>>> seem that Kudu's efficient predicate scans can sometimes "break" Impala's
>>>>> query plan.
>>>>>
>>>>> -Cliff
>>>>>
>>>>> On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal <
>>>>> mauri...@impact.com> wrote:
>>>>>
>>>>>> @Todd not to belabor the point, but when I suggested breaking up
>>>>>> small dim tables into multiple parquet files (and in this thread's 
>>>>>> context
>>>>>> perhaps partition kudu table, even if small, into multiple tablets), it 
>>>>>> was
>>>>>> to speed up joins/exchanges, not to parallelize the scan.
>>>>>>
>>>>>> For example recently we ran into this slow query where the 14M record
>>>>>> dimension fit into a single file & block, so it got scanned on a single
>>>>>> node though still pretty quickly (300ms), however it caused the join to
>>>>>> take 25+ seconds and bogged down the entire query.  See highlighted
>>>>>> fragment and its parent.
>>>>>>
>>>>>> So we broke it into several small files the way I described in my
>>>>>> previous post, and now join and query are fast (6s).
>>>>>>
>>>>>> -m
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 16, 2018 at 3:55 PM, Todd Lipcon 
>>>>>> wrote:
>>>>>>
>>>>>>> I suppose in the case that the dimension table scan makes a
>>>>>>> non-trivial portion of your workload time, then yea, parallelizing the 
>>>>>>> scan
>>>>>>> as you suggest would be beneficial. That said, in typical anal

Re: "broadcast" tablet replication for kudu?

2018-07-23 Thread Todd Lipcon
Are you on the latest release of Impala? It switched from using Thrift for
RPC to a new implementation (actually borrowed from kudu) which might help
broadcast performance a bit.

Todd

On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin  wrote:

> sorry to revive the old thread but I am curious if there is a good way to
> speed up requests to frequently used tables in Kudu.
>
> On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin 
> wrote:
>
>> bummer..After reading your guys conversation, I wish there was an easier
>> way...we will have the same issue as we have a few dozens of tables which
>> are used very frequently in joins and I was hoping there was an easy way to
>> replicate them on most of the nodes to avoid broadcasts every time
>>
>> On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick <cresn...@mediamath.com> wrote:
>>
>>> The table in our case is 12x hashed and ranged by month, so the
>>> broadcasts were often to all (12) nodes.
>>>
>>> On Apr 12, 2018 12:58 AM, Mauricio Aristizabal 
>>> wrote:
>>> Sorry I left that out Cliff, FWIW it does seem to have been broadcast..
>>>
>>>
>>>
>>> Not sure though how a shuffle would be much different from a broadcast
>>> if entire table is 1 file/block in 1 node.
>>>
>>> On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick  wrote:
>>>
>>>> From the screenshot it does not look like there was a broadcast of the
>>>> dimension table(s), so it could be the case here that the multiple smaller
>>>> sends helps. Our dim tables are generally in the single-digit millions and
>>>> Impala chooses to broadcast them. Since the fact result cardinality is
>>>> always much smaller, we've found that forcing a [shuffle] dimension join is
>>>> actually faster since it only sends dims once rather than all to all nodes.
>>>> The degenerative performance of broadcast is especially obvious when the
>>>> query returns zero results. I don't have much experience here, but it does
>>>> seem that Kudu's efficient predicate scans can sometimes "break" Impala's
>>>> query plan.
>>>>
>>>> -Cliff
>>>>
>>>> On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal <
>>>> mauri...@impact.com> wrote:
>>>>
>>>>> @Todd not to belabor the point, but when I suggested breaking up small
>>>>> dim tables into multiple parquet files (and in this thread's context
>>>>> perhaps partition kudu table, even if small, into multiple tablets), it 
>>>>> was
>>>>> to speed up joins/exchanges, not to parallelize the scan.
>>>>>
>>>>> For example recently we ran into this slow query where the 14M record
>>>>> dimension fit into a single file & block, so it got scanned on a single
>>>>> node though still pretty quickly (300ms), however it caused the join to
>>>>> take 25+ seconds and bogged down the entire query.  See highlighted
>>>>> fragment and its parent.
>>>>>
>>>>> So we broke it into several small files the way I described in my
>>>>> previous post, and now join and query are fast (6s).
>>>>>
>>>>> -m
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 16, 2018 at 3:55 PM, Todd Lipcon 
>>>>> wrote:
>>>>>
>>>>>> I suppose in the case that the dimension table scan makes a
>>>>>> non-trivial portion of your workload time, then yea, parallelizing the 
>>>>>> scan
>>>>>> as you suggest would be beneficial. That said, in typical analytic 
>>>>>> queries,
>>>>>> scanning the dimension tables is very quick compared to scanning the
>>>>>> much-larger fact tables, so the extra parallelism on the dim table scan
>>>>>> isn't worth too much.
>>>>>>
>>>>>> -Todd
>>>>>>
>>>>>> On Fri, Mar 16, 2018 at 2:56 PM, Mauricio Aristizabal <
>>>>>> mauri...@impactradius.com> wrote:
>>>>>>
>>>>>>> @Todd I know working with parquet in the past I've seen small
>>>>>>> dimensions that fit in 1 single file/block limit parallelism of
>>>>>>> join/exchange/aggregation nodes, and I've forced those dims to spread
>>>>>>> across 20 or so blocks by leveraging SET PARQUET_FILE_SIZE=8m; or 
>>>

Re: spark on kudu performance!

2018-07-05 Thread Todd Lipcon
On Mon, Jun 11, 2018 at 5:52 AM, fengba...@uce.cn  wrote:

> Hi:
>
>  I use kudu official website development documents, use
> spark analysis kudu data(kudu's version is 1.6.0):
>
> the official  code is :
> *val df = sqlContext.read.options(Map("kudu.master" ->
> "kudu.master:7051","kudu.table" -> "kudu_table")).kudu // Query using the
> Spark API... df.select("id").filter("id" >= 5).show()*
>
>
> My question  is :
> (1)If I use the official website code, when creating
> data collection of df, the data of my table is about 1.8
> billion, and then the filter of df is performed. This is
> equivalent to loading 1.8 billion data into memory each
> time, and the performance is very poor.
>

That's not correct. Data frames are lazy-evaluated, so when you use a
filter like the above, it does not fully materialize the whole data frame
into memory before it begins to filter.

You can also use ".explain()" to see whether the filter you are specifying
is getting pushed down properly to Kudu.
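
For example, a minimal sketch from the Spark shell (assuming the kudu-spark
integration is on the classpath; the master address and table name below are
just placeholders):

  import org.apache.kudu.spark.kudu._

  val df = sqlContext.read
    .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
    .kudu

  // Nothing has been read from Kudu yet -- the data frame is lazy.
  val filtered = df.select("id").filter("id >= 5")

  // The physical plan should show the "id >= 5" predicate attached to the Kudu
  // scan itself rather than a separate Spark Filter stage over all the rows.
  filtered.explain()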


>
> (2)Create a time-based range partition on the 1.8 billion
> table, and then directly use the underlying java api,scan
> partition to analyze, this is not the amount of data each
> time loading is the specified number of partitions instead
> of 1.8 billion data?
>
> Please give me some suggestions, thanks!
>
>
The above should happen automatically so long as the filter predicate has
been pushed down. Using 'explain()' and showing us the results, along with
the code you used to create your table, will help understand what might be
the problem with performance.

-Todd
--
Todd Lipcon
Software Engineer, Cloudera


Re: Adding a new kudu master

2018-07-05 Thread Todd Lipcon
On Mon, Jul 2, 2018 at 4:36 AM, Sergejs Andrejevs 
wrote:

> Hi there,
>
>
>
> I am very glad Kudu is evolving so rapidly. Thanks for your contributions!
>
>
>
> I have faced with a challenge:
>
> I need to upgrade (reinstall) prod servers, where 3 Kudu masters are
> running. What would be the best way to do it from Kudu perspective?
>

Will you be changing the hostnames of the servers or just reinstalling the
OS but otherwise keeping the same configuration?

Have you partitioned the servers with separate OS and data disks? If so,
you can likely just reinstall the OS without reformatting the data disks.
When the OS has been reinstalled, simply install Kudu again, use the same
configuration as before to point at the existing data directories, and
everything should be fine.


> If it is not officially supported yet, could you advise a way, which
> minimizes the risks?
>
>
>
> Environment/conditions:
>
> Cloudera 5.14
>
> Kudu 1.6
>
> High-level procedure: remove 1 server from cluster, upgrade, return back
> to CM cluster, check, proceed with the next server.
>
> Some downtime is possible (let’s say < 1h)
>

I can't give any particular advise on how this might interact with Cloudera
Manager. I think the Cloudera community forum probably is a more
appropriate spot for that. But, from a Kudu-only perspective, it should be
fine to have a mixed-OS cluster where one master has been upgraded before
the others. Just keep the data around when you reinstall.


>
>
> Approach:
>
> I have already tried out at test cluster the steps, which were used to
> migrate from a single-master to multi-master cluster (see the plan below).
> However, there was a remark not to use it in order to add new nodes for 3+
> master cluster.
> Therefore, what could be an alternative way? If no alternatives, what
> could be the extra steps to pay additional attention to check the status if
> Kudu cluster is in a good shape?
> Any comments/suggestions are extremely appreciated as well.
>
>
>
> Current plan:
>
> 0.   Cluster check
>
> 1.   Stop all masters (let’s call them master-1, master-2, master-3).
>
> 2.   Remove from CM one Kudu master, e.g. master-3.
>
> 3.   Update raft meta by removing “master-3” from Kudu cluster (to be
> able to restart Kudu):
> *sudo -u kudu kudu local_replica cmeta rewrite_raft_config
>  1234567890:master-1:7051
> 0987654321:master-2:7051*
> By the way, do I understand right that tablet_id
>  is a special, containing cluster meta
> info?
>
> 4.   Start all masters. From now Kudu temporary consists of 2 masters.
>

Why bother removing the master that's down? If you can keep its data
around, and it will come back with the same hostname, there's no need to
remove it. You could simply shut down the node and be running with 2/3
servers up, which would give you the same reliability as using 2/2 without
the extra steps.


> 5.   Cluster check.
>
> 6.   Upgrade the excluded server
>
> 7.   Stop all masters.
>
> 8.   Prepare “master-3” as Kudu master:
>
> *sudo -u kudu kudu fs format --fs_wal_dir=…  --fs_data_dirs=… sudo -u kudu
> kudu fs dump uuid --fs_wal_dir=…  --fs_data_dirs=… 2>/dev/null*
> Let’s say obtained id is 77.
> Add master-3 to CM.
>
> 9.   Run metainfo update at existing masters, i.e. master-1 and
> master-2:
>
> *sudo -u kudu kudu local_replica cmeta rewrite_raft_config
>  1234567890:master-1:7051
> 0987654321:master-2:7051 77:master-3:7051*
>
> 10.   Start one master, e.g. master-1.
>
> Copy the current cluster state from master-1 to master-3:
> *sudo -u kudu kudu local_replica copy_from_remote --fs_wal_dir=…
> --fs_data_dirs=…  1234567890:master-1:7051*
>
> 11.   Start remaining Kudu masters: master-2 and master-3.
>
> 12.   Cluster check.
>
>
>
> * Optionally, at first there may be added 1 extra node (to increase from 3
> to 4 the initial number of Kudu masters, so that after removal of 1 node
> there are still HA with quorum of 3 masters). In this case steps 7-12
> should be repeated and additionally HiveMetaStore update should be executed:
>
> *UPDATE hive_meta_store_database.TABLE_PARAMS*
>
> *SET PARAM_VALUE = 'master-1,master-2,master-3'*
>
> *WHERE PARAM_KEY = 'kudu.master_addresses' AND PARAM_VALUE =
> 'master-1,master-2,master-3,master-4';*
>
>   After upgrades, the master-4 node to be removed by running
> steps 1-5.
>
>
>
> Thanks!
>
>
>
> Best regards,
>
> *Sergejs Andrejevs*
>
> Information about how we process personal data
> <http://www.intrum.com/privacy>
>
>
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: kudu Insert、Update、Delete operating data lost

2018-06-15 Thread Todd Lipcon
Hi,

I'm having trouble understanding your question. Can you give an example of
the operations you are trying and why you believe data is being lost?

-Todd

On Thu, Jun 14, 2018 at 8:24 PM, 秦坤  wrote:

> hello:
> I use java scan api to operate kudu in large batches
> If a session contains Insert, Update, Delete operations, if
> the database does not exist in the data there will be
> some new data loss, how to avoid such problems.
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: will upsert have bad effect on scan performace?

2018-05-21 Thread Todd Lipcon
Hi Andy,

An upsert of a row that does not exist is exactly the same as an insert.

You can think of upsert as:

try {
  insert the row
} catch (Already Exists) {
  update the row
}

In reality, the conversion from insert to update is a bit more efficient
compared to doing the above yourself (and it's atomic). But, in terms of
performance, once the row has been inserted, it is the same as any other
row.
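
For illustration, a minimal sketch with the Java client called from Scala (the
master address, table name, and column names are placeholders, not anything
from your schema):

  import org.apache.kudu.client.KuduClient

  val client  = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
  val table   = client.openTable("my_table")
  val session = client.newSession()

  val upsert = table.newUpsert()
  val row    = upsert.getRow
  row.addLong("id", 42L)          // key column: inserted if absent, updated if present
  row.addString("value", "v1")
  session.apply(upsert)

  session.close()                 // flushes any pending operations
  client.close()

Running this twice writes the row once and then updates it in place; the second
run costs the same as any other update.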

-Todd

On Mon, May 21, 2018 at 3:14 AM, Andy Liu <emmanual@gmail.com> wrote:

> Thanks in advance.
> hi, i have used java upsert api to load data instead of insert api.
> will it have a bad effect even though these data were firstly loaded.
> i do not know compaction mechanism of kudu, will it lead to many
> compaction, thus lead to bad scan performance.
>
> Best regards.
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: scan performance super bad

2018-05-14 Thread Todd Lipcon
> PARTITION "61" <= VALUES < "615000",
> PARTITION "615000" <= VALUES < "62",
> PARTITION "62" <= VALUES < "625000",
> PARTITION "625000" <= VALUES < "63",
> PARTITION "63" <= VALUES < "635000",
> PARTITION "635000" <= VALUES < "64",
> PARTITION "64" <= VALUES < "645000",
> PARTITION "645000" <= VALUES < "65",
> PARTITION "65" <= VALUES < "655000",
> PARTITION "655000" <= VALUES < "66",
> PARTITION "66" <= VALUES < "665000",
> PARTITION "665000" <= VALUES < "67",
> PARTITION "67" <= VALUES < "675000",
> PARTITION "675000" <= VALUES < "68",
> PARTITION "68" <= VALUES < "685000",
> PARTITION "685000" <= VALUES < "69",
> PARTITION "69" <= VALUES < "695000",
> PARTITION "695000" <= VALUES < "70",
> PARTITION "70" <= VALUES < "705000",
> PARTITION "705000" <= VALUES < "71",
> PARTITION "71" <= VALUES < "715000",
> PARTITION "715000" <= VALUES < "72",
> PARTITION "72" <= VALUES < "725000",
> PARTITION "725000" <= VALUES < "73",
> PARTITION "73" <= VALUES < "735000",
> PARTITION "735000" <= VALUES < "74",
> PARTITION "74" <= VALUES < "745000",
> PARTITION "745000" <= VALUES < "75",
> PARTITION "75" <= VALUES < "755000",
> PARTITION "755000" <= VALUES < "76",
> PARTITION "76" <= VALUES < "765000",
> PARTITION "765000" <= VALUES < "77",
> PARTITION "77" <= VALUES < "775000",
> PARTITION "775000" <= VALUES < "78",
> PARTITION "78" <= VALUES < "785000",
> PARTITION "785000" <= VALUES < "79",
> PARTITION "79" <= VALUES < "795000",
> PARTITION "795000" <= VALUES < "80",
> PARTITION "80" <= VALUES < "805000",
> PARTITION "805000" <= VALUES < "81",
> PARTITION "81" <= VALUES < "815000",
> PARTITION "815000" <= VALUES < "82",
> PARTITION "82" <= VALUES < "825000",
> PARTITION "825000" <= VALUES < "83",
> PARTITION "83" <= VALUES < "835000",
> PARTITION "835000" <= VALUES < "84",
> PARTITION "84" <= VALUES < "845000",
> PARTITION "845000" <= VALUES < "85",
> PARTITION "85" <= VALUES < "855000",
> PARTITION "855000" <= VALUES < "86",
> PARTITION "86" <= VALUES < "865000",
> PARTITION "865000" <= VALUES < "87",
> PARTITION "87" <= VALUES < "875000",
> PARTITION "875000" <= VALUES < "88",
> PARTITION "88" <= VALUES < "885000",
> PARTITION "885000" <= VALUES < "89",
> PARTITION "89" <= VALUES < "895000",
> PARTITION "895000" <= VALUES < "90",
> PARTITION "90" <= VALUES < "905000",
> PARTITION "905000" <= VALUES < "91",
> PARTITION "91" <= VALUES < "915000",
> PARTITION "915000" <= VALUES < "92",
> PARTITION "92" <= VALUES < "925000",
> PARTITION "925000" <= VALUES < "93",
> PARTITION "93" <= VALUES < "935000",
> PARTITION "935000" <= VALUES < "94",
> PARTITION "94" <= VALUES < "945000",
> PARTITION "945000" <= VALUES < "95",
> PARTITION "95" <= VALUES < "955000",
> PARTITION "955000" <= VALUES < "96",
> PARTITION "96" <= VALUES < "965000",
> PARTITION "965000" <= VALUES < "97",
> PARTITION "97" <= VALUES < "975000",
> PARTITION "975000" <= VALUES < "98",
> PARTITION "98" <= VALUES < "985000",
> PARTITION "985000" <= VALUES < "99",
> PARTITION "99" <= VALUES < "995000",
> PARTITION VALUES >= "995000"
> )
>
>
>
So it looks like you have a numeric value being stored here in the string
column. Are you sure that you are properly zero-padding when creating your
key? For example if you accidentally scan from "50_..." to "80_..." you
will end up scanning a huge portion of your table.
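
A tiny sketch of what I mean by zero-padding (the field widths here just follow
the example key format above and are an assumption about your data):

  // Pad each numeric component so that lexicographic (string) order matches
  // numeric order; otherwise a scan from "50_..." to "80_..." also picks up
  // ids such as 6, 512 or 7999, whose unpadded forms sort between "50" and "80".
  def buildPaddedKey(uniqueId: Long, timeStr: String, offset: Int): String =
    f"$uniqueId%06d_${timeStr}_$offset%04d"

  val lower = buildPaddedKey(1320, "201803220420", 1)     // "001320_201803220420_0001"
  val upper = buildPaddedKey(1320, "201803230420", 9999)  // "001320_201803230420_9999"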


> i did not delete rows in this table ever.
>
> my scanner code is below:
> buildKey method will build the lower bound and the upper bound, the unique
> id is same, the startRow offset(third part) is 0, and the endRow offset is
> , startRow and endRow only differs from time.
> though the max offset is big(999), generally it is less than 100.
>
> private KuduScanner buildScanner(Metric startRow, Metric endRow, 
> List dimensionIds, List dimensionFilterList) {
> KuduTable kuduTable = 
> kuduService.getKuduTable(BizConfig.parseFrom(startRow.getBizId()));
>
> PartialRow lower = kuduTable.getSchema().newPartialRow();
> lower.addString("key", buildKey(startRow));
> PartialRow upper = kuduTable.getSchema().newPartialRow();
> upper.addString("key", buildKey(endRow));
>
> LOG.info("build scanner. lower = {}, upper = {}", buildKey(startRow), 
> buildKey(endRow));
>
> KuduScanner.KuduScannerBuilder builder = 
> kuduService.getKuduClient().newScannerBuilder(kuduTable);
> builder.setProjectedColumnNames(COLUMNS);
> builder.lowerBound(lower);
> builder.exclusiveUpperBound(upper);
> builder.prefetching(true);
> builder.batchSizeBytes(MAX_BATCH_SIZE);
>
> if (CollectionUtils.isNotEmpty(dimensionFilterList)) {
> for (int i = 0; i < dimensionIds.size() && i < MAX_DIMENSION_NUM; 
> i++) {
> for (DimensionFilter dimensionFilter : dimensionFilterList) {
> if (!Objects.equals(dimensionFilter.getDimensionId(), 
> dimensionIds.get(i))) {
> continue;
> }
> ColumnSchema columnSchema = 
> kuduTable.getSchema().getColumn(String.format("dimension_%02d", i));
> KuduPredicate predicate = buildKuduPredicate(columnSchema, 
> dimensionFilter);
> if (predicate != null) {
> builder.addPredicate(predicate);
> LOG.info("add predicate. predicate = {}", 
> predicate.toString());
> }
> }
> }
> }
> return builder.build();
> }
>
>
What client version are you using? 1.7.0?


> i checked the metrics, only get content below, it seems no relationship
> with my table.
>

Looks like you got the metrics from the kudu master, not a tablet server.
You need to figure out which tablet server you are scanning and grab the
metrics from that one.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: scan performance super bad

2018-05-13 Thread Todd Lipcon
Can you share the code you are using to create the scanner and call
nextRows?

Can you also copy-paste the info provided on the web UI of the kudu master
for this table? It will show the schema and partitioning information.

Is it possible that your table includes a lot of deleted rows? i.e did you
load the table, then delete all the rows, then load again? This can cause
some performance issues in current versions of Kudu as the scanner needs to
"skip over" the deleted rows before it finds any to return.

Based on your description I would expect this to be doing a simple range
scan for the returned rows, and return in just a few milliseconds. The fact
that it is taking 500ms implies that the server is scanning a lot of
non-matching rows before finding a few that match. You can also check the
metrics:

http://my-tablet-server:8050/metrics?metrics=scanner_rows

and compare the 'rows scanned' vs 'rows returned' metric. Capture the
values both before and after you run the query, and you should see if
'rows_scanned' is much larger than 'rows_returned'.

-Todd

On Sun, May 13, 2018 at 12:56 AM, 一米阳光 <710339...@qq.com> wrote:

> hi, i have faced a difficult problem when using kudu 1.6.
>
> my kudu table schema is generally like this:
> column name:key, type:string, prefix encoding, lz4 compression, primary key
> column name:value, type:string, lz4 compression
>
> the primary key is built from several parts:
> 001320_201803220420_0001
> the first part is a unique id,
> the second part is time format string,
> the third part is incremental integer(for a unique id and an fixed time,
> there may exist multi value, so i used this part to distinguish)
>
> the table range partition use the first part, split it like below
> range<005000
> 005000<= range <01
> 01<= range <015000
> 015000<= range <02
> .
> .
> 995000<= range
>
> when i want to scan data for a unique id and range of time, the lower
> bound like 001320_201803220420_0001 and the higher bound like
> 001320_201803230420_, it takes about 500ms to call
> kuduScanner.nextRows() and the number of rows it returns is between 20~50.
> All size of data between the bound is about 8000, so i should call hundreds
> times nextRows() to fetch all data, and it finally cost several minutes.
>
> i don't know why this happened and how to resolve itmaybe the final
> solution is that i should giving up kudu, using hbase instead...
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Kudu read - performance issue

2018-05-11 Thread Todd Lipcon
On Fri, May 11, 2018 at 12:05 PM, Todor Petrov <todor.pet...@rms.com> wrote:

> Hi there,
>
>
>
> I have an interesting performance issue reading from Kudu. Hopefully there
> is a good explanation for it because the difference in the performance is
> quite significant and it puzzles me a lot.
>
>
>
> Basically we have a table with the following schema:
>
>
>
> *Column1, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION*
>
> *Column2, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION*
>
> *…. (a bunch of int32 and int16 columns)*
>
>
>
> *PK is (Column1, Column2)*
>
> *HASH(Column1) PARTITIONS 4*
>
>
>
> The number of records is *~60M*. *~5K* distinct Column1 values. *~1.4M*
> distinct values for Column2.
>
>
>
> All tests are made on one core. I think the hardware specs are not
> important.
>
>
>
> 1)  If we query all data using
>
>
>
> *  val scanner = *
>
> *kuduClient.getAsyncScannerBuilder(table)*
>
> *
> .addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema,
> ComparisonOp.EQUAL, column1Value)).build()*
>
>
>
> We use 3 scanners in parallel (one query for each unique value of column1).
>
>
>
> All fields from the returned rows are read and some internal structures
> are built.
>
>
>
> In this case, it takes *~40 sec* to load all the data.
>
>
>
> 2)  If we query using “InListPredicate”, then the performance is
> super slow.
>
>
>
> *  val scanner = *
>
> *kuduClient.getAsyncScannerBuilder(table)*
>
> *
> .addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema,
> ComparisonOp.EQUAL, column1Value))*
>
> *  .addPredicate(KuduPredicate.newInListPredicate(Column2Schema,
> column2Values.asJava)).build()*
>
>
>
> Same as in 1), 3 scanners in parallel, all records are read and some
> in-memory structures are built. This time column2 values are split into a
> bunch of chunks and we send a request for each unique value of column1 and
> each chunk of column2 values.
>

Are you sorting the values of 'column2' before doing the chunking? Kudu
doesn't use indexes for evaluating IN-list predicates except for using the
min(in-list-values) and max(in-list-values). So, if you had for example:

pre-chunk in-list: 1,2,3,4,5,6
chunk 1: col2 IN (1,6)
chunk 2: col2 IN (2,5)
chunk 3: col2 IN (3,4)

then you will actually scan over the middle portion of that table 3 times.

If you sort the in-list before chunking you'll avoid the multiple-scan
effect here.
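
A rough sketch of that (assuming int32 column2 values and the synchronous
client; kuduClient, table, column1Schema, column1Value, column2Schema and
column2Values stand in for whatever you already have):

  import scala.collection.JavaConverters._
  import org.apache.kudu.client.KuduPredicate
  import org.apache.kudu.client.KuduPredicate.ComparisonOp

  // Sort first, then chunk, so each request covers a narrow, non-overlapping
  // slice of column2 instead of re-reading the same middle of the table.
  val chunkSize = 5000
  val scanners = column2Values.sorted.grouped(chunkSize).map { chunk =>
    kuduClient.newScannerBuilder(table)
      .addPredicate(KuduPredicate.newComparisonPredicate(
        column1Schema, ComparisonOp.EQUAL, column1Value))
      .addPredicate(KuduPredicate.newInListPredicate(
        column2Schema, chunk.map(Int.box).asJava))
      .build()
  }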

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Kudu Exception -Couldnot find any valid location.

2018-04-16 Thread Todd Lipcon
Hi Pranab,

I see you got some help on Slack as well, but just to bring the answer
here in case anyone else hits this issue: it sounds like your client
is able to reach the kudu master process but unable to resolve the
address of the kudu tablet servers. This might be because the tablet
servers do not have a consistent DNS setup with your client -- eg they
are using EC2-internal hostnames which are not resolvable from outside
EC2. You can likely resolve this by setting up some kind of VPN into
your EC2 VPC which includes DNS forwarding, but I'm not particularly
knowledgeable about that aspect of AWS.

-Todd

On Sun, Apr 15, 2018 at 11:35 AM, Pranab Batsa <prba...@gmail.com> wrote:
> Hi, can someone help me out? I am trying to connect my local system to Kudu
> on EC2. I am able to connect to the master and do operations like creating
> a table, but I am not able to do any operation on the table itself, such as
> inserting into it. It throws an exception like: Could not find any valid
> location. Unknown host exception. Thanks in advance for your valuable time.



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: question about kudu performance!

2018-04-13 Thread Todd Lipcon
> On Tue, Apr 3, 2018 at 7:38 PM, fengba...@uce.cn <fengba...@uce.cn> wrote:
> >
> > (1)version
> > The component version is:
> >
> >- CDH 5.14
> >- Kudu 1.6
> >
> > (2)Framework
> >
> > The size of the kudu cluster(total 10 machines,256G mem,16*1.2T sas disk):
> >
> > -3 master node
> >
> > -7 tablet server node
> >
> > Independent deployment of master and tablet server, but YARN NodeManager and
> > tablet server are deployed together
> >
> >  (3) kudu parameter :
> >
> > maintenance_manager_num_threads=24(The number of data directories for each 
> > machine disk is 8)

If you have 16 disks, why only 8 directories?

I would recommend reducing this significantly. We usually recommend
one thread for every 3 disks.

> >
> > memory_limit_hard_bytes=150G
> >
> > I have a performance problem: every 2-3 weeks, clusters start to make 
> > MajorDeltaCompactionOp, when kudu performs insert and update performance 
> > decreases, when the data is written, the update operation almost stops.
>

The major delta compactions should actually improve update
performance, not decrease it. Do you have any more detailed metrics to
explain the performance drop?

If you upgrade to Kudu 1.7 the tservers will start to produce a
diagnostics log. If you can send a diagnostics log segment from the
point in time when the performance problem is occurring we can try to
understand this behavior better.

>
> > Is it possible to adjust the --memory_limit_hard_bytes parameter to 
> > 256G*80%(my yarn nm and TS are deployed together)?

If YARN is also scheduling work on these nodes, then you may end up
swapping and that would really kill performance. I usually don't see
improvements in Kudu performance by providing such huge amounts of
memory. The one exception would be that you might get some improvement
using a large block cache if your existing cache is showing a low hit
rate. The metrics would help determine that.

> >
> > Can we adjust the parameter --tablet_history_max_age_sec to shorten the 
> > MajorDeltaCompactionOp interval?

Nope, that won't affect the major delta compaction frequency. The one
undocumented tunable that is relevant is
--tablet_delta_store_major_compact_min_ratio (default 0.1). Raising
this would decrease the frequency of major delta compaction, but I
think there is likely something else going on here.

-Todd

> >
> > Can you give me some suggestions to optimize this performance problem?

Usually the best way to improve performance is by thinking carefully
about schema design, partitioning, and workload, rather than tuning
configuration. Maybe you can share more about your workload, schema,
and partitioning.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Limitations on total amount of data stored in one kudu table

2018-03-21 Thread Todd Lipcon
On Tue, Mar 20, 2018 at 2:15 AM, Кравец Владимир Александрович <
krav...@kamatech.ru> wrote:

> Hi, I'm new to Kudu and I'm trying to understand the applicability for our
> purposes. So I met the following article about the kudu limitations -
> https://www.cloudera.com/documentation/enterprise/latest/topics/kudu_
> limitations.html#concept_cws_n4n_5z. Do I understand correctly that this
> means that the maximum total amount of usefull compressed stored data in
> one kudu-table  is 8TB? Here my calcs:
>

I think there are a few mistakes below. Comments inline.


> 1. Amount of stored data per tablet = Recommended maximum amount of stored
> data / Recommended maximum number of tablets per tablet server = 8 000 / 2
> 000 = 4 GB per tablet
>

That assumes that every tablet is equally sized and that you have hit the
limit on number of tablets. Even though you _can_ have 2000 tablets per
server, you might want fewer. In addition, you don't need to have every
tablet be the same size -- some might be 10GB while others might be 1GB or
smaller.


> 2. Maximum number of tablets per table for each tablet server
> pre-replication = Maximum number of tablets per table for each tablet
> server is 60, post-replication / number of replicas = 60 / 3 = 20 tablets
> per table per tablet server
>

The key word that you didn't copy here is "at table-creation time". This
limitation has to do with avoiding some issues we have seen when trying to
create too many tablets at the same time on the cluster. With range
partitioning, you can always add more partitions later. For example it's
very common to add a new partition for each day. So, a single table can,
after some days, have more than 20 tablets on a given server.
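
For example, a sketch of adding one more day with the Java client from Scala
(the table name, the range column "event_time", and the bound values are made
up for the example):

  import org.apache.kudu.client.{AlterTableOptions, KuduClient}

  val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
  val table  = client.openTable("events")
  val schema = table.getSchema

  // Assumed bounds: one day starting 2018-03-16 00:00 UTC, in microseconds.
  val dayStartMicros: Long     = 1521158400L * 1000000L
  val nextDayStartMicros: Long = dayStartMicros + 86400L * 1000000L

  // New range partition covering [dayStartMicros, nextDayStartMicros).
  val lower = schema.newPartialRow()
  val upper = schema.newPartialRow()
  lower.addLong("event_time", dayStartMicros)
  upper.addLong("event_time", nextDayStartMicros)
  client.alterTable("events", new AlterTableOptions().addRangePartition(lower, upper))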


> 3. Total amount of stored data per table, pre-replication = Amount of
> stored data per tablet * Maximum number of tablets per table for each
> tablet server pre-replication *  Maximum number of tablet servers = 4 GB *
> 20 * 100 = 8TB
>

Per above, this isn't really the case. For example, on one cluster at
Cloudera which runs an internal workload, we have one table that is 82TB
and another which is 46TB. I've seen much larger tables in some user
installations as well.


> And I also would like to understand how fundamental the nature of the
> limitation "Maximum number of tablets per table for each tablet server is
> 60, post-replication"? Is it possible that this restriction will be removed?
>

See above.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: 答复: A few questions for using Kudu

2018-03-19 Thread Todd Lipcon
On Thu, Mar 15, 2018 at 8:32 PM, 张晓宁 <zhangxiaon...@jd.com> wrote:

> Thank you Dan! My follow-up comments with XiaoNing.
>
>
>
> *From:* Dan Burkert [mailto:danburk...@apache.org]
> *Sent:* March 16, 2018, 1:06
> *To:* user@kudu.apache.org
> *Subject:* Re: A few questions for using Kudu
>
>
>
> Hi, answers inline:
>
> On Thu, Mar 15, 2018 at 3:12 AM, 张晓宁 <zhangxiaon...@jd.com> wrote:
>
> I have a few questions for using kudu:
>
> 1.   As more and more data inserted to kudu, the performance
> decrease. After continuous data insertion for about 30 minutes, the TPS
> performance decreased with 20%, and after 1-hour data insertion, the
> performance decreased with 40%. Is this a known issue?
>
> This is expected if you are inserting data in random order.  If you try
> another benchmark where you insert data in primary key sorted order, you'll
> see that the performance will be much higher, and more consistent.  If you
> have a heavy insert workload, this kind of optimization is critical.  The
> table's partitioning and primary key can often be designed to make this
> happen naturally, but it's a dataset dependent thing, so without more
> specifics about your data it's difficult to give more precise advice.
>
>  XiaoNing: Our table has 2 partitions,the first level partition is by
> date range(using the column timestamp),one partition for one single day,
> and the second partition is by a hash on 2 column(key + host).These 3
> columns(timestamp,key,host) are the primary key of the table.For you
> comment “insert data in primary key sorted order”,do you mean we need to
> sort the data on the 3 primary-key columns before insertion?
>

If timestamp is the first column then it should probably be somewhat
naturally-sorted by the primary key, right? It doesn't need to be perfectly
sorted, but if the inserts are in roughly PK order, we will avoid
unnecessary compaction.


> 2.   When setting the replica number to be 1, totally I will have 2
> copy of data(1 master data + 1 replica data), is this true?
>
> That's incorrect.  The master node does not hold any table data.  If you
> set the number of replicas to be 1, you will lose data if you lose the
> tablet server which holds the replica.  We always recommend production
> workloads set number of replicas to 3 in order to have fault tolerance.
>
>  XiaoNing: So if we want to have fault tolerance, we should at least set
> the replica number to be 3, right?
>

That's right.

-Todd
--
Todd Lipcon
Software Engineer, Cloudera


Re: Kudu client close exception

2018-03-19 Thread Todd Lipcon
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>   at 
> org.apache.kudu.client.shaded.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>
>
>
> Thanks
>
> Rainerdun
>
>
>
>
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: "broadcast" tablet replication for kudu?

2018-03-16 Thread Todd Lipcon
I suppose in the case that the dimension table scan makes a non-trivial
portion of your workload time, then yea, parallelizing the scan as you
suggest would be beneficial. That said, in typical analytic queries,
scanning the dimension tables is very quick compared to scanning the
much-larger fact tables, so the extra parallelism on the dim table scan
isn't worth too much.

-Todd

On Fri, Mar 16, 2018 at 2:56 PM, Mauricio Aristizabal <
mauri...@impactradius.com> wrote:

> @Todd I know working with parquet in the past I've seen small dimensions
> that fit in 1 single file/block limit parallelism of
> join/exchange/aggregation nodes, and I've forced those dims to spread
> across 20 or so blocks by leveraging SET PARQUET_FILE_SIZE=8m; or similar
> when doing INSERT OVERWRITE to load them, which then allows these
> operations to parallelize across that many nodes.
>
> Wouldn't it be useful here for Cliff's small dims to be partitioned into a
> couple tablets to similarly improve parallelism?
>
> -m
>
> On Fri, Mar 16, 2018 at 2:29 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick <cre...@gmail.com> wrote:
>>
>>> Hey Todd,
>>>
>>> Thanks for that explanation, as well as all the great work you're doing
>>> -- it's much appreciated! I just have one last follow-up question. Reading
>>> about BROADCAST operations (Kudu, Spark, Flink, etc. ) it seems the smaller
>>> table is always copied in its entirety BEFORE the predicate is evaluated.
>>>
>>
>> That's not quite true. If you have a predicate on a joined column, or on
>> one of the columns in the joined table, it will be pushed down to the
>> "scan" operator, which happens before the "exchange". In addition, there is
>> a feature called "runtime filters" that can push dynamically-generated
>> filters from one side of the exchange to the other.
>>
>>
>>> But since the Kudu client provides a serialized scanner as part of the
>>> ScanToken API, why wouldn't Impala use that instead if it knows that the
>>> table is Kudu and the query has any type of predicate? Perhaps if I
>>> hash-partition the table I could maybe force this (because that complicates
>>> a BROADCAST)? I guess this is really a question for Impala but perhaps
>>> there is a more basic reason.
>>>
>>
>> Impala could definitely be smarter, just a matter of programming
>> Kudu-specific join strategies into the optimizer. Today, the optimizer
>> isn't aware of the unique properties of Kudu scans vs other storage
>> mechanisms.
>>
>> -Todd
>>
>>
>>>
>>> -Cliff
>>>
>>> On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>>> On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick <
>>>> cresn...@mediamath.com> wrote:
>>>>
>>>>> I thought I had read that the Kudu client can configure a scan for
>>>>> CLOSEST_REPLICA and assumed this was a way to take advantage of data
>>>>> collocation.
>>>>>
>>>>
>>>> Yea, when a client uses CLOSEST_REPLICA it will read a local one if
>>>> available. However, that doesn't influence the higher level operation of
>>>> the Impala (or Spark) planner. The planner isn't aware of the replication
>>>> policy, so it will use one of the existing supported JOIN strategies. Given
>>>> statistics, it will choose to broadcast the small table, which means that
>>>> it will create a plan that looks like:
>>>>
>>>>
>>>>+-+
>>>>| |
>>>> +-->build  JOIN  |
>>>> |  | |
>>>> |  |  probe  |
>>>>  +--+  +-+
>>>>  |  |  |
>>>>  | Exchange |  |
>>>> ++ (broadcast   |  |
>>>> ||  |  |
>>>> |+--+  |
>>>> |  |
>>>>   +-+  |
>>>>   | |+

Re: "broadcast" tablet replication for kudu?

2018-03-16 Thread Todd Lipcon
On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick <cre...@gmail.com> wrote:

> Hey Todd,
>
> Thanks for that explanation, as well as all the great work you're doing
> -- it's much appreciated! I just have one last follow-up question. Reading
> about BROADCAST operations (Kudu, Spark, Flink, etc. ) it seems the smaller
> table is always copied in its entirety BEFORE the predicate is evaluated.
>

That's not quite true. If you have a predicate on a joined column, or on
one of the columns in the joined table, it will be pushed down to the
"scan" operator, which happens before the "exchange". In addition, there is
a feature called "runtime filters" that can push dynamically-generated
filters from one side of the exchange to the other.


> But since the Kudu client provides a serialized scanner as part of the
> ScanToken API, why wouldn't Impala use that instead if it knows that the
> table is Kudu and the query has any type of predicate? Perhaps if I
> hash-partition the table I could maybe force this (because that complicates
> a BROADCAST)? I guess this is really a question for Impala but perhaps
> there is a more basic reason.
>

Impala could definitely be smarter, just a matter of programming
Kudu-specific join strategies into the optimizer. Today, the optimizer
isn't aware of the unique properties of Kudu scans vs other storage
mechanisms.

-Todd


>
> -Cliff
>
> On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick <
>> cresn...@mediamath.com> wrote:
>>
>>> I thought I had read that the Kudu client can configure a scan for
>>> CLOSEST_REPLICA and assumed this was a way to take advantage of data
>>> collocation.
>>>
>>
>> Yea, when a client uses CLOSEST_REPLICA it will read a local one if
>> available. However, that doesn't influence the higher level operation of
>> the Impala (or Spark) planner. The planner isn't aware of the replication
>> policy, so it will use one of the existing supported JOIN strategies. Given
>> statistics, it will choose to broadcast the small table, which means that
>> it will create a plan that looks like:
>>
>>
>>                        +-----------------+
>>                        |                 |
>>         +------------->| build   JOIN    |
>>         |              |                 |
>>         |              |      probe      |
>>    +----+-------+      +--------+--------+
>>    |            |               |
>>    |  Exchange  |               |
>>    | (broadcast)|               |
>>    |            |               |
>>    +----+-------+               |
>>         |                       |
>>    +----+-------+      +--------+------------+
>>    |            |      |                     |
>>    |   SCAN     |      |  SCAN (other side)  |
>>    |   KUDU     |      |                     |
>>    +------------+      +---------------------+
>>
>> (hopefully the ASCII art comes through)
>>
>> In other words, the "scan kudu" operator scans the table once, and then
>> replicates the results of that scan into the JOIN operator. The "scan kudu"
>> operator of course will read its local copy, but it will still go through
>> the exchange process.
>>
>> For the use case you're talking about, where the join is just looking up
>> a single row by PK in a dimension table, ideally we'd be using an
>> altogether different join strategy such as nested-loop join, with the inner
>> "loop" actually being a Kudu PK lookup, but that strategy isn't implemented
>> by Impala.
>>
>> -Todd
>>
>>
>>
>>>  If this exists then how far out of context is my understanding of it?
>>> Reading about HDFS cache replication, I do know that Impala will choose a
>>> random replica there to more evenly distribute load. But especially
>>> compared to Kudu upsert, managing mutable data using Parquet is painful.
>>> So, perhaps to sum thing up, if nearly 100% of my metadata scan are single
>>> Primary Key lookups followed by a tiny broadcast then am I really just
>>> splitting hairs performance-wise between Kudu and HDFS-cached parquet?

Re: "broadcast" tablet replication for kudu?

2018-03-16 Thread Todd Lipcon
It's worth noting that, even if your table is replicated, Impala's planner
is unaware of this fact and it will give the same plan regardless. That is
to say, rather than every node scanning its local copy, instead a single
node will perform the whole scan (assuming it's a small table) and
broadcast it from there within the scope of a single query. So, I don't
think you'll see any performance improvements on Impala queries by
attempting something like an extremely high replication count.

I could see bumping the replication count to 5 for these tables since the
extra storage cost is low and it will ensure higher availability of the
important central tables, but I'd be surprised if there is any measurable
perf impact.
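
For reference, the replication factor is fixed when the table is created; a
minimal sketch with the Java client from Scala (the schema, table name and
master address are made up for the example):

  import scala.collection.JavaConverters._
  import org.apache.kudu.{ColumnSchema, Schema, Type}
  import org.apache.kudu.client.{CreateTableOptions, KuduClient}

  val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
  val schema = new Schema(List(
    new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64).key(true).build(),
    new ColumnSchema.ColumnSchemaBuilder("name", Type.STRING).build()
  ).asJava)

  val opts = new CreateTableOptions()
    .setRangePartitionColumns(List("id").asJava)
    .setNumReplicas(5)   // must be odd; changing it later means recreating the table
  client.createTable("dim_small", schema, opts)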

-Todd

On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick <cresn...@mediamath.com>
wrote:

> Thanks for that, glad I was wrong there! Aside from replication
> considerations, is it also recommended the number of tablet servers be odd?
>
> I will check forums as you suggested, but from what I read after searching
> is that Impala relies on user configured caching strategies using HDFS
> cache.  The workload for these tables is very light write, maybe a dozen or
> so records per hour across 6 or 7 tables. The size of the tables ranges
> from thousands to low millions of rows so so sub-partitioning would not be
> required. So perhaps this is not a typical use-case but I think it could
> work quite well with kudu.
>
> From: Dan Burkert <danburk...@apache.org>
> Reply-To: "user@kudu.apache.org" <user@kudu.apache.org>
> Date: Friday, March 16, 2018 at 2:09 PM
> To: "user@kudu.apache.org" <user@kudu.apache.org>
> Subject: Re: "broadcast" tablet replication for kudu?
>
> The replication count is the number of tablet servers which Kudu will host
> copies on.  So if you set the replication level to 5, Kudu will put the
> data on 5 separate tablet servers.  There's no built-in broadcast table
> feature; upping the replication factor is the closest thing.  A couple of
> things to keep in mind:
>
> - Always use an odd replication count.  This is important due to how the
> Raft algorithm works.  Recent versions of Kudu won't even let you specify
> an even number without flipping some flags.
> - We don't test much beyond 5 replicas.  It *should* work, but you
> may run in to issues since it's a relatively rare configuration.  With a
> heavy write workload and many replicas you are even more likely to
> encounter issues.
>
> It's also worth checking in an Impala forum whether it has features that
> make joins against small broadcast tables better?  Perhaps Impala can cache
> small tables locally when doing joins.
>
> - Dan
>
> On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <cresn...@mediamath.com
> > wrote:
>
>> The problem is, AFIK, that replication count is not necessarily the
>> distribution count, so you can't guarantee all tablet servers will have a
>> copy.
>>
>> On Mar 16, 2018 1:41 PM, Boris Tyukin <bo...@boristyukin.com> wrote:
>> I'm new to Kudu but we are also going to use Impala mostly with Kudu. We
>> have a few tables that are small but used a lot. My plan is replicate them
>> more than 3 times. When you create a kudu table, you can specify number of
>> replicated copies (3 by default) and I guess you can put there a number,
>> corresponding to your node count in cluster. The downside, you cannot
>> change that number unless you recreate a table.
>>
>> On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <cre...@gmail.com> wrote:
>>
>>> We will soon be moving our analytics from AWS Redshift to Impala/Kudu.
>>> One Redshift feature that we will miss is its ALL Distribution, where a
>>> copy of a table is maintained on each server. We define a number of
>>> metadata tables this way since they are used in nearly every query. We are
>>> considering using parquet in HDFS cache for these, and Kudu would be a much
>>> better fit for the update semantics but we are worried about the additional
>>> contention.  I'm wondering if having a Broadcast, or ALL, tablet
>>> replication might be an easy feature to add to Kudu?
>>>
>>> -Cliff
>>>
>>
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Kudu as a Graphite backend

2018-03-05 Thread Todd Lipcon
Hey Mark,

Yea, I wrote the original Graphite integration in the samples repo several
years ago (prior to Kudu 0.5 even), but it was more of a quick prototype in
order to have a demo of the Python client rather than something meant to be
used in a production scenario. Of course with some work it could probably
be updated and made more "real".

You may also be interested in the 'kudu-ts' project that Dan Burkert
started: https://github.com/danburkert/kudu-ts

It provides an OpenTSDB-compatible interface on top of Kudu. Unfortunately
it's also somewhat incomplete but could provide a decent starting point for
a time series workload.

It would be great if you wanted to contribute to either the graphite-kudu
integration or kudu-ts. Neither is getting the love they deserve right now.

-Todd

On Mon, Mar 5, 2018 at 7:38 AM, Paul Brannan <paul.bran...@thesystech.com>
wrote:

> Do you want to use kudu as a backend for carbon (i.e. have graphite/carbon
> receive metrics and write them to kudu), or do you want to use graphite-web
> as a frontend for timeseries you already have in kudu?  Both are mostly
> straightforward; see e.g. https://github.com/criteo/
> biggraphite/tree/master/biggraphite/plugins for an example of each.
> AFAICT the examples (https://github.com/cloudera/
> kudu-examples/tree/master/python/graphite-kudu) are just a graphite-web
> finder; you'd still need a carbon backend for storage, unless you insert
> the data through a different mechanism.
>
> On Mon, Mar 5, 2018 at 6:15 AM, Mark Meyer <mark.me...@smaato.com> wrote:
>
>> Hi List,
>> has anybody experiences running Kudu as a Graphite backend in production?
>> I've been looking at the samples repository, but have been unsure,
>> primarily because of the 'samples' tag associated with the code.
>>
>> Best, Mark
>>
>> --
>> Mark Meyer
>> Systems Engineer
>> mark.me...@smaato.com
>> Smaato Inc.
>> San Francisco – New York – Hamburg – Singapore
>> www.smaato.com
>>
>> Valentinskamp 70,
>> Emporio, 19th Floor
>> 20355 Hamburg
>> T: ­0049 (40) 3480 949 0
>> F: 0049 (40) 492 19 055
>>
>> The information contained in this communication may be CONFIDENTIAL and
>> is intended only for the use of the recipient(s) named above. If you are
>> not the intended recipient, you are hereby notified that any dissemination,
>> distribution, or copying of this communication, or any of its contents, is
>> strictly prohibited. If you have received this communication in error,
>> please notify the sender and delete/destroy the original message and any
>> copy of it from your computer or paper files.
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Using Kudu to Handle Huge amount of Data

2018-02-04 Thread Todd Lipcon
Hi JP,

Answers inline...

On Thu, Feb 1, 2018 at 9:45 PM, Jp Gupta <newlife...@gmail.com> wrote:

> Hi,
> As an existing HBase user, we handle close to 20TB of data everyday.
>

What does "handle" mean in this case? You are inserting 20TB of new data
each day, so that your total dataset grows by that amount? How much data do
you retain? How many nodes is your cluster? (I would guess many hundred?)


>
> While we are contemplating on moving to Kudu to take advantage of the new
> technology, I am yet to hear of an real industry use case where Kudu is
> being to used to handle of  huge amount of data.
>

If you are seeing Kudu as an "improved HBase" that isn't really accurate.
Of course there are some things we can do better than HBase, but there are
some things HBase can do better than Kudu.

As for Kudu data sizes, I am aware of some organizations storing several
hundred TB in a Kudu cluster, but I have not yet heard of a use case with
1PB+. If you are looking to run at that scale you may hit some issues, but
we are standing ready to help you overcome them. I don't see any
fundamental problems that would prevent it, and I have run some basic smoke
tests of Kudu on ~800 nodes before.


>
> Looking forward to your inputs on any organisation using Kudu where data
> volumes of more than 10 TB is ingested everyday.
>

Hope some other users can chime in.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Bulk / Initial load of large tables into Kudu using Spark

2018-01-30 Thread Todd Lipcon
On Mon, Jan 29, 2018 at 1:19 PM, Boris Tyukin <bo...@boristyukin.com> wrote:

> thank you both. Does it make a difference from performance perspective
> though if I do a bulk load through Impala versus Spark? is the Kudu client
> with Spark will be faster than Impala?
>

Impala in recent versions has some tricks it does to pre-sort and
pre-shuffle the data to avoid compactions in Kudu during the insert. Spark
does not currently have these optimizations. So I would guess that Impala
would be able to bulk load large datasets more efficiently than Spark for
the time being.

-Todd


>
> On Mon, Jan 29, 2018 at 2:22 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> On Mon, Jan 29, 2018 at 11:18 AM, Patrick Angeles <patr...@cloudera.com>
>> wrote:
>>
>>> Hi Boris.
>>>
>>> 1) I would like to bypass Impala as data for my bulk load coming from
>>>> sqoop and avro files are stored on HDFS.
>>>>
>>> What's the objection to Impala? In the example below, Impala reads from
>>> an HDFS-resident table, and writes to the Kudu table.
>>>
>>>
>>>> 2) we do not want to deal with MapReduce.
>>>>
>>>
>>> You can still use Spark... the MR reference is in regards to the
>>> Input/OutputFormat classes, which are defined in Hadoop MR. Spark can use
>>> these. See, for example:
>>>
>>> https://dzone.com/articles/implementing-hadoops-input-format
>>> -and-output-forma
>>>
>>
>> While that's possible I'd recommend using the dataframes API instead. eg
>> see https://kudu.apache.org/docs/developing.html#_kudu_integ
>> ration_with_spark
>>
>> That should work as well (or better) than the MR outputformat.
>>
>> -Todd
>>
>>
>>
>>> However, you'll have to write (simple) Spark code, whereas with method
>>> #1 you do effectively the same thing under the covers using SQL statements
>>> via Impala.
>>>
>>>
>>>>
>>>> Thanks!
>>>> What’s the most efficient way to bulk load data into Kudu?
>>>> <https://kudu.apache.org/faq.html#whats-the-most-efficient-way-to-bulk-load-data-into-kudu>
>>>>
>>>> The easiest way to load data into Kudu is if the data is already
>>>> managed by Impala. In this case, a simple INSERT INTO TABLE
>>>> some_kudu_table SELECT * FROM some_csv_table does the trick.
>>>>
>>>> You can also use Kudu’s MapReduce OutputFormat to load data from HDFS,
>>>> HBase, or any other data store that has an InputFormat.
>>>>
>>>> No tool is provided to load data directly into Kudu’s on-disk data
>>>> format. We have found that for many workloads, the insert performance of
>>>> Kudu is comparable to bulk load performance of other systems.
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Bulk / Initial load of large tables into Kudu using Spark

2018-01-29 Thread Todd Lipcon
On Mon, Jan 29, 2018 at 11:18 AM, Patrick Angeles <patr...@cloudera.com>
wrote:

> Hi Boris.
>
> 1) I would like to bypass Impala as data for my bulk load coming from
>> sqoop and avro files are stored on HDFS.
>>
> What's the objection to Impala? In the example below, Impala reads from an
> HDFS-resident table, and writes to the Kudu table.
>
>
>> 2) we do not want to deal with MapReduce.
>>
>
> You can still use Spark... the MR reference is in regards to the
> Input/OutputFormat classes, which are defined in Hadoop MR. Spark can use
> these. See, for example:
>
> https://dzone.com/articles/implementing-hadoops-input-
> format-and-output-forma
>

While that's possible I'd recommend using the dataframes API instead. eg
see
https://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark

That should work as well (or better) than the MR outputformat.
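
A minimal sketch of that path from the Spark shell (assuming kudu-spark is on
the classpath, "df" is the dataframe read from the HDFS/Avro files, and the
target Kudu table already exists with a matching schema; names are placeholders):

  import org.apache.kudu.spark.kudu.KuduContext

  val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)

  // Optional: since Spark doesn't pre-sort the data for you the way Impala does,
  // sorting each partition by the primary key keeps the writes in roughly PK order.
  val sorted = df.sortWithinPartitions("id")
  kuduContext.insertRows(sorted, "my_kudu_table")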

-Todd



> However, you'll have to write (simple) Spark code, whereas with method #1
> you do effectively the same thing under the covers using SQL statements via
> Impala.
>
>
>>
>> Thanks!
>> What’s the most efficient way to bulk load data into Kudu?
>> <https://kudu.apache.org/faq.html#whats-the-most-efficient-way-to-bulk-load-data-into-kudu>
>>
>> The easiest way to load data into Kudu is if the data is already managed
>> by Impala. In this case, a simple INSERT INTO TABLE some_kudu_table
>> SELECT * FROM some_csv_table does the trick.
>>
>> You can also use Kudu’s MapReduce OutputFormat to load data from HDFS,
>> HBase, or any other data store that has an InputFormat.
>>
>> No tool is provided to load data directly into Kudu’s on-disk data
>> format. We have found that for many workloads, the insert performance of
>> Kudu is comparable to bulk load performance of other systems.
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: new Kudu benchmarks

2018-01-08 Thread Todd Lipcon
Thanks for making the updates. I tweeted it from my account and from
@ApacheKudu. feel free to retweet!

-Todd

On Sat, Jan 6, 2018 at 1:10 PM, Boris Tyukin <bo...@boristyukin.com> wrote:

> thanks Todd, updated my post with that info and also changes title a bit.
> thanks again for your feedback! look forward to new releases coming up!
>
> Boris
>
> On Fri, Jan 5, 2018 at 9:08 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> On Fri, Jan 5, 2018 at 5:50 PM, Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> Hi Todd,
>>>
>>> thanks for your feedback! sure will be happy to update my post with your
>>> suggestions. I am not sure Apache Parquet will be clear though as some
>>> might understand it as using parquet files with Hive or Spark. What do you
>>> think about "Impala on Kudu vs Impala on Parquet"? Realistically, for BI
>>> users, Impala is the only option now with Kudu. Not many typical users will
>>> use Kudu API clients or even Spark and Hive serde for Kudu does not exist.
>>>
>>
>> I think "Impala on Kudu vs Parquet" or "Impala Storage Comparison: Kudu
>> vs Parquet" or something would be a reasonable title.
>>
>>
>>>
>>> As for decimals, this is exciting news. Where can I found info about
>>> timestamp support? I saw this JIRA
>>> https://issues.apache.org/jira/browse/IMPALA-5137
>>>
>>> but I was a bit confused by the actual change. It looked like a
>>> workaround to do a conversion on the fly for impala but not actually store
>>> proper timestamps in Kudu. Maybe I misread that. I thought the idea was to
>>> add a proper support in Kudu so timestamp can be used as a type with other
>>> clients not only Impala. If you can clarify that, it would be great
>>>
>>
>> What we implemented is "proper" timestamp support in Kudu, but you're
>> right that there is some conversion going on under the hood. The reasoning
>> is that Impala internally uses a 96-bit timestamp representation which
>> supports a very large range of dates at nanosecond precision. This is more
>> than is required by the SQL standard and doesn't match the timestamp
>> representation used by other ecosystem components. As far as I know, Impala
>> is planning on moving to a 64-bit timestamp representation with microsecond
>> precision, so that's what Kudu implemented internally. With 64 bits there
>> is still enough range to store dates for 584,554 years at microsecond
>> precision.
>>
>> I think https://impala.apache.org/docs/build/html/topics/impal
>> a_timestamp.html has some info about Kudu compatibility and limitations.
>>
>> -Todd
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: new Kudu benchmarks

2018-01-05 Thread Todd Lipcon
On Fri, Jan 5, 2018 at 5:50 PM, Boris Tyukin <bo...@boristyukin.com> wrote:

> Hi Todd,
>
> thanks for your feedback! sure will be happy to update my post with your
> suggestions. I am not sure Apache Parquet will be clear though as some
> might understand it as using parquet files with Hive or Spark. What do you
> think about "Impala on Kudu vs Impala on Parquet"? Realistically, for BI
> users, Impala is the only option now with Kudu. Not many typical users will
> use Kudu API clients or even Spark and Hive serde for Kudu does not exist.
>

I think "Impala on Kudu vs Parquet" or "Impala Storage Comparison: Kudu vs
Parquet" or something would be a reasonable title.


>
> As for decimals, this is exciting news. Where can I found info about
> timestamp support? I saw this JIRA
> https://issues.apache.org/jira/browse/IMPALA-5137
>
> but I was a bit confused by the actual change. It looked like a workaround
> to do a conversion on the fly for impala but not actually store proper
> timestamps in Kudu. Maybe I misread that. I thought the idea was to add a
> proper support in Kudu so timestamp can be used as a type with other
> clients not only Impala. If you can clarify that, it would be great
>

What we implemented is "proper" timestamp support in Kudu, but you're right
that there is some conversion going on under the hood. The reasoning is
that Impala internally uses a 96-bit timestamp representation which
supports a very large range of dates at nanosecond precision. This is more
than is required by the SQL standard and doesn't match the timestamp
representation used by other ecosystem components. As far as I know, Impala
is planning on moving to a 64-bit timestamp representation with microsecond
precision, so that's what Kudu implemented internally. With 64 bits there
is still enough range to store dates for 584,554 years at microsecond
precision.
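
For illustration, a minimal Java sketch of that mapping; the helper below
and the use of java.sql.Timestamp are assumptions for the example, not the
exact conversion code the clients use:

import java.sql.Timestamp;

public class MicrosExample {
  // Convert a java.sql.Timestamp to microseconds since the Unix epoch,
  // i.e. the 64-bit representation described above (sketch; ignores
  // pre-1970 edge cases).
  static long toMicros(Timestamp ts) {
    long wholeSeconds = ts.getTime() / 1000;  // strip the millisecond part
    return wholeSeconds * 1_000_000L + ts.getNanos() / 1000;
  }

  public static void main(String[] args) {
    Timestamp ts = Timestamp.valueOf("2018-01-05 12:34:56.123456789");
    System.out.println(toMicros(ts));  // nanos beyond micros are truncated
  }
}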

I think
https://impala.apache.org/docs/build/html/topics/impala_timestamp.html has
some info about Kudu compatibility and limitations.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: new Kudu benchmarks

2018-01-05 Thread Todd Lipcon
Hey Mauricio,

Answers inline below

On Fri, Jan 5, 2018 at 2:50 PM, Mauricio Aristizabal <
mauri...@impactradius.com> wrote:

> Todd, since you bring it up in this thread... what CDH version do you
> expect DECIMAL support to make it into? I recently asked Icaro Vazquez
> about it but still no news.  We're hoping it makes it into 5.14 otherwise
> according to the roadmap there might not be another minor release and we'd
> be waiting till Summer for CDH 6.
>

As this is an open source project mailing list, it would be inappropriate
for me to comment on a vendor's release schedule. Please note that Kudu is
a product of the Apache Software Foundation and the ASF doesn't have any
influence on or knowledge of Cloudera's release plans.

Of course it happens that I and many other contributors are also employees
of Cloudera, but we participate in the ASF as individuals and not
representatives of our employer, and so generally won't comment on
questions like this in this forum. Please refer to Cloudera's forums for
questions about CDH release plans, etc.


>
> And just in case we're forced to make do without DECIMAL initially, is the
> recommendation really to store as string and convert?  I was thinking of
> storing as int/long and dividing by 10 or 1000 as needed in an impala view
> over the kudu table.  Wouldn't a division be way more performant than a
> conversion from string, especially when aggregating over thousands of
> records in a report query?
>

You're right -- using an integer type and division by a power of 10 is
going to be much faster than casting from a string.  Division by a constant
would be JITted by Impala into a pretty minimal sequence of assembly
instructions (two bitshifts, an integer multiplication, and a subtraction)
which likely take about 6 cycles total. In contrast, a cast from string to
decimal probably takes many thousands of cycles.

The only downside is that if you have end users using the data they might
be confused by the integer representation whereas a string representation
would be a little clearer.
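
As a rough sketch of the scaled-integer approach (the literal value and the
scale of two decimal places are just assumptions for illustration):

import java.math.BigDecimal;

public class ScaledDecimalExample {
  public static void main(String[] args) {
    // Write path: scale "12345.67" to a 64-bit integer number of cents,
    // suitable for an INT64 Kudu column (e.g. via PartialRow.addLong).
    long cents = new BigDecimal("12345.67")
        .movePointRight(2)   // -> 1234567
        .longValueExact();   // throws if the value would not fit exactly

    // Read path: what dividing by 100 (e.g. in an Impala view) recovers.
    BigDecimal recovered = BigDecimal.valueOf(cents, 2);  // 12345.67
    System.out.println(cents + " -> " + recovered);
  }
}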

Thanks
-Todd


>
> On Fri, Jan 5, 2018 at 11:13 AM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> Oh, one other piece of feedback: maybe worth editing the title to say "vs
>> Apache Parquet" instead of "vs Apache Impala" since in all cases you are
>> using Impala as the query engine?
>>
>> -Todd
>>
>> On Fri, Jan 5, 2018 at 11:06 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>>> Hey Boris,
>>>
>>> Thanks for publishing this. It's a great look at how an end user
>>> evaluates Kudu. I appreciate that you cover both the pros and cons of the
>>> technology, and glad to see that your conclusion leaves you excited about
>>> Kudu :)
>>>
>>> One quick note is that I think you'll be even more pleased when you
>>> upgrade to a later version (eg Kudu 1.5). We've improved performance in
>>> several areas and also improved scalability compared to the version you're
>>> testing. TIMESTAMP is also supported now, with DECIMAL soon to follow. It
>>> might be worth noting this as an addendum to the blog post if you feel like
>>> it.
>>>
>>> -Todd
>>>
>>> On Fri, Jan 5, 2018 at 10:51 AM, Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> we just finished testing Kudu, mostly comparing Kudu to Impala on
>>>> HDFS/parquet. I wanted to share my blog post and results. We used typical
>>>> (and real) healthcare data for the test, not a synthetic data which I think
>>>> makes it is a bit more interesting.
>>>>
>>>> I welcome any feedback!
>>>>
>>>> http://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/
>>>>
>>>> We are really impressed with Kudu and I wanted to take an opportunity
>>>> to thank Kudu developers for such an amazing and much-needed product.
>>>>
>>>> Boris
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> *MAURICIO ARISTIZABAL*
> Architect - Business Intelligence + Data Science
> mauri...@impactradius.com(m)+1 323 309 4260 <(323)%20309-4260>
> 223 E. De La Guerra St. | Santa Barbara, CA 93101
> <https://maps.google.com/?q=223+E.+De+La+Guerra+St.+%7C+Santa+Barbara,+CA+93101=gmail=g>
>
> Overview <http://www.impactradius.com/?src=slsap> | Twitter
> <https://twitter.com/impactradius> | Facebook
> <https://www.facebook.com/pages/Impact-Radius/153376411365183> | LinkedIn
> <https://www.linkedin.com/company/impact-radius-inc->
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: new Kudu benchmarks

2018-01-05 Thread Todd Lipcon
Oh, one other piece of feedback: maybe worth editing the title to say "vs
Apache Parquet" instead of "vs Apache Impala" since in all cases you are
using Impala as the query engine?

-Todd

On Fri, Jan 5, 2018 at 11:06 AM, Todd Lipcon <t...@cloudera.com> wrote:

> Hey Boris,
>
> Thanks for publishing this. It's a great look at how an end user evaluates
> Kudu. I appreciate that you cover both the pros and cons of the technology,
> and glad to see that your conclusion leaves you excited about Kudu :)
>
> One quick note is that I think you'll be even more pleased when you
> upgrade to a later version (eg Kudu 1.5). We've improved performance in
> several areas and also improved scalability compared to the version you're
> testing. TIMESTAMP is also supported now, with DECIMAL soon to follow. It
> might be worth noting this as an addendum to the blog post if you feel like
> it.
>
> -Todd
>
> On Fri, Jan 5, 2018 at 10:51 AM, Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> Hi guys,
>>
>> we just finished testing Kudu, mostly comparing Kudu to Impala on
>> HDFS/parquet. I wanted to share my blog post and results. We used typical
>> (and real) healthcare data for the test, not a synthetic data which I think
>> makes it is a bit more interesting.
>>
>> I welcome any feedback!
>>
>> http://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/
>>
>> We are really impressed with Kudu and I wanted to take an opportunity to
>> thank Kudu developers for such an amazing and much-needed product.
>>
>> Boris
>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: new Kudu benchmarks

2018-01-05 Thread Todd Lipcon
Hey Boris,

Thanks for publishing this. It's a great look at how an end user evaluates
Kudu. I appreciate that you cover both the pros and cons of the technology,
and glad to see that your conclusion leaves you excited about Kudu :)

One quick note is that I think you'll be even more pleased when you upgrade
to a later version (eg Kudu 1.5). We've improved performance in several
areas and also improved scalability compared to the version you're testing.
TIMESTAMP is also supported now, with DECIMAL soon to follow. It might be
worth noting this as an addendum to the blog post if you feel like it.

-Todd

On Fri, Jan 5, 2018 at 10:51 AM, Boris Tyukin <bo...@boristyukin.com> wrote:

> Hi guys,
>
> we just finished testing Kudu, mostly comparing Kudu to Impala on
> HDFS/parquet. I wanted to share my blog post and results. We used typical
> (and real) healthcare data for the test, not a synthetic data which I think
> makes it is a bit more interesting.
>
> I welcome any feedback!
>
> http://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/
>
> We are really impressed with Kudu and I wanted to take an opportunity to
> thank Kudu developers for such an amazing and much-needed product.
>
> Boris
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Data inconsistency after restart

2018-01-04 Thread Todd Lipcon
>
>>>>>>
>>>>>>
>>>>>> In general, you can use the `ksck` tool to check the health of your
>>>>>> cluster. See
>>>>>> https://kudu.apache.org/docs/command_line_tools_reference.html#cluster-ksck
>>>>>> for more details. For restarting a cluster, I
>>>>>> would recommend taking down all tablet servers at once, otherwise
>>>>>> tablet
>>>>>> replicas may try to replicate data from the server that was taken
>>>>>> down.
>>>>>>
>>>>>> Hope this helped,
>>>>>> Andrew
>>>>>>
>>>>>> On Tue, Dec 5, 2017 at 10:42 AM, Petter von Dolwitz (Hem) <
>>>>>> petter.von.dolw...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Kudu users,
>>>>>>>
>>>>>>> We just started to use Kudu (1.4.0+cdh5.12.1). To make a baseline for
>>>>>>> evaluation we ingested 3 month worth of data. During ingestion we
>>>>>>> were
>>>>>>> facing messages from the maintenance threads that a soft memory
>>>>>>> limit were
>>>>>>> reached. It seems like the background maintenance threads stopped
>>>>>>> performing their tasks at this point in time. It also so seems like
>>>>>>> the
>>>>>>> memory was never recovered even after stopping ingestion so I guess
>>>>>>> there
>>>>>>> was a large backlog being built up. I guess the root cause here is
>>>>>>> that we
>>>>>>> were a bit too conservative when giving Kudu memory. After a
>>>>>>> reststart a
>>>>>>> lot of maintenance tasks were started (i.e. compaction).
>>>>>>>
>>>>>>> When we verified that all data was inserted we found that some data
>>>>>>> was missing. We added this missing data and on some chunks we got the
>>>>>>> information that all rows were already present, i.e impala says
>>>>>>> something
>>>>>>> like Modified: 0 rows, nnn errors. Doing the verification again
>>>>>>> now
>>>>>>> shows that the Kudu table is complete. So, even though we did not
>>>>>>> insert
>>>>>>> any data on some chunks, a count(*) operation over these chunks now
>>>>>>> returns
>>>>>>> a different value.
>>>>>>>
>>>>>>> Now to my question. Will data be inconsistent if we recycle Kudu
>>>>>>> after
>>>>>>> seeing soft memory limit warnings?
>>>>>>>
>>>>>>> Is there a way to tell when it is safe to restart Kudu to avoid these
>>>>>>> issues? Should we use any special procedure when restarting (e.g.
>>>>>>> only
>>>>>>> restart the tablet servers, only restart one tablet server at a time
>>>>>>> or
>>>>>>> something like that)?
>>>>>>>
>>>>>>> The table design uses 50 tablets per day (times 90 days). It is 8 TB
>>>>>>> of data after 3xreplication over 5 tablet servers.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Petter
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Andrew Wong
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Andrew Wong
>>>>>
>>>>>
>>>>
>>>>
>>>
>
> --
> David Alves
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: INT128 Column Support Interest

2017-11-20 Thread Todd Lipcon
On Mon, Nov 20, 2017 at 1:12 PM, Grant Henke <ghe...@cloudera.com> wrote:

> Thank you for the feedback. Below are some responses.
>
> Do we have a compatible SQL type to map this to in Spark SQL, Impala,
> > Presto, etc? What type would we map to in Java?
>
>
> In Java we would Map to a BigInteger. Their isn't a perfectly natural
> mapping for SQL that I know of. It has been mentioned in the past that we
> could have server side flags to disable/enable the ability to create
> columns of certain types to prevent users from creating tables that are not
> readable by certain integrations. This problem exists today with the BINARY
> column type.
>

I'm somewhat against such a configuration. This being a server-side
configuration results in Kudu deployments in different environments having
different sets of available types, which seems very difficult for
downstream users to deal with. Even though "least common denominator" kind
of sucks, it's also not a bad policy for software that aims to be part of a
pretty diverse ecosystem.



>
> > Why not just _not_ expose it and only expose decimal.
>
>
> Technically decimal only supports 28 9's where INT128 can support slightly
> larger numbers. Their may also be more overhead dealing with a decimal
> type. Though I am not positive about that.
>

I think without clear user demand for >28 digits it's just not worth the
complexity.


>
> Encoders: like Dan mentioned, it seems like we might not be able to do a
> > very efficient job of encoding these very large integers. Stuff like
> > bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
> > values. So, I'm a little afraid that we'll end up only with PLAIN and
> > people will be upset with the storage overhead and performance.
>
>
>  Aren't we going to need efficient encodings in order to make decimal work
> > well, anyway?
>
>
> We will need to ensure performant encoding exists for INT128 to make
> decimals with a precisions >= 18 work well anyway. We should likely have
> parity
> with the other integer types to reduce any confusion about differing
> precisions having different encoding considerations. Although Presto
> documents that precision >= 18 are slower than the others. We could do
> something similar and follow on with improvements.
>
> In the current int128 internal patch I know that the RLE doesn't work for
> int128. I don't have a lot of background on Kudu's encoding details, so
> investigating encodings further is one of my next steps.
>

That's a good point. However, I'm guessing that users are more likely to
intuitively know that "9 digits is enough" more easily than they will know
that "64 bits is enough". In my experience people underestimate the range
of 64-bit integers and might choose INT128 if available even if they have
no need for anywhere near that range.

-Todd


>
> On Thu, Nov 16, 2017 at 5:30 PM, Dan Burkert <danburk...@apache.org>
> wrote:
>
> > Aren't we going to need efficient encodings in order to make decimal work
> > well, anyway?
> >
> > - Dan
> >
> > On Thu, Nov 16, 2017 at 2:54 PM, Todd Lipcon <t...@cloudera.com> wrote:
> >
> >> On Thu, Nov 16, 2017 at 2:28 PM, Dan Burkert <danburk...@apache.org>
> >> wrote:
> >>
> >> > I think it would be useful.  As far as I've seen the main costs in
> >> > carrying data types are in writing performant encoders, and updating
> >> > integrations to work with them.  I'm guessing with 128 bit integers
> >> there
> >> > would be some integrations that can't or won't support it, which might
> >> be a
> >> > cause for confusion.  Overall, though, I think the upsides of
> efficiency
> >> > and decreased storage space are compelling.   Do you have a sense yet
> of
> >> > what encodings are going to be supported down the road (will we get to
> >> full
> >> > parity with 32/64)?
> >> >
> >>
> >> Yea, my concerns are:
> >>
> >> 1) Integrations: do we have a compatible SQL type to map this to in
> Spark
> >> SQL, Impala, Presto, etc? What type would we map to in Java? It seems
> like
> >> the most natural mapping would be DECIMAL(39) or somesuch in SQL. So, if
> >> we're going to map it the same as decimal anyway, why not just _not_
> >> expose
> >> it and only expose decimal? If someone wants to store a 128-bit hash as
> a
> >> DECIMAL(39) they are free to, of course. Postgres's built-in int types
> >> only
> >> go up to 64-bit (bigint)
> >>
> >> In addition to the choice of

Re: INT128 Column Support Interest

2017-11-16 Thread Todd Lipcon
On Thu, Nov 16, 2017 at 2:28 PM, Dan Burkert <danburk...@apache.org> wrote:

> I think it would be useful.  As far as I've seen the main costs in
> carrying data types are in writing performant encoders, and updating
> integrations to work with them.  I'm guessing with 128 bit integers there
> would be some integrations that can't or won't support it, which might be a
> cause for confusion.  Overall, though, I think the upsides of efficiency
> and decreased storage space are compelling.   Do you have a sense yet of
> what encodings are going to be supported down the road (will we get to full
> parity with 32/64)?
>

Yea, my concerns are:

1) Integrations: do we have a compatible SQL type to map this to in Spark
SQL, Impala, Presto, etc? What type would we map to in Java? It seems like
the most natural mapping would be DECIMAL(39) or somesuch in SQL. So, if
we're going to map it the same as decimal anyway, why not just _not_ expose
it and only expose decimal? If someone wants to store a 128-bit hash as a
DECIMAL(39) they are free to, of course. Postgres's built-in int types only
go up to 64-bit (bigint)

In addition to the choice of DECIMAL, for things like fixed-length binary
maybe we are better off later adding a fixed-length BINARY type, like
BINARY(16) which could be used for storing large hashes? There is precedent
for fixed-length CHAR(n) in SQL, but no such precedent for int128.


2) Encoders: like Dan mentioned, it seems like we might not be able to do a
very efficient job of encoding these very large integers. Stuff like
bitshuffle, SIMD bitpacking, etc, isn't really designed for such large
values. So, I'm a little afraid that we'll end up only with PLAIN and
people will be upset with the storage overhead and performance.

-Todd

>
> On Thu, Nov 16, 2017 at 2:19 PM, Grant Henke <ghe...@cloudera.com> wrote:
>
>> Hi all,
>>
>> As a part of adding DECIMAL support to Kudu it was necessary to add
>> internal support for 128 bit integers. Taking that one step further and
>> supporting public columns and APIs for 128 bit integers would not be too
>> much additional work. However, I wanted to gauge the interest from the
>> community.
>>
>> My initial thoughts are that having an INT128 column type could be useful
>> for things like UUIDs, IPv6 addresses, MD5 hashes and other similar types
>> of data.
>>
>> Is there any interest or uses for a INT128 column type? Is anyone
>> currently using a STRING or BINARY column for 128 bit data?
>>
>> Thank you,
>> Grant
>> --
>> Grant Henke
>> Software Engineer | Cloudera
>> gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: The service queue is full; it has 400 items.. Retrying in the next heartbeat period.

2017-11-03 Thread Todd Lipcon
One thing you might try is to update the consensus rpc timeout to 30
seconds instead of 1. We changed the default in later versions.
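
For reference, a minimal sketch of what that looks like as a tablet server
flag; the flag name below is an assumption, so please verify it against the
flags your build actually supports:

  --consensus_rpc_timeout_ms=30000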

I'd also recommend updating to 1.4 or 1.5 for other related fixes to
consensus stability. I think I recall you were on 1.3 still?

Todd


On Nov 3, 2017 7:47 PM, "Lee King"  wrote:

Hi,
Our Kudu cluster has been running well for a long time, but writes have
become slow recently, and clients are also hitting RPC timeouts. I checked
the warnings and found a large number of errors like this:
W1104 10:25:16.833736 10271 consensus_peers.cc:365] T
149ffa58ac274c9ba8385ccfdc01ea14 P 59c768eb799243678ee7fa3f83801316 -> Peer
1c67a7e7ff8f4de494469766641fccd1 (cloud-sk-ds-08:7050): Couldn't send
request to peer 1c67a7e7ff8f4de494469766641fccd1 for tablet
149ffa58ac274c9ba8385ccfdc01ea14. Status: Timed out: UpdateConsensus RPC to
10.6.60.9:7050 timed out after 1.000s (SENT). Retrying in the next
heartbeat period. Already tried 5 times.
I changed the configuration to rpc_service_queue_length=400 and
rpc_num_service_threads=40, but it had no effect.
Our cluster includes 5 masters and 10 tablet servers, with 3800 GB of data
and 800 tablets per tablet server. On one of the tablet server machines:
14 GB of memory free (128 GB total), 4739 threads (max 32000), 28000 open
files (max 65536), CPU utilization around 30% (32 cores), and disk util
less than 30%.
Any suggestions for this? Thanks!


Re: Error message: 'Tried to update clock beyond the max. error.'

2017-11-01 Thread Todd Lipcon
Actually I think I understand the root cause of this. I think at some point
NTP can switch the clock from a microseconds-based mode to a
nanoseconds-based mode, at which point Kudu starts interpreting the results
of the ntp_gettime system call incorrectly, resulting in incorrect error
estimates and even time values up to 1000 seconds in the future (we read 1
billion nanoseconds as 1 billion microseconds (=1000 seconds)). I'll work
on reproducing this and a patch, to backport to previous versions.

-Todd

On Wed, Nov 1, 2017 at 5:00 PM, Todd Lipcon <t...@cloudera.com> wrote:

> What's the full log line where you're seeing this crash? Is it coming from
> tablet_bootstrap.cc, raft_consensus.cc, or elsewhere?
>
> -Todd
>
> 2017-11-01 15:45 GMT-07:00 Franco Venturi <fvent...@comcast.net>:
>
>> Our version is kudu 1.5.0-cdh5.13.0.
>>
>> Franco
>>
>>
>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Error message: 'Tried to update clock beyond the max. error.'

2017-11-01 Thread Todd Lipcon
What's the full log line where you're seeing this crash? Is it coming from
tablet_bootstrap.cc, raft_consensus.cc, or elsewhere?

-Todd

2017-11-01 15:45 GMT-07:00 Franco Venturi <fvent...@comcast.net>:

> Our version is kudu 1.5.0-cdh5.13.0.
>
> Franco
>
>
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Low ingestion rate from Kafka

2017-11-01 Thread Todd Lipcon
On Wed, Nov 1, 2017 at 2:10 PM, Chao Sun <sunc...@uber.com> wrote:

> > Great. Keep in mind that, since you have a UUID component at the front
> of your key, you are doing something like a random-write workload. So, as
> your data grows, if your PK column (and its bloom filters) ends up being
> larger than the available RAM for caching, each write may generate a disk
> seek which will make throughput plummet. This is unlike some other storage
> options like HBase which does "blind puts".
>
> > Just something to be aware of, for performance planning.
>
> Thanks for letting me know. I'll keep a note.
>
> > I think in 1.3 it was called "kudu test loadgen" and may have fewer
> options available.
>
> Cool. I just run it on one of the TS node ('kudu test loadgen 
> --num-threads=8 --num-rows-per-thread=100 --table-num-buckets=32'), and
> got the following:
>
> Generator report
>   time total  : 5434.15 ms
>   time per row: 0.000679268 ms
>
> ~1.5M / sec? looks good.
>


yep, sounds about right. My machines I was running on are relatively old
spec CPUs and also somewhat overloaded (it's a torture-test cluster of
sorts that is always way out of balance, re-replicating stuff, etc)

-Todd


>
>
>
> On Wed, Nov 1, 2017 at 1:40 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> On Wed, Nov 1, 2017 at 1:23 PM, Chao Sun <sunc...@uber.com> wrote:
>>
>>> Thanks Todd! I improved my code to use multi Kudu clients for processing
>>> the Kafka messages and
>>> was able to improve the number to 250K - 300K per sec. Pretty happy with
>>> this now.
>>>
>>
>> Great. Keep in mind that, since you have a UUID component at the front of
>> your key, you are doing something like a random-write workload. So, as your
>> data grows, if your PK column (and its bloom filters) ends up being larger
>> than the available RAM for caching, each write may generate a disk seek
>> which will make throughput plummet. This is unlike some other storage
>> options like HBase which does "blind puts".
>>
>> Just something to be aware of, for performance planning.
>>
>>
>>>
>>> Will take a look at the perf tool - looks very nice. It seems it is not
>>> available on Kudu 1.3 though.
>>>
>>>
>> I think in 1.3 it was called "kudu test loadgen" and may have fewer
>> options available.
>>
>> -Todd
>>
>> On Wed, Nov 1, 2017 at 12:23 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>>> On Wed, Nov 1, 2017 at 12:20 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>>>
>>>>> Sounds good.
>>>>>
>>>>> BTW, you can try a quick load test using the 'kudu perf loadgen'
>>>>> tool.  For example something like:
>>>>>
>>>>> kudu perf loadgen my-kudu-master.example.com --num-threads=8
>>>>> --num-rows-per-thread=100 --table-num-buckets=32
>>>>>
>>>>> There are also a bunch of options to tune buffer sizes, flush options,
>>>>> etc. But with the default settings above on an 8-node cluster I have, I 
>>>>> was
>>>>> able to insert 8M rows in 44 seconds (180k/sec).
>>>>>
>>>>> Adding --buffer-size-bytes=1000 almost doubled the above
>>>>> throughput (330k rows/sec)
>>>>>
>>>>
>>>> One more quick datapoint: I ran the above command simultaneously (in
>>>> parallel) four times. Despite running 4x as many clients,  they all
>>>> finished in the same time as a single client did (ie aggregate throughput
>>>> ~1.2M rows/sec).
>>>>
>>>> Again this isn't a scientific benchmark, and it's such a short burst of
>>>> activity that it doesn't represent a real workload, but 15k rows/sec is
>>>> definitely at least an order of magnitude lower than the peak throughput I
>>>> would expect.
>>>>
>>>> -Todd
>>>>
>>>>
>>>>>
>>>>> -Todd
>>>>>
>>>>>
>>>>>
>>>>>> On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon <t...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun <sunc...@uber.com> wrote:
>>>>>>>
>>>>>>>> Thanks Zhen and Todd.
>>>>>>>>
>>>>>>>> Yes inc

Re: Kudu background tasks

2017-11-01 Thread Todd Lipcon
Hi Janne,

It's not clear whether the issue was that it was taking a long time to
restart (i.e. replaying WALs) or if somehow you also ended up having to
re-replicate a bunch of tablets from host to host in the cluster. There
were some bugs in earlier versions of Kudu (eg KUDU-2125, KUDU-2020) which
could make this process rather slow to stabilize.

If this issue happens again, running 'kudu cluster ksck' during the
unstable period can often yield more information to help understand what is
happening.

What version are you running?

Todd


On Wed, Nov 1, 2017 at 1:16 AM, Janne Keskitalo <janne.keskit...@paf.com>
wrote:

> Hi
>
> Our Kudu test environment got unresponsive yesterday for unknown reason.
> It has three tablet servers and one master. It's running in AWS on quite
> small host machines, so maybe some node ran out of memory or something. It
> has happened before with this setup. Anyway, after we restarted kudu
> service, we couldn't do any selects. From the tablet server UI I could see
> it was initializing and bootstrapping tablets. It took many hours until all
> tablets were in RUNNING-state.
>
> My question is where can I find information about these background
> operations? I want to understand what happens in situations when some node
> is offline and then comes back up after a while. What is tablet
> initialization and bootstrapping, etc.
>
> --
> Br.
> Janne Keskitalo,
> Database Architect, PAF.COM
> For support: dbdsupp...@paf.com
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Low ingestion rate from Kafka

2017-11-01 Thread Todd Lipcon
On Wed, Nov 1, 2017 at 1:23 PM, Chao Sun <sunc...@uber.com> wrote:

> Thanks Todd! I improved my code to use multi Kudu clients for processing
> the Kafka messages and
> was able to improve the number to 250K - 300K per sec. Pretty happy with
> this now.
>

Great. Keep in mind that, since you have a UUID component at the front of
your key, you are doing something like a random-write workload. So, as your
data grows, if your PK column (and its bloom filters) ends up being larger
than the available RAM for caching, each write may generate a disk seek
which will make throughput plummet. This is unlike some other storage
options like HBase which does "blind puts".

Just something to be aware of, for performance planning.


>
> Will take a look at the perf tool - looks very nice. It seems it is not
> available on Kudu 1.3 though.
>
>
I think in 1.3 it was called "kudu test loadgen" and may have fewer options
available.

-Todd

On Wed, Nov 1, 2017 at 12:23 AM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> On Wed, Nov 1, 2017 at 12:20 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>>> Sounds good.
>>>
>>> BTW, you can try a quick load test using the 'kudu perf loadgen' tool.
>>> For example something like:
>>>
>>> kudu perf loadgen my-kudu-master.example.com --num-threads=8
>>> --num-rows-per-thread=100 --table-num-buckets=32
>>>
>>> There are also a bunch of options to tune buffer sizes, flush options,
>>> etc. But with the default settings above on an 8-node cluster I have, I was
>>> able to insert 8M rows in 44 seconds (180k/sec).
>>>
>>> Adding --buffer-size-bytes=1000 almost doubled the above throughput
>>> (330k rows/sec)
>>>
>>
>> One more quick datapoint: I ran the above command simultaneously (in
>> parallel) four times. Despite running 4x as many clients,  they all
>> finished in the same time as a single client did (ie aggregate throughput
>> ~1.2M rows/sec).
>>
>> Again this isn't a scientific benchmark, and it's such a short burst of
>> activity that it doesn't represent a real workload, but 15k rows/sec is
>> definitely at least an order of magnitude lower than the peak throughput I
>> would expect.
>>
>> -Todd
>>
>>
>>>
>>> -Todd
>>>
>>>
>>>
>>>> On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon <t...@cloudera.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun <sunc...@uber.com> wrote:
>>>>>
>>>>>> Thanks Zhen and Todd.
>>>>>>
>>>>>> Yes increasing the # of consumers will definitely help, but we also
>>>>>> want to test the best throughput we can get from Kudu.
>>>>>>
>>>>>
>>>>> Sure, but increasing the number of consumers can increase the
>>>>> throughput (without increasing the number of Kudu tablet servers).
>>>>>
>>>>> Currently, if you run 'top' on the TS nodes, do you see them using a
>>>>> high amount of CPU? Similar question for 'iostat -dxm 1' - high IO
>>>>> utilization? My guess is that at 15k/sec you are hardly utilizing the
>>>>> nodes, and you're mostly bound by round trip latencies, etc.
>>>>>
>>>>>
>>>>>>
>>>>>> I think the default batch size is 1000 rows?
>>>>>>
>>>>>
>>>>> In manual flush mode, it's up to you to determine how big your batches
>>>>> are. It will buffer until you call 'Flush()'. So you could wait until
>>>>> you've accumulated way more than 1000 to flush.
>>>>>
>>>>>
>>>>>> I tested with a few different options between 1000 and 20, but
>>>>>> always got some number between 15K to 20K per sec. Also tried flush
>>>>>> background mode and 32 hash partitions but results are similar.
>>>>>>
>>>>>
>>>>> In your AUTO_FLUSH test, were you still calling Flush()?
>>>>>
>>>>>
>>>>>> The primary key is UUID + some string column though - they always
>>>>>> come in batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2,
>>>>>> etc.
>>>>>>
>>>>>
>>>>> Given this, are you hash-partitioning on just the UUID portion of the
>>>>> PK? ie if your PK is (uuid, timestamp

Re: Low ingestion rate from Kafka

2017-11-01 Thread Todd Lipcon
On Tue, Oct 31, 2017 at 11:56 PM, Chao Sun <sunc...@uber.com> wrote:

> > Sure, but increasing the number of consumers can increase the throughput
> (without increasing the number of Kudu tablet servers).
>
> I see. Make sense. I'll test that later.
>
> > Currently, if you run 'top' on the TS nodes, do you see them using a
> high amount of CPU? Similar question for 'iostat -dxm 1' - high IO
> utilization? My guess is that at 15k/sec you are hardly utilizing the
> nodes, and you're mostly bound by round trip latencies, etc.
>
> From the top and iostat commands, the TS nodes seem pretty under-utilized.
> CPU usage is less than 10%.
>
> > In manual flush mode, it's up to you to determine how big your batches
> are. It will buffer until you call 'Flush()'. So you could wait until
> you've accumulated way more than 1000 to flush.
>
> Got it. I meant the default buffer size is 1000 - found out that I need to
> bump this up in order to bypass "buffer is too big" error.
>
> > In your AUTO_FLUSH test, were you still calling Flush()?
>
> Yes.
>

OK,  in that case, the "Flush()" call is still a synchronous flush. So you
may want to only call Flush() infrequently.


>
> > Given this, are you hash-partitioning on just the UUID portion of the
> PK? ie if your PK is (uuid, timestamp), you could hash-partitition on the
> UUID. This should ensure that you get pretty good batching of the writes.
>
> Yes, I only hash-partitioned on the UUID portion.
>

Sounds good.

BTW, you can try a quick load test using the 'kudu perf loadgen' tool.  For
example something like:

kudu perf loadgen my-kudu-master.example.com --num-threads=8
--num-rows-per-thread=100 --table-num-buckets=32

There are also a bunch of options to tune buffer sizes, flush options, etc.
But with the default settings above on an 8-node cluster I have, I was able
to insert 8M rows in 44 seconds (180k/sec).

Adding --buffer-size-bytes=1000 almost doubled the above throughput
(330k rows/sec)

-Todd



> On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>>
>>
>> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun <sunc...@uber.com> wrote:
>>
>>> Thanks Zhen and Todd.
>>>
>>> Yes increasing the # of consumers will definitely help, but we also want
>>> to test the best throughput we can get from Kudu.
>>>
>>
>> Sure, but increasing the number of consumers can increase the throughput
>> (without increasing the number of Kudu tablet servers).
>>
>> Currently, if you run 'top' on the TS nodes, do you see them using a high
>> amount of CPU? Similar question for 'iostat -dxm 1' - high IO utilization?
>> My guess is that at 15k/sec you are hardly utilizing the nodes, and you're
>> mostly bound by round trip latencies, etc.
>>
>>
>>>
>>> I think the default batch size is 1000 rows?
>>>
>>
>> In manual flush mode, it's up to you to determine how big your batches
>> are. It will buffer until you call 'Flush()'. So you could wait until
>> you've accumulated way more than 1000 to flush.
>>
>>
>>> I tested with a few different options between 1000 and 20, but
>>> always got some number between 15K to 20K per sec. Also tried flush
>>> background mode and 32 hash partitions but results are similar.
>>>
>>
>> In your AUTO_FLUSH test, were you still calling Flush()?
>>
>>
>>> The primary key is UUID + some string column though - they always come
>>> in batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2, etc.
>>>
>>
>> Given this, are you hash-partitioning on just the UUID portion of the PK?
>> ie if your PK is (uuid, timestamp), you could hash-partitition on the UUID.
>> This should ensure that you get pretty good batching of the writes.
>>
>> Todd
>>
>>
>>> On Tue, Oct 31, 2017 at 6:25 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>>> In addition to what Zhen suggests, I'm also curious how you are sizing
>>>> your batches in manual-flush mode? With 128 hash partitions, each batch is
>>>> generating 128 RPCs, so if for example you are only batching 1000 rows at a
>>>> time, you'll end up with a lot of fixed overhead in each RPC to insert just
>>>> 1000/128 = ~8 rows.
>>>>
>>>> Generally I would expect an 8 node cluster (even with HDDs) to be able
>>>> to sustain several hundred thousand rows/second insert rate. Of course, it
>>>> depends on the size of the rows and also the primary key you've chosen. If
>>>> your primary key i

Re: Low ingestion rate from Kafka

2017-10-31 Thread Todd Lipcon
In addition to what Zhen suggests, I'm also curious how you are sizing your
batches in manual-flush mode? With 128 hash partitions, each batch is
generating 128 RPCs, so if for example you are only batching 1000 rows at a
time, you'll end up with a lot of fixed overhead in each RPC to insert just
1000/128 = ~8 rows.

Generally I would expect an 8 node cluster (even with HDDs) to be able to
sustain several hundred thousand rows/second insert rate. Of course, it
depends on the size of the rows and also the primary key you've chosen. If
your primary key is generally increasing (such as the kafka sequence
number) then you should have very little compaction and good performance.

-Todd

On Tue, Oct 31, 2017 at 6:20 PM, Zhen Zhang <zhqu...@gmail.com> wrote:

> Maybe you can add your consumer number? In my opinion, more threads to
> insert can give a better throughput.
>
> 2017-10-31 15:07 GMT+08:00 Chao Sun <sunc...@uber.com>:
>
>> OK. Thanks! I changed to manual flush mode and it increased to ~15K /
>> sec. :)
>>
>> Is there any other tuning I can do to further improve this? and also, how
>> much would
>> SSD help in this case (only upsert)?
>>
>> Thanks again,
>> Chao
>>
>> On Mon, Oct 30, 2017 at 11:42 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>>> If you want to manage batching yourself you can use the manual flush
>>> mode. Easiest would be the auto flush background mode.
>>>
>>> Todd
>>>
>>> On Oct 30, 2017 11:10 PM, "Chao Sun" <sunc...@uber.com> wrote:
>>>
>>>> Hi Todd,
>>>>
>>>> Thanks for the reply! I used a single Kafka consumer to pull the data.
>>>> For Kudu, I was doing something very simple that basically just follow
>>>> the example here
>>>> <https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java>
>>>> .
>>>> In specific:
>>>>
>>>> loop {
>>>>   Insert insert = kuduTable.newInsert();
>>>>   PartialRow row = insert.getRow();
>>>>   // fill the columns
>>>>   kuduSession.apply(insert)
>>>> }
>>>>
>>>> I didn't specify the flushing mode, so it will pick up the
>>>> AUTO_FLUSH_SYNC as default?
>>>> should I use MANUAL_FLUSH?
>>>>
>>>> Thanks,
>>>> Chao
>>>>
>>>> On Mon, Oct 30, 2017 at 10:39 PM, Todd Lipcon <t...@cloudera.com>
>>>> wrote:
>>>>
>>>>> Hey Chao,
>>>>>
>>>>> Nice to hear you are checking out Kudu.
>>>>>
>>>>> What are you using to consume from Kafka and write to Kudu? Is it
>>>>> possible that it is Java code and you are using the SYNC flush mode? That
>>>>> would result in a separate round trip for each record and thus very low
>>>>> throughput.
>>>>>
>>>>> Todd
>>>>>
>>>>> On Oct 30, 2017 10:23 PM, "Chao Sun" <sunc...@uber.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision
>>>>> af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on a 8-node cluster.
>>>>> The data are coming from Kafka at a rate of around 30K / sec, and hash
>>>>> partitioned into 128 buckets. However, with default settings, Kudu can 
>>>>> only
>>>>> consume the topics at a rate of around 1.5K / second. This is a direct
>>>>> ingest with no transformation on the data.
>>>>>
>>>>> Could this because I was using the default configurations? also we are
>>>>> using Kudu on HDD - could that also be related?
>>>>>
>>>>> Any help would be appreciated. Thanks.
>>>>>
>>>>> Best,
>>>>> Chao
>>>>>
>>>>>
>>>>>
>>>>
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Low ingestion rate from Kafka

2017-10-31 Thread Todd Lipcon
If you want to manage batching yourself you can use the manual flush mode.
Easiest would be the auto flush background mode.
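
As a rough sketch of the background mode, mirroring the loop quoted below
(the record type and column-filling are placeholders):

KuduSession session = client.newSession();
session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
for (Record r : recordsFromKafka) {   // hypothetical record type
  Insert insert = kuduTable.newInsert();
  PartialRow row = insert.getRow();
  // ... fill the columns from r ...
  session.apply(insert);              // just buffers; flushed in background
}
session.flush();                      // drain anything still buffered
session.close();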

Todd

On Oct 30, 2017 11:10 PM, "Chao Sun" <sunc...@uber.com> wrote:

> Hi Todd,
>
> Thanks for the reply! I used a single Kafka consumer to pull the data.
> For Kudu, I was doing something very simple that basically just follow the
> example here
> <https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java>
> .
> In specific:
>
> loop {
>   Insert insert = kuduTable.newInsert();
>   PartialRow row = insert.getRow();
>   // fill the columns
>   kuduSession.apply(insert)
> }
>
> I didn't specify the flushing mode, so it will pick up the AUTO_FLUSH_SYNC
> as default?
> should I use MANUAL_FLUSH?
>
> Thanks,
> Chao
>
> On Mon, Oct 30, 2017 at 10:39 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> Hey Chao,
>>
>> Nice to hear you are checking out Kudu.
>>
>> What are you using to consume from Kafka and write to Kudu? Is it
>> possible that it is Java code and you are using the SYNC flush mode? That
>> would result in a separate round trip for each record and thus very low
>> throughput.
>>
>> Todd
>>
>> On Oct 30, 2017 10:23 PM, "Chao Sun" <sunc...@uber.com> wrote:
>>
>> Hi,
>>
>> We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision
>> af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on a 8-node cluster.
>> The data are coming from Kafka at a rate of around 30K / sec, and hash
>> partitioned into 128 buckets. However, with default settings, Kudu can only
>> consume the topics at a rate of around 1.5K / second. This is a direct
>> ingest with no transformation on the data.
>>
>> Could this because I was using the default configurations? also we are
>> using Kudu on HDD - could that also be related?
>>
>> Any help would be appreciated. Thanks.
>>
>> Best,
>> Chao
>>
>>
>>
>


Re: Low ingestion rate from Kafka

2017-10-30 Thread Todd Lipcon
Hey Chao,

Nice to hear you are checking out Kudu.

What are you using to consume from Kafka and write to Kudu? Is it possible
that it is Java code and you are using the SYNC flush mode? That would
result in a separate round trip for each record and thus very low
throughput.

Todd

On Oct 30, 2017 10:23 PM, "Chao Sun"  wrote:

Hi,

We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision
af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on a 8-node cluster.
The data are coming from Kafka at a rate of around 30K / sec, and hash
partitioned into 128 buckets. However, with default settings, Kudu can only
consume the topics at a rate of around 1.5K / second. This is a direct
ingest with no transformation on the data.

Could this because I was using the default configurations? also we are
using Kudu on HDD - could that also be related?

Any help would be appreciated. Thanks.

Best,
Chao


Re: Re: Re: How kudu synchronize real-time records?

2017-10-26 Thread Todd Lipcon
What Helifu said is correct that writes are funneled through the leader.

Reads can either be through the leader (which can perform immediately with
full consistency) or at a follower. On a follower, the client can choose
between the following:

a) low consistency: read whatever the follower happens to have. Currently
this mode is called READ_LATEST in the source but it should probably be
called READ_ANYTHING or READ_INCONSISTENT. It reads "the latest thing that
this replica has".
b) snapshot consistency at current time: this may cause the follower to
wait until it has heard from the leader and knows that it is up-to-date as
of the time that the scan started. This gives the same guarantee as reading
from the leader but can add some latency
c) snapshot consistency in the past: given a timestamp, the follower can
know whether it is up-to-date as of that timestamp. If so, it can do a
consistent read immediately. Otherwise, it will have to wait, as above.
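
For what it's worth, a minimal Java-client sketch of picking a mode per
scanner; the master address and table name are placeholders, and the
snapshot-timestamp call mentioned in the comment is only needed for (c):

import org.apache.kudu.client.*;

public class ReadModeExample {
  public static void main(String[] args) throws Exception {
    KuduClient client =
        new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    KuduTable table = client.openTable("metrics");  // placeholder table

    KuduScanner scanner = client.newScannerBuilder(table)
        // (a) READ_LATEST: read whatever the chosen replica has.
        // (b)/(c) READ_AT_SNAPSHOT: consistent snapshot; for (c), also pin
        //         a past timestamp with .snapshotTimestampMicros(micros).
        .readMode(AsyncKuduScanner.ReadMode.READ_AT_SNAPSHOT)
        .build();

    while (scanner.hasMoreRows()) {
      for (RowResult row : scanner.nextRows()) {
        System.out.println(row.rowToString());
      }
    }
    client.shutdown();
  }
}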

You can learn more about this in the recent blog post authored by David
Alves at: https://kudu.apache.org/2017/09/18/kudu-consistency-pt1.html
Also please check out the docs at:
https://kudu.apache.org/docs/transaction_semantics.html


Hope that helps
-Todd

On Thu, Oct 26, 2017 at 3:18 AM, helifu <hzhel...@corp.netease.com> wrote:

> Sorry for my mistake.
> The copy replica can be read by clients using the API below from client.h:
>
> Status SetSelection(KuduClient::ReplicaSelection selection)
> WARN_UNUSED_RESULT;
>
> enum ReplicaSelection {
> LEADER_ONLY,  ///< Select the LEADER replica.
>
> CLOSEST_REPLICA,  ///< Select the closest replica to the
> client,
>   ///< or a random one if all replicas are
> equidistant.
>
> FIRST_REPLICA ///< Select the first replica in the list.
> };
>
>
> 何李夫
> 2017-04-10 16:06:24
>
> -----Original Message-----
> From: user-return-1102-hzhelifu=corp.netease@kudu.apache.org [mailto:
> user-return-1102-hzhelifu=corp.netease@kudu.apache.org] on behalf of ??
> Sent: October 26, 2017 13:50
> To: user@kudu.apache.org
> Subject: Re: Re: How kudu synchronize real-time records?
>
> Thanks for replying me.
>
> It helps a lot.
>
> 2017-10-26 12:29 GMT+09:00 helifu <hzhel...@corp.netease.com>:
> > Hi,
> >
> > Now the read/write operations are limited to the master replica(record1
> on node1), and the copy replica(record1 on node2/node3) can't be read/write
> by clients directly.
> >
> >
> > 何李夫
> > 2017-04-10 11:24:24
> >
> > -----Original Message-----
> > From: user-return-1100-hzhelifu=corp.netease@kudu.apache.org [mailto:
> user-return-1100-hzhelifu=corp.netease@kudu.apache.org] on behalf of ??
> > Sent: October 26, 2017 10:43
> > To: user@kudu.apache.org
> > Subject: How kudu synchronize real-time records?
> >
> > Hi!
> >
> > I read from documents saying 'once kudu receives records from client it
> write those records into WAL (also does replica)'
> >
> > And i wonder it can be different time when load those records from WAL
> in each node.
> > So let's say node1 load record1 from WAL at t1, node2 t2, node3 t3 (t1 <
> t2 < t3) then reading client attached node1 can see record but other
> reading clients attached not node1(node2, node3) have possibilities missing
> record1.
> >
> > I think that does not happens in kudu, and i wonder how kudu synchronize
> real time data.
> >
> > Thanks!
> >
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: kudu 1.4 kerberos

2017-10-24 Thread Todd Lipcon
On Tue, Oct 24, 2017 at 12:41 PM, Todd Lipcon <t...@cloudera.com> wrote:

> I've filed https://issues.apache.org/jira/browse/KUDU-2198 to provide a
> workaround for systems like this. I should have a patch up shortly since
> it's relatively simple.
>
>
... and here's the patch, if you want to try it out, Matteo:
https://gerrit.cloudera.org/c/8373/

-Todd


> -Todd
>
> On Tue, Oct 17, 2017 at 7:00 PM, Brock Noland <br...@phdata.io> wrote:
>
>> Just one clarification below...
>>
>> > On Mon, Oct 16, 2017 at 2:29 PM, Matteo Durighetto <
>> m.durighe...@miriade.it> wrote:
>> > the "abcdefgh1234" it's an example of the the string created by the
>> cloudera manager during the enable kerberos.
>>
>> ...
>>
>> On Mon, Oct 16, 2017 at 11:57 PM, Todd Lipcon <t...@cloudera.com> wrote:
>> > Interesting. What is the sAMAccountName in this case? Wouldn't all of
>> the 'kudu' have the same account name?
>>
>> CM generates some random names for cn and sAMAccountName. Below is an
>> example created by CM.
>>
>> dn: CN=uQAtUOSwrA,OU=valhalla-kerberos,OU=Hadoop,DC=phdata,DC=io
>> cn: uQAtUOSwrA
>> sAMAccountName: uQAtUOSwrA
>> userPrincipalName: kudu/worker5.valhalla.phdata...@phdata.io
>> servicePrincipalName: kudu/worker5.valhalla.phdata.io
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: kudu 1.4 kerberos

2017-10-24 Thread Todd Lipcon
I've filed https://issues.apache.org/jira/browse/KUDU-2198 to provide a
workaround for systems like this. I should have a patch up shortly since
it's relatively simple.

-Todd

On Tue, Oct 17, 2017 at 7:00 PM, Brock Noland <br...@phdata.io> wrote:

> Just one clarification below...
>
> > On Mon, Oct 16, 2017 at 2:29 PM, Matteo Durighetto <
> m.durighe...@miriade.it> wrote:
> > the "abcdefgh1234" it's an example of the the string created by the
> cloudera manager during the enable kerberos.
>
> ...
>
> On Mon, Oct 16, 2017 at 11:57 PM, Todd Lipcon <t...@cloudera.com> wrote:
> > Interesting. What is the sAMAccountName in this case? Wouldn't all of
> the 'kudu' have the same account name?
>
> CM generates some random names for cn and sAMAccountName. Below is an
> example created by CM.
>
> dn: CN=uQAtUOSwrA,OU=valhalla-kerberos,OU=Hadoop,DC=phdata,DC=io
> cn: uQAtUOSwrA
> sAMAccountName: uQAtUOSwrA
> userPrincipalName: kudu/worker5.valhalla.phdata...@phdata.io
> servicePrincipalName: kudu/worker5.valhalla.phdata.io
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: [DISCUSS] Move Slack discussions to ASF official slack?

2017-10-23 Thread Todd Lipcon
On Mon, Oct 23, 2017 at 4:12 PM, Misty Stanley-Jones <mi...@apache.org>
wrote:

> 1.  I have no idea, but you could enable the @all at-mention in the
> existing #kudu-general and let people know that way. Also see my next answer.
>
>
Fair enough.


> 2.  It looks like if you have an apache.org email address you don't need
> an invite, but otherwise an existing member needs to invite you. If you can
> somehow get all the member email addresses, you can invite them all at once
> as a comma-separated list.
>

I'm not sure if that's doable but potentially.

I'm concerned though if we don't have auto-invite for arbitrary community
members who just come by a link from our website. A good portion of our
traffic is users, rather than developers, and by-and-large they don't have
apache.org addresses. If we closed the Slack off to them I think we'd lose
a lot of the benefit.


>
> 3.  I can't tell what access there is to integrations. I can try to find
> out who administers that on ASF infra and get back with you. I would not be
> surprised if integrations with the ASF JIRA were already enabled.
>
> I pre-emptively grabbed #kudu on the ASF slack in case we decide to go
> forward with this. If we don't decide to go forward with it, it's a good
> idea to hold onto the channel and pin a message in there about how to get
> to the "official" Kudu slack.
>
> On Mon, Oct 23, 2017 at 3:00 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> A couple questions about this:
>>
>> - is there any way we can email out to our existing Slack user base to
>> invite them to move over? We have 866 members on our current slack and
>> would be a shame if people got confused as to where to go for questions.
>>
>> - does the ASF slack now have a functioning self-serve "auto-invite"
>> service?
>>
>> - will we still be able to set up integrations like JIRA/github?
>>
>> -Todd
>>
>> On Mon, Oct 23, 2017 at 2:53 PM, Misty Stanley-Jones <mi...@apache.org>
>> wrote:
>>
>>> When we first started using Slack, I don't think the ASF Slack instance
>>> existed. Using our own Slack instance means that we have limited access
>>> to
>>> message archives (unless we pay) and that people who work on multiple ASF
>>> projects need to add the Kudu slack in addition to any other Slack
>>> instances they may be on. I propose that we instead create one or more
>>> Kudu-related channels on the official ASF slack (
>>> http://the-asf.slack.com/)
>>> and migrate our discussions there. What does everyone think?
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: [DISCUSS] Move Slack discussions to ASF official slack?

2017-10-23 Thread Todd Lipcon
A couple questions about this:

- is there any way we can email out to our existing Slack user base to
invite them to move over? We have 866 members on our current slack and
would be a shame if people got confused as to where to go for questions.

- does the ASF slack now have a functioning self-serve "auto-invite"
service?

- will we still be able to set up integrations like JIRA/github?

-Todd

On Mon, Oct 23, 2017 at 2:53 PM, Misty Stanley-Jones <mi...@apache.org>
wrote:

> When we first started using Slack, I don't think the ASF Slack instance
> existed. Using our own Slack instance means that we have limited access to
> message archives (unless we pay) and that people who work on multiple ASF
> projects need to add the Kudu slack in addition to any other Slack
> instances they may be on. I propose that we instead create one or more
> Kudu-related channels on the official ASF slack (http://the-asf.slack.com/
> )
> and migrate our discussions there. What does everyone think?
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Change Data Capture (CDC) with Kudu

2017-09-29 Thread Todd Lipcon
the primary one. The two
>>>> instances do not have to be identical in real-time (i.e. we are not looking
>>>> for synchronous writes to Kudu), but we would like to have some pretty good
>>>> confidence that the secondary instance contains all the changes that the
>>>> primary has up to say an hour before (or something like that).
>>>>
>>>>
>>>> So far we considered a couple of options:
>>>> - refreshing the seconday instance with a full copy of the primary one
>>>> every so often, but that would mean having to transfer say 50TB of data
>>>> between the two locations every time, and our network bandwidth constraints
>>>> would prevent to do that even on a daily basis
>>>> - having a column that contains the most recent time a row was updated,
>>>> however this column couldn't be part of the primary key (because the
>>>> primary key in Kudu is immutable), and therefore finding which rows have
>>>> been changed every time would require a full scan of the table to be
>>>> sync'd. It would also rely on the "last update timestamp" column to be
>>>> always updated by the application (an assumption that we would like to
>>>> avoid), and would need some other process to take into accounts the rows
>>>> that are deleted.
>>>>
>>>>
>>>> Since many of today's RDBMS (Oracle, MySQL, etc) allow for some sort of
>>>> 'Change Data Capture' mechanism where only the 'deltas' are captured and
>>>> applied to the secondary instance, we were wondering if there's any way in
>>>> Kudu to achieve something like that (possibly mining the WALs, since my
>>>> understanding is that each change gets applied to the WALs first).
>>>>
>>>>
>>>> Thanks,
>>>> Franco Venturi
>>>>
>>>
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Please tell me about License regarding kudu logo usage

2017-09-19 Thread Todd Lipcon
Oops, adding the original poster in case he or she is not subscribed to the
list.

On Sep 19, 2017 10:46 PM, "Todd Lipcon" <t...@cloudera.com> wrote:

> Hi Yuya,
>
> There should be no problem to use the Apache Kudu logo in your conference
> slides, assuming you are just using as intended to describe or refer to the
> project itself. This is considered "nominative use" under trademark laws.
>
> You can read more about nominative use at:
> https://www.apache.org/foundation/marks/#principles
>
> Thanks for including Kudu in your upcoming talk! I hope you will share the
> slides with the community.
>
> Todd
>
>
>
> On Sep 19, 2017 10:43 PM, "野口 裕也" <noguchi-y...@dmm.com> wrote:
>
> Dear Team Kudu
>
> I want to use Apache Kudu Logo in my session of JAPAN PHP CONFERENCE 2017
>
> Can I use it in my session?
>
> Please tell me about License regarding kudu logo usage.
>
> I look forward to hearing from you.
>
> Best regards,
> yuya
>
> --
>
>
>


Re: Please tell me about License regarding kudu logo usage

2017-09-19 Thread Todd Lipcon
Hi Yuya,

There should be no problem to use the Apache Kudu logo in your conference
slides, assuming you are just using as intended to describe or refer to the
project itself. This is considered "nominative use" under trademark laws.

You can read more about nominative use at:
https://www.apache.org/foundation/marks/#principles

Thanks for including Kudu in your upcoming talk! I hope you will share the
slides with the community.

Todd



On Sep 19, 2017 10:43 PM, "野口 裕也"  wrote:

Dear Team Kudu

I want to use Apache Kudu Logo in my session of JAPAN PHP CONFERENCE 2017

Can I use it in my session?

Please tell me about License regarding kudu logo usage.

I look forward to hearing from you.

Best regards,
yuya

--


Re: Composite primary key

2017-09-05 Thread Todd Lipcon
Hi Janne,

This is a good interesting question.

If you never plan on actually querying based on those columns themselves,
concatenating them into a binary column as the single PK will save a bit of
space relative to storing them separately. In the case of a composite
primary key, Kudu will internally encode a binary concatenated column and
store it using prefix encoding. So, if you store them separately, you'll
get the same composite binary encoding plus the additional storage for the
separate columns.

However, if you have any use case for querying based on them, having the
separate columns would be quite useful, since Kudu can push down predicates
to individual columns.

Being able to use the subfields for partitioning is also likely to be
useful - eg you might want to hash-partition on 'topic+partition' together
so that all data for a given topic always ends up stored together. This
wouldn't be possible if you use a combined (manually-encoded) key.

-Todd

On Fri, Aug 25, 2017 at 11:10 PM, Janne Keskitalo <janne.keskit...@paf.com>
wrote:

> Hi
>
> We're inserting messages from kafka into kudu tables and some messages
> don't have a natural primary key, hence we decided to use kafka
> topic/partition/offset -combination as the key. Is it better to concatenate
> the fields into one kudu column or create a separate column for each? Do we
> get better compression if using individual columns? And is the PK index
> structure maintained outside of the actual table data?
>
> --
> Br.
> Janne Keskitalo,
> Database Architect, PAF.COM
> For support: dbdsupp...@paf.com
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Question about per server data upper limit.

2017-09-05 Thread Todd Lipcon
Thanks Li Jin for reporting back your experiences!

Kudu 1.5 also has more improvements for data density, so if you want to try
testing the Kudu 1.5.0 RC3 release candidate in your environment, that
would be great.

-Todd

On Mon, Sep 4, 2017 at 7:41 PM, Li Jin <yuyunliu...@gmail.com> wrote:

> Thanks for the reply. That is to say, there is no hard limit on a tserver's
> data; more data just increases startup time, and we need more resources,
> such as tablets, tserver thread count, and file descriptor count.
> Maybe the upper limit is not the tserver itself, but something else.
> By the way, our test data is more than 6T per tserver, and it works well now.
> QPS is more than 500k. Kudu 1.4.0 has many optimizations, including
> raising the recommended upper limit of data per tserver to 8T. We will
> continue to test until the data reaches 8T or more. It's interesting.
> Our machine configuration and Kudu version:
> 32 CPU Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz, 128G memory, 6*16T HDD
> for data and 3T for WAL. Kudu 1.4.0, 5 masters + 5 tservers.
> If more interesting things happen, I will reply here.
> Thanks again.
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Configure Impala for Kudu on Separate Cluster

2017-08-14 Thread Todd Lipcon
> received: org.apache.kudu.client.RecoverableException: [Peer
> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
> [23082ms] querying master, [23082ms] Sub rpc: ConnectToMaster sending RPC
> to server master-prod-dc1-datanode151.pdc1i.gradientx.com:7051, [23083ms]
> Sub rpc: ConnectToMaster received from server master-prod-dc1-datanode151.
> pdc1i.gradientx.com:7051 response Network error: [Peer
> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
> [23086ms] delaying RPC due to Service unavailable: Master config (
> prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader. Exceptions
> received: org.apache.kudu.client.RecoverableException: [Peer
> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
> [26062ms] querying master, [26063ms] Sub rpc: ConnectToMaster sending RPC
> to server master-prod-dc1-datanode151.pdc1i.gradientx.com:7051, [26064ms]
> Sub rpc: ConnectToMaster received from server master-prod-dc1-datanode151.
> pdc1i.gradientx.com:7051 response Network error: [Peer
> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
> [26067ms] delaying RPC due to Service unavailable: Master config (
> prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader. Exceptions
> received: org.apache.kudu.client.RecoverableException: [Peer
> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
> [29721ms] querying master, [29722ms] Sub rpc: ConnectToMaster sending RPC
> to server master-prod-dc1-datanode151.pdc1i.gradientx.com:7051, [29723ms]
> Sub rpc: ConnectToMaster received from server master-prod-dc1-datanode151.
> pdc1i.gradientx.com:7051 response Network error: [Peer
> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
> [29726ms] delaying RPC due to Service unavailable: Master config (
> prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader. Exceptions
> received: org.apache.kudu.client.RecoverableException: [Peer
> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
> [30141ms] querying master, [30143ms] Sub rpc: ConnectToMaster sending RPC
> to server master-prod-dc1-datanode151.pdc1i.gradientx.com:7051, [30144ms]
> Sub rpc: ConnectToMaster received from server master-prod-dc1-datanode151.
> pdc1i.gradientx.com:7051 response Network error: [Peer
> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
> [30147ms] delaying RPC due to Service unavailable: Master config (
> prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader. Exceptions
> received: org.apache.kudu.client.RecoverableException: [Peer
> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed,
> [33361ms] trace too long, truncated)
> CAUSED BY: NoLeaderFoundException: Master config (
> prod-dc1-datanode151.pdc1i.gradientx.com:7051) has no leader. Exceptions
> received: org.apache.kudu.client.RecoverableException: [Peer
> master-prod-dc1-datanode151.pdc1i.gradientx.com:7051] Connection closed
> CAUSED BY: RecoverableException: [Peer master-prod-dc1-datanode151.
> pdc1i.gradientx.com:7051] Connection closed
>
> I also added to Impala Command Line Argument Advanced Configuration
> Snippet (Safety Valve) the setting:
> -kudu_master_hosts=prod-dc1-datanode151.pdc1i.gradientx.com:7051
>
> And to Impala Service Environment Advanced Configuration Snippet (Safety
> Valve) the setting:
> IMPALA_KUDU=1
>
> I don't think these are correct because it's not working.
>
> Cheers,
> Ben
>
>
> On Mon, Aug 14, 2017 at 9:58 PM Todd Lipcon <t...@cloudera.com> wrote:
>
>> Hi Ben,
>>
>> What error are you getting? The Impala shell shouldn't be relevant here
>> -- it's just submitting queries to the Impala daemons themselves.
>>
>> This deployment scenario hasn't been tested much as far as I know, so I
>> wouldn't be surprised if there are issues with scheduling fragments given
>> lack of locality, etc, but I would expect it to at least "basically work".
>> (I've done it once or twice to copy data from one cluster to another)
>>
>> -Todd
>>
>> On Mon, Aug 14, 2017 at 7:59 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Can someone help me with configuring Impala using Cloudera Manager for
>>> Kudu 1.4.0 on CDH 5.12.0? I cannot get it to connect using impala shell.
>>>
>>> Cheers,
>>> Ben
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Any plans for "Aggregation Push down" or integrating Impala + Kudu more tightly?

2017-06-29 Thread Todd Lipcon
Hey Jason,

Answers inline below

On Thu, Jun 29, 2017 at 2:52 AM, Jason Heo <jason.heo@gmail.com> wrote:

> Hi,
>
> Q1.
>
> After reading Druid vs Kudu
> <http://druid.io/docs/latest/comparisons/druid-vs-kudu.html>, I wondered
> Druid has aggregation push down.
>
> *Druid includes its own query layer that allows it to push down
>> aggregations and computations directly to data nodes for faster query
>> processing. *
>
>
> If I understand "Aggregation Push down" correctly, it seems that partial
> aggregation is done on the data node side, so that only a small amount of the
> result set needs to be transferred to the client, which could lead to a great
> performance gain. (Am I right?)
>

That's right.


>
> So I wanted to know if Apache Kudu has a plan for Aggregation push down
> scan feature (Or already has it)
>

It currently doesn't have it, and there aren't any current plans to do so.

Usually, we assume that Kudu tablet servers are collocated with either
Impala daemons or Spark executors. So, it's less important to provide
pushdown into the tablet server itself, since even without it, we are
typically avoiding any network transfer from the TS into the execution
environment which does the aggregation.

The above would be less true if there were some way in which Kudu itself
could perform the aggregation more efficiently based on knowledge of its
underlying storage. For example, one could imagine doing a GROUP BY
'foo_col' more efficiently within Kudu if the column is dictionary-encoded
by aggregating on the code-words rather than the resulting decoded strings,
since the integer code words are fixed length and faster to hash, compare,
etc.

That said, it hasn't been a high priority relative to other performance
areas we're exploring.


>
> Q2.
>
> One thing that concerns me when using Impala+Kudu is that all matching rows
> must be transferred to the Impala process from the Kudu tserver. Usually
> Impala and the Kudu tserver run on the same node, so it would be great if
> Impala could read Kudu tablets directly. Any plans for this kind of feature?
>
> How-to: Use Impala and Kudu Together for Analytic Workloads
> <https://blog.cloudera.com/blog/2016/04/how-to-use-impala-and-kudu-together-for-analytic-workloads/>
> says that:
>
> *we intend to implement the Apache Arrow in-memory data format and to
>> share memory between Kudu and Impala, which we expect will help with
>> performance and resource usage.*
>>
>
> What does "share memory between Kudu and Impala"? Does this already
> implemented?
>

Yes, currently all matching rows are transferred from the Kudu TS to the
Impala daemon. Impala schedules scanners for locality, though, so this is a
localhost-scoped connection which is quite fast. To give you a sense of the
speed, I just tested a localhost TCP connection using 'iperf' and measured
~6GB/sec on a single core. Although this is significantly slower than a
within-process memcpy, it's still fast enough that it usually represents a
small fraction of the overall CPU consumption of a query.

Regarding sharing memory, the first step which is already implemented is to
share a common in-memory layout. That is to say, the format in which Kudu
returns rows over the wire to the client matches the same format that
Impala expects its rows in memory. So, when it receives a block of rows in
the scanner, it doesn't have to "parse" or "convert" them to some other
format. It can simply interpret the data in place. This saves a lot of CPU.

Using something like Arrow would be even more efficient than the current
format since it is columnar rather than row-oriented. However, Impala
currently does not use a columnar format for its operator pipeline, so we
can't currently make use of Arrow to optimize the interchange.

Currently, as mentioned above, the data (in the common format) is
transferred from Kudu to Impala via a localhost TCP socket. A few years ago
we had an intern who experimented with using a Unix domain socket and found
some small speedup. He also experimented with setting up a shared memory
region and also found another small speedup over the domain socket.
However, there was a lot of complexity involved in this code (particularly
the shared memory approach) relative to the gain that we saw, so we didn't
end up merging it before his internship ended :)

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Time travel reads in Kudu

2017-06-18 Thread Todd Lipcon
Just to illustrate, I wrote a quick python script that shows the behavior:
https://gist.github.com/toddlipcon/385fcf4211f83e4968be3401db3147ba

The script runs your scenario of insert, update, update, delete, and then
scans at each of the times between the operations. The output on my machine
(running against a local tserver) is:
scan at datetime.datetime(2017, 6, 18, 21, 35, 20, 594427): [(1, 'v1')]
scan at datetime.datetime(2017, 6, 18, 21, 35, 20, 595743): [(1, 'v2')]
scan at datetime.datetime(2017, 6, 18, 21, 35, 20, 597093): [(1, 'v3')]
scan at datetime.datetime(2017, 6, 18, 21, 35, 20, 598470): []

Note that this example script is relying on the local clock instead of the
propagated timestamps, so it might not work correctly against a cluster
(the server side may have clock skew relative to the local machine where
the script is running). If you need it to work including clock skew, you'll
have to use the more advanced APIs to retrieve propagated timestamps from
the server side after each write.

-Todd


On Sun, Jun 18, 2017 at 1:36 PM, Todd Lipcon <t...@cloudera.com> wrote:

> Hi Ananth,
>
> Answers inline below
>
> On Sat, Jun 17, 2017 at 1:40 PM, Ananth G <ananthg.a...@gmail.com> wrote:
>
>> Hello All,
>>
>> I was wondering if the following is possible as a time travel read in
>> Kudu.
>
>
>> Assuming T stands for the timestamp at which the record has been
>> committed, I have one insert for a given row @T1 followed by 3 updates at
>> time stamps @T2,@T3 and @T4. Finally the row was deleted at @T5.  ( T1 < T2
>> < T3 < T4 < T5 in terms of timestamps). Representing these values of this
>> row as V, the following is the state of values of this row.
>>
>> T1 -> V1 ( original insert )
>> T2 -> V2 ( first update )
>> T3 -> V3 ( second update )
>> T4 -> V4 ( third update )
>> T5 -> V5 ( Tombstone/delete )
>>
>> Now I want to perform a read scan. I am using the READ_AT_SNAPSHOT mode
>> and using setSnapShotMicros()  method to perform the read at that snapshot.
>> I was wondering if I would have the flexibility to get the following values
>> provided I am using the snapshot times as follows :
>>
>> 1. Can I get value V2 if I set snapshot time as t2 provided T2< t2 < T3 ?
>>
> yes
>
>
>> 2. Can I get value V3 if I set snapshot time as t3 provided T3 < t3 <  T4
>> ?
>>
>> yes
>
>
>> Also it is obvious for this to work properly  we will need two timestamps
>> as part of the API call ( lower and upper bound ) to retrieve value V2.
>> The usage of the word MVCC is interesting and hence this question.
>>
>
> I'm not following what you mean by a lower and upper bound timestamp? The
> READ_AT_SNAPSHOT setting means that you read the state of the table exactly
> as it was at the provided time. So, if you provide a time in between T2 and
> T3, you will see the value that was most recently committed before the
> specified time (i.e the value at T2)
>
>
>
>
>>
>> In other words, when we say Kudu has a MVCC style for data as an asset;
>> is it for all versions of the data mutation or just for the reconciliation
>> stage ? I am assuming it is only for the last stage of reconciliation (
>> i.e. until reads are fully committed ). Since timestamps in Kudu seem to be
>> for the lower bound markers, the above might not be possible but wanted to
>> check with the community.
>>
>
> It stores all history for a configurable amount of time
> (--tablet-history-max-age-sec, default 15 minutes). You can bump this to a
> longer amount of time.
>
>
>>
>> If it is otherwise , does the model hold good after a compaction is
>> performed ?
>>
>>
> Yes, as of version 1.2 (I think) the full history is properly retained
> regardless of any compactions, etc, subject to the above mentioned history
> limit.
>
> -Todd
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Time travel reads in Kudu

2017-06-18 Thread Todd Lipcon
Hi Ananth,

Answers inline below

On Sat, Jun 17, 2017 at 1:40 PM, Ananth G <ananthg.a...@gmail.com> wrote:

> Hello All,
>
> I was wondering if the following is possible as a time travel read in Kudu.


> Assuming T stands for the timestamp at which the record has been
> committed, I have one insert for a given row @T1 followed by 3 updates at
> time stamps @T2,@T3 and @T4. Finally the row was deleted at @T5.  ( T1 < T2
> < T3 < T4 < T5 in terms of timestamps). Representing these values of this
> row as V, the following is the state of values of this row.
>
> T1 -> V1 ( original insert )
> T2 -> V2 ( first update )
> T3 -> V3 ( second update )
> T4 -> V4 ( third update )
> T5 -> V5 ( Tombstone/delete )
>
> Now I want to perform a read scan. I am using the READ_AT_SNAPSHOT mode
> and using setSnapShotMicros()  method to perform the read at that snapshot.
> I was wondering if I would have the flexibility to get the following values
> provided I am using the snapshot times as follows :
>
> 1. Can I get value V2 if I set snapshot time as t2 provided T2< t2 < T3 ?
>
yes


> 2. Can I get value V3 if I set snapshot time as t3 provided T3 < t3 <  T4 ?
>
> yes


> Also it is obvious for this to work properly  we will need two timestamps
> as part of the API call ( lower and upper bound ) to retrieve value V2.
> The usage of the word MVCC is interesting and hence this question.
>

I'm not following what you mean by a lower and upper bound timestamp? The
READ_AT_SNAPSHOT setting means that you read the state of the table exactly
as it was at the provided time. So, if you provide a time in between T2 and
T3, you will see the value that was most recently committed before the
specified time (i.e. the value at T2).
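
As a rough illustration of that in the Java client (the table name, the master
address, and the literal timestamp are made-up placeholders):

```
import org.apache.kudu.client.AsyncKuduScanner.ReadMode;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResult;
import org.apache.kudu.client.RowResultIterator;

public class SnapshotScanExample {
  public static void main(String[] args) throws Exception {
    KuduClient client =
        new KuduClient.KuduClientBuilder("kudu-master-host:7051").build();
    try {
      KuduTable table = client.openTable("my_table");

      // Any microsecond timestamp strictly between T2 and T3; the scan returns
      // the table exactly as it was at that instant, i.e. the value from T2.
      long snapshotMicros = 1497700000000000L;

      KuduScanner scanner = client.newScannerBuilder(table)
          .readMode(ReadMode.READ_AT_SNAPSHOT)
          .snapshotTimestampMicros(snapshotMicros)
          .build();

      while (scanner.hasMoreRows()) {
        RowResultIterator rows = scanner.nextRows();
        for (RowResult row : rows) {
          System.out.println(row.rowToString());
        }
      }
      scanner.close();
    } finally {
      client.close();
    }
  }
}
```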




>
> In other words, when we say Kudu has a MVCC style for data as an asset; is
> it for all versions of the data mutation or just for the reconciliation
> stage ? I am assuming it is only for the last stage of reconciliation (
> i.e. until reads are fully committed ). Since timestamps in Kudu seem to be
> for the lower bound markers, the above might not be possible but wanted to
> check with the community.
>

It stores all history for a configurable amount of time
(--tablet-history-max-age-sec, default 15 minutes). You can bump this to a
longer amount of time.


>
> If it is otherwise , does the model hold good after a compaction is
> performed ?
>
>
Yes, as of version 1.2 (I think) the full history is properly retained
regardless of any compactions, etc, subject to the above mentioned history
limit.

-Todd


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: What does "Failed RPC negotiation" in kudu-tserver.WARNING

2017-06-17 Thread Todd Lipcon
No problem. We are here to help! We are glad to see your team using Kudu.

Todd

On Jun 17, 2017 7:24 PM, "Jason Heo"  wrote:

> Hi Jean-Daniel, Todd, and Alexey
>
> Thank your for the replies.
>
> Recently, I've experienced many issues but successfully resolved them with
> your help. I really appreciate it.
>
> Regards,
>
> Jason
>


[ANNOUNCE] Apache Kudu 1.4.0 released

2017-06-15 Thread Todd Lipcon
The Apache Kudu team is happy to announce the release of Kudu 1.4.0.

Kudu is an open source storage engine for structured data which supports
low-latency random access together with efficient analytical access
patterns. It is designed within the context of the Apache Hadoop ecosystem
and supports many integrations with other data analytics projects both
inside and outside of the Apache Software Foundation.

Apache Kudu 1.4.0 is a minor release which offers several new features,
improvements, optimizations, and bug fixes. Please see the release notes
for details.

Download it here: http://kudu.apache.org/releases/1.4.0/
Full release notes:
http://kudu.apache.org/releases/1.4.0/docs/release_notes.html

Regards,
The Apache Kudu team


Re: How to manage yearly range partition efficiently

2017-06-08 Thread Todd Lipcon
On Wed, Jun 7, 2017 at 7:44 PM, Jason Heo <jason.heo@gmail.com> wrote:

> Hi.
>
> This is a partition strategy of my table.
>
> PARTITION BY HASH (...) PARTITIONS 40,
> RANGE (ymd) (
> PARTITION VALUES < "2015",
> PARTITION "2015" <= VALUES < "2016",
> PARTITION "2016" <= VALUES < "2017",
> PARTITION "2017" <= VALUES
> )
>
> My concern is how to manage RANGE(ymd) partitions for years greater than
> 2017.
>
> plan 1) using a cron job, add 2018 partition at the end of 2017, add 2019
> partition at the end of 2018, ...
> - pros: no unused partitions
> - cons: problems arise if next year's partition is accidentally not
> created
> plan 2) add all upcoming 10 years' partitions
> - pros: can reduce risks
> - cons: 400 partitions (40*10 years) are created but they have no data
>
> I prefer plan 2), but I'm wondering whether many unnecessary partitions
> lead to problems.
>

The empty partitions will increase heartbeat traffic on your nodes, etc.
It's not major, but does add a bit of overhead. They will also participate
in any queries you run which are not able to prune partitions based on
range. Even though they have no data, there would be unnecessary
Spark/impala tasks/fragments running against the empty partitions, etc,
which may impact performance and concurrency.

I think I'd suggest plan 1, plus putting it on several people's calendars
to verify :) Alternatively, do something like this: in 2017, add the partitions
for 2018 and 2019, so you always maintain one extra year ahead and are less
likely to "not notice" if the new one is not created in time.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Apache Apex supports kudu as a high throughput sink

2017-05-30 Thread Todd Lipcon
Hey Ananth,

Thanks for posting this, and for working on the Kudu sink for Apex.

One thing I wanted to note in the article:

"Kudu output operator allows the client side timestamps to be propagated to
the Kudu server where the mutation is executed. This allows for out of
sequence data tuples to be ordered on the server side. The following
snippet of code in the upstream operator shows how this can be done."

I think your understanding of the setPropagatedTimestamp() call is not
quite right. This timestamp propagation serves as a lower-bound for the
assigned timestamp at the server side, not as an exact setting of the
server side timestamp. Thus, if you perform two inserts, and the second
insert has a lower propagated timestamp, it does _not_ ensure that the
first one takes precedence. Since the Propagated Timestamp is a
lower-bound, the second insert will still be assigned a higher timestamp
than the first.

The purpose of this advanced API is to allow causal ordering to be
maintained between two writes. For example, imagine that client A writes
data from machine A, and then communicates with client B on machine B.
Then, client B performs a write. If we want to ensure that B's write is
assigned a higher timestamp than A, the setPropagatedTimestamp() API can
ensure that (by setting A's write's timestamp as the lower bound for B's
write). But, it can't be used to back-date a write as the article seems to
be implying.

Otherwise, the post is great! Thanks again for sharing your experience and
application.

-Todd

On Tue, May 30, 2017 at 11:33 AM, Ananth G <ananthg.a...@gmail.com> wrote:

> Hello All,
>
> Apache apex now enables low latency high throughput writes to Kudu as a
> sink. More details on this on the atrato blog here: http://www.atrato.io/
> blog/2017/05/28/apex-kudu-output/ . Please use the comments section to
> provide any feedback.
>
> Regards,
> Ananth
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Actual encoding and compression

2017-05-05 Thread Todd Lipcon
Hi Pavel,

It's worth noting that the fallback is fine-grained, not coarse-grained.

That is to say, if you choose AUTO (which is currently equivalent to DICT
for strings) and the fallback occurs, that fallback is at the granularity
of an individual column block, not the entire column. So there is no way to
display whether fallback occurred at a global level.

We could build some kind of tool to inspect the actual blocks on disk and
determine how many of them fell back to plain, but that would be a rather
expensive operation of scanning data, rather than just looking at some
piece of global metadata.
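
As a side note, if you'd rather not depend on the AUTO behavior at all, you can
pin a column to an explicit encoding and compression when you define the
schema. A minimal sketch with the Java client (the column name and the specific
choices are only illustrative; per-block fallback to plain can still occur if a
block's dictionary grows too large):

```
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.ColumnSchema.CompressionAlgorithm;
import org.apache.kudu.ColumnSchema.Encoding;
import org.apache.kudu.Type;

public class ExplicitEncodingExample {
  public static void main(String[] args) {
    // Request dictionary encoding plus LZ4 explicitly instead of
    // AUTO_ENCODING / DEFAULT_COMPRESSION.
    ColumnSchema city =
        new ColumnSchema.ColumnSchemaBuilder("city", Type.STRING)
            .encoding(Encoding.DICT_ENCODING)
            .compressionAlgorithm(CompressionAlgorithm.LZ4)
            .build();

    System.out.println(city.getEncoding() + " / " + city.getCompressionAlgorithm());
  }
}
```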

Hope that helps
-Todd

On Fri, May 5, 2017 at 12:58 AM, Pavel Martynov <mr.xk...@gmail.com> wrote:

> Hi, how can I see actual (effective) encoding and compression of table
> columns if I use defaults?
> Web UI shows me AUTO_ENCODING and DEFAULT_COMPRESSION, but as stated here
> https://kudu.apache.org/docs/schema_design.html#encoding Kudu may
> "transparently fall back to plain encoding" from dictionary encoding.
> I think it would be useful to the user to see actual used encoding &
> compression.
>
> --
> with best regards, Pavel Martynov
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Apache Kudu, Spark and StreamSets

2017-05-04 Thread Todd Lipcon
Hi Butch,

Thanks for helping to promote Kudu in your community.

Are these presentations open to the public? If so I can tweet about them
from the @ApacheKudu twitter account. We could also re-post your slides on
the Kudu blog.

-Todd

On Wed, May 3, 2017 at 12:18 AM, Quinto, Butch (AU - Melbourne) <
bqui...@deloitte.com.au> wrote:

> Hi All,
>
>
>
> Ex-Clouderan here. Recently joined Deloitte and now heading big data
> practice and partner relations with Cloudera in Australia and NZ.
>
>
>
> I'm a big fan of Kudu and have been pushing adoption of Kudu within Deloitte
> and with big data clients.
>
>
>
>
>
> I’m doing a series of presentations here in Australia:
>
>
>
> Wed, May 3rd, 2017. Deloitte Perth Analytics and Information Management
> Team. Perth, Australia
>
> *Apache Kudu, Spark and StreamSets. Presented by Butch Quinto*
>
>
>
> Fri, May 12th, 2017. Deloitte Melbourne Analytics and Information
> Management Team. Melbourne, Australia
>
> *Apache Kudu, Spark and StreamSets. Presented by Butch Quinto*
>
>
>
> More to follow…
>
>
>
>
>
>
>
> *Butch Quinto*
>
> Director | Analytics and Information Management
>
> Deloitte Consulting Pty Ltd
>
> 550 Bourke Street, Melbourne, VIC, 3000, Australia
>
> D: +61 3 9671 5433 <+61%203%209671%205433> | M: +61 402 430 736
> <+61%20402%20430%20736>
>
> bqui...@deloitte.com.au | www.deloitte.com.au
>
>
>
>
>
>
>
>
> This e-mail and any attachments to it are confidential. You must not use,
> disclose or act on the e-mail if you are not the intended recipient. If you
> have received this e-mail in error, please let us know by contacting the
> sender and deleting the original e-mail.
>
> Liability limited by a scheme approved under Professional Standards
> Legislation.
>
> Deloitte refers to one or more of Deloitte Touche Tohmatsu Limited, a UK
> private company limited by guarantee, and its network of member firms, each
> of which is a legally separate and independent entity. Please see
> www.deloitte.com.au/about <http://www.deloitte.com/au/about> for a
> detailed description of the legal structure of Deloitte Touche Tohmatsu
> Limited and its member firms. Nothing in this e-mail, nor any related
> attachments or communications or services, have any capacity to bind any
> other entity under the ‘Deloitte’ network of member firms (including those
> operating in Australia).
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: RPC and Service difference?

2017-04-28 Thread Todd Lipcon
To clarify one bit - the acceptor thread is the thread calling accept() on
the listening TCP socket. Once accepted, the RPC system uses libev
(event-based IO) to react to new packets on a "reactor thread". When a full
RPC request is received, it is distributed to the "service threads".

I'd also suggest running 'top -H -p $(pgrep kudu-tserver)' to see the
thread activity during the workload. You can see if one of the reactor
threads is hitting 100% CPU, for example, though I've never seen that to be
a bottleneck. David's pointers are probably good places to start
investigating.

-Todd

On Fri, Apr 28, 2017 at 1:41 PM, David Alves <davidral...@gmail.com> wrote:

> Hi
>
>   The acceptor thread only distributes work; it's very unlikely to be a
> bottleneck. The same goes for the number of workers, since the number of
> threads pulling data is defined by Impala.
>   What is "extremely" slow in this case?
>
>   Some things to check:
>   It seems like this is scanning only 5 tablets? Are those all the tablets
> in per ts? Do tablets have roughly the same size?
>   Are you using encoding/compression?
>   How much data per tablet?
>   Have you ran "compute stats" on impala?
>
> Best
> David
>
>
>
> On Fri, Apr 28, 2017 at 9:07 AM, 기준 <0ctopus13pr...@gmail.com> wrote:
>
>> Hi!
>>
>> I'm using kudu 1.3, impala 2.7.
>>
>> I'm investigating an extremely slow scan read in Impala's profiling.
>>
>> So I dug into Impala's and Kudu's source code.
>>
>> And I concluded that this is a connection throughput problem.
>>
>> As I found out, Impala uses the steps below to send scan requests to Kudu.
>>
>> 1. RunScannerThread -> Create new scan threads
>> 2. ProcessScanToken -> Open
>> 3. KuduScanner:GetNext
>> 4. Send Scan RPC -> Send scan rpc continuously
>>
>> So i checked kudu's rpc configurations.
>>
>> --rpc_num_acceptors_per_address=1
>> --rpc_num_service_threads=20
>> --rpc_service_queue_length=50
>>
>>
>> Here are my questions.
>>
>> 1. Does the acceptor accept all RPC requests and toss them to the proper
>> service? So, Scan RPC -> Acceptor -> RpcService?
>>
>> 2. If I want to increase input throughput, then I should increase
>> '--rpc_num_service_threads', right?
>>
>> 3. Why does '--rpc_num_acceptors_per_address' have such a small value
>> compared to --rpc_num_service_threads? I'm going to increase that value
>> too; do you think this is a bad idea? If so, can you please describe the
>> reason?
>>
>> Thanks for replying to me!
>>
>> Have a nice day~ :)
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Bad insert performance of java kudu-client

2017-04-25 Thread Todd Lipcon
Hi Pavel,

That's a good find. It certainly does look like we could do caching of this
data. We use the local network interface address list to determine whether
a remote server is local or not.

In fact in many cases we are calling this we don't even care about the
result - it's just computed as a side effect of creating the 'ServerInfo'
object.

I filed KUDU-1982 to track this issue.

Any interest in working on a fix?
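
For anyone following along, the shape of the fix could be as simple as
memoizing the lookup result per address. A hypothetical sketch (this is not the
patch tracked by KUDU-1982, just an illustration of the idea):

```
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.concurrent.ConcurrentHashMap;

/** Caches NetworkInterface lookups, which can be very slow on some platforms. */
public final class CachedLocalAddressCheck {
  private static final ConcurrentHashMap<InetAddress, Boolean> CACHE =
      new ConcurrentHashMap<>();

  /** Returns true if the given address is bound to a local network interface. */
  public static boolean isLocalAddress(InetAddress addr) {
    return CACHE.computeIfAbsent(addr, a -> {
      try {
        return NetworkInterface.getByInetAddress(a) != null;
      } catch (SocketException e) {
        return false;
      }
    });
  }
}
```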

-Todd


On Tue, Apr 25, 2017 at 5:10 AM, Pavel Martynov <mr.xk...@gmail.com> wrote:

> I reproduced this problem with java.net.NetworkInterface.getByInetAddress
> and Windows on a few other machines. I also found this 'not an issue'
> http://bugs.java.com/view_bug.do?bug_id=7039343.
> Maybe kudu-client will use some memoization for this function?
>
> 2017-04-25 13:09 GMT+03:00 Pavel Martynov <mr.xk...@gmail.com>:
>
>> I figured out that the problem was that I ran this program on my development
>> Windows machine. It seems that there is some performance issue with
>> java.net.NetworkInterface.getByInetAddress on Windows (I have only found
>> http://stackoverflow.com/questions/35541870/java-networkinterface-getbyinetaddress-takes-way-too-long
>> as confirmation so far). See the profiler screenshot
>> http://pasteboard.co/8uHil3I5H.png (kudu-client v1.3.1); every call takes
>> 53 ms (!) on average.
>> Also, could you recheck the logic: why is this function called 88 times in
>> 12 seconds for that small program?
>>
>> 2017-04-24 22:29 GMT+03:00 Todd Lipcon <t...@cloudera.com>:
>>
>>> I tried to reproduce this locally using your code and couldn't. I get
>>> around 100K inserts/second for 1.0, 1.1, 1.2, and 1.3 clients (against a
>>> 1.4-SNAPSHOT cluster)
>>>
>>> Is it always reproducible for you? eg if you switch back to the earlier
>>> client and try another set of runs, do you get the same results?
>>>
>>> -Todd
>>>
>>> On Mon, Apr 24, 2017 at 10:56 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>>> I vaguely recall some bug in earlier versions of the Java client where
>>>> 'shutdown' wouldn't properly block on the data being flushed. So it's
>>>> possible in 1.0.x and below, you're not actually measuring the full amount
>>>> of time to write all the data, whereas when the bug is fixed, you are.
>>>>
>>>> I'll see if I can repro this locally as well using your code.
>>>>
>>>> -Todd
>>>>
>>>> On Mon, Apr 24, 2017 at 10:49 AM, David Alves <davidral...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Pavel
>>>>>
>>>>>   Interesting, Thanks for sharing those numbers.
>>>>>   I assume you weren't using AUTOFLUSH_BACKGROUND for the first
>>>>> versions you tested (don't think it was available then iirc).
>>>>>   Could you try without in the last version and see how the numbers
>>>>> compare?
>>>>>   We'd be happy to help track down the reason for this perf regression.
>>>>>
>>>>> Best
>>>>> David
>>>>>
>>>>> On Mon, Apr 24, 2017 at 4:58 AM, Pavel Martynov <mr.xk...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, I ran into the fact that I cannot achieve high insertion speed,
>>>>>> and I started to experiment with
>>>>>> https://github.com/cloudera/kudu-examples/tree/master/java/insert-loadgen.
>>>>>> My slightly modified code (recreation of the table on startup + duration
>>>>>> measuring):
>>>>>> https://gist.github.com/xkrt/9405a2eeb98a56288b7c5a7d817097b4
>>>>>> On every run I change kudu-client version, results:
>>>>>>
>>>>>> kudu-client-ver  perf
>>>>>> 0.10             Duration: 626 ms, 79872/sec
>>>>>> 1.0.0            Duration: 622 ms, 80385 inserts/sec
>>>>>> 1.0.1            Duration: 630 ms, 79365 inserts/sec
>>>>>> 1.1.0            Duration: 11703 ms, 4272 inserts/sec
>>>>>> 1.3.1            Duration: 12317 ms, 4059 inserts/sec
>>>>>>
>>>>>> As you can see, there was a severe degradation between 1.0.1 and 1.1.0
>>>>>> (about ~20x!).
>>>>>> What could be the problem, and how can I fix it? (Actually, I'm
>>>>>> interested in kudu-spark, so using kudu-client 1.0.1 is probably not the
>>>>>> right solution?)
>>>>>>
>>>>>> My test cluster: 3 hosts with master and tserver on each (3 masters
>>>>>> and 3 tservers overall).
>>>>>> No extra settings, flags used:
>>>>>> fs_wal_dir
>>>>>> fs_data_dirs
>>>>>> master_addresses
>>>>>> tserver_master_addrs
>>>>>>
>>>>>>
>>>>>> --
>>>>>> with best regards, Pavel Martynov
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>>
>> --
>> with best regards, Pavel Martynov
>>
>
>
>
> --
> with best regards, Pavel Martynov
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Some bulk requests are missing when a tserver stopped

2017-04-24 Thread Todd Lipcon
I think it's also worth trying 'kudu cluster ksck -checksum_scan
<master1,master2,master3>' to perform a consistency check. This will ensure
that the available replicas have matching data (and uses the SNAPSHOT scan
mode to avoid the inconsistency that David mentioned above).
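
On the write side, it's also worth making sure the loader itself notices failed
rows rather than dropping them silently. A rough sketch with the Java client
(names are placeholders; kudu-spark's KuduContext does an equivalent check,
which is what raises the "failed to write N rows" exception quoted below):

```
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.RowError;
import org.apache.kudu.client.SessionConfiguration.FlushMode;

public class CheckedBulkLoad {
  public static void main(String[] args) throws Exception {
    KuduClient client =
        new KuduClient.KuduClientBuilder("kudu-master-host:7051").build();
    try {
      KuduSession session = client.newSession();
      session.setFlushMode(FlushMode.AUTO_FLUSH_BACKGROUND);

      // ... apply inserts with session.apply(...) here ...

      session.flush();
      // Rows that timed out or otherwise failed show up as pending errors; a
      // loader that never looks at them can appear to "lose" rows.
      if (session.countPendingErrors() > 0) {
        for (RowError error : session.getPendingErrors().getRowErrors()) {
          System.err.println("Failed row: " + error);
        }
      }
      session.close();
    } finally {
      client.close();
    }
  }
}
```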

On Mon, Apr 24, 2017 at 2:38 PM, David Alves <davidral...@gmail.com> wrote:

> Hi Jason
>
>   What do you mean that 2% are missing? Were you not able to insert them
> (got a timeout) or where there no errors but you can't see the rows as the
> result of a scan?
>   How are you checking that all the rows are there? Through a regular scan
> in spark? In particular the default ReadMode for scans makes no guarantees
> about replica recency, so it might happen that when you kill a tablet
> server, the other chosen replica is not up-to-date and returns fewer rows.
> In this case it's not that the rows are missing just that the replica that
> served the scan doesn't have them yet.
>   These kinds of checks should likely be done with the READ_AT_SNAPSHOT
> ReadMode but even if you can't change ReadModes, do you still observe that
> rows are missing if you run the scans again?
>   Currently some throttling might be required to make sure that the
> clients don't overload the server with writes which causes writes to start
> timing out. More efficient bulk loads is something we're working on right
> now.
>
> Best
> David
>
>
> On Sat, Apr 22, 2017 at 6:48 AM, Jason Heo <jason.heo@gmail.com>
> wrote:
>
>> Hi.
>>
>> I'm using Apache Kudu 1.2. I'm currently testing high availability of
>> Kudu.
>>
>> During bulk loading, one tserver is stopped via CDH Manager intentionally
>> and 2% of rows are missing.
>>
>> I use Spark 1.6 and package org.apache.kudu:kudu-spark_2.10:1.1.0 for
>> bulk loading.
>>
>> I got an error several times during insertion. Although 2% is lost when the
>> tserver is stopped and not started again, if I start it right after stopping
>> it, there is no loss even though I get the same error messages.
>>
>>
>> I watched Comcast's recent presentation at Strata Hadoop, They said that
>>
>>
>> Spark is recommended for large inserts to ensure handling failures
>>>
>>>
>> I'm curious whether Comcast had any issues with tserver failures, and how I
>> can prevent rows from being lost.
>>
>> --
>>
>> Below is an spark error message. ("01db64" is the killed one.)
>>
>>
>> java.lang.RuntimeException: failed to write 2 rows from DataFrame to
>> Kudu; sample errors: Timed out: RPC can not complete before timeout:
>> Batch{operations=2, tablet='1e83668a9fa44883897474eaa20a7cad'
>> [0x0001323031362D3036, 0x0001323031362D3037),
>> ignoreAllDuplicateRows=false, rpc=KuduRpc(method=Write,
>> tablet=1e83668a9fa44883897474eaa20a7cad, attempt=25,
>> DeadlineTracker(timeout=3, elapsed=29298), Traces: [0ms] sending RPC to
>> server 01d513bc5c1847c29dd89c3d21a1eb64, [589ms] received from server
>> 01d513bc5c1847c29dd89c3d21a1eb64 response Network error: [Peer
>> 01d513bc5c1847c29dd89c3d21a1eb64] Connection reset, [589ms] delaying RPC
>> due to Network error: [Peer 01d513bc5c1847c29dd89c3d21a1eb64] Connection
>> reset, [597ms] querying master, [597ms] Sub rpc: GetTableLocations sending
>> RPC to server 50cb634c24ef426c9147cc4b7181ca11, [599ms] Sub rpc:
>> GetTableLocations sending RPC to server 50cb634c24ef426c9147cc4b7181ca11,
>> [643ms
>> ...
>> ...
>> received from server 01d513bc5c1847c29dd89c3d21a1eb64 response Network
>> error: [Peer 01d513bc5c1847c29dd89c3d21a1eb64] Connection reset,
>> [29357ms] delaying RPC due to Network error: [Peer
>> 01d513bc5c1847c29dd89c3d21a1eb64] Connection reset)}
>> at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.
>> apply(KuduContext.scala:184)
>> at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.
>> apply(KuduContext.scala:179)
>> at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfu
>> n$apply$33.apply(RDD.scala:920)
>> at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfu
>> n$apply$33.apply(RDD.scala:920)
>> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkC
>> ontext.scala:1869)
>> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkC
>> ontext.scala:1869)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>> Executor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>> lExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> --
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Table size is not decreasing after large amount of rows deleted.

2017-04-24 Thread Todd Lipcon
Mike can correct me if wrong, but I think the background task in 1.3 is
only responsible for removing old deltas, and doesn't do anything to try to
trigger compactions on rowsets with a high percentage of deleted _rows_.

That's a separate bit of work that hasn't been started yet.

-Todd

On Sat, Apr 22, 2017 at 7:36 PM, Jason Heo <jason.heo@gmail.com> wrote:

> Hi David.
>
> Thank you for your reply.
>
> I'll try to upgrade to 1.3 this week.
>
> Regards,
>
> Jason
>
> 2017-04-23 2:06 GMT+09:00 <davidral...@gmail.com>:
>
>> Hi Jason
>>
>>   In Kudu 1.2, if there are compactions happening, they will reclaim
>> space. Unfortunately, the conditions for this to happen don't always
>> occur (if the portion of the keyspace where the deletions occurred has
>> stopped receiving writes and was already fully compacted, cleanup is
>> more unlikely).
>>   In Kudu 1.3 we added a background task to clean up old data even in
>> the absence of compactions. Could you upgrade?
>>
>> Best
>> David
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: tserver died during bulk indexing and dies again after restarting

2017-04-24 Thread Todd Lipcon
One other idea: if you can share a *redacted* WAL dump, that might also
help us understand the issue. For example:

$ kudu wal dump /data/1/kudu/wals//wal-1  --redact=all |
gzip - > /tmp/wal.txt.gz

The "--redact" flag will ensure that no cell data is present in the WAL.
For example, from one of our test clusters:

op 33: INSERT (int64 ps_partkey=, int64
ps_suppkey=, int64 ps_availqty=, double
ps_supplycost=, string ps_comment=)

You can view the resulting wal.txt.gz file before sending it to confirm
that nothing sensitive is included.

Thanks
-Todd


On Mon, Apr 24, 2017 at 10:39 AM, David Alves <davidral...@gmail.com> wrote:

> Hi Jason
>
>   No problem. Sorry if I misunderstood your previous email.
>   If you could share the log files themselves that would be great, if not,
> that's ok too.
>   You could use the kudu tool to delete the local replica for that tablet
> (without a running tserver daemon), but it's likely that it's been gone a
> while and kicked out of most if not all consensus configs, at which point,
> if all your data is available, you could just delete the data and re-add it
> to the cluster.
>
> Best
> David
>
>
> On Mon, Apr 24, 2017 at 4:33 AM, Jason Heo <jason.heo@gmail.com>
> wrote:
>
>> Hi David.
>>
>> Thank you for your kind reply.
>>
>> I understood but I'm afraid I can't provide my WAL because it has
>> sensitive data, even via your private email.
>>
>> Regards,
>>
>> Jason
>>
>> 2017-04-24 15:12 GMT+09:00 David Alves <davidral...@gmail.com>:
>>
>>> Hi Jason
>>>
>>>   I meant the last wal segment for the 30aaccdf7c8c496a8ad73255856a1724
>>> tablet on the dead server (if you don't have sensitive data in there)
>>>   Not sure whether you specified the flag: "--fs_wal_dir". If so it
>>> should be in there, if not the wals are in the same dir as the value set
>>> for "--fs_data_dirs".
>>>   A wal file has a name like: "wal-1"
>>>
>>> Best
>>> David
>>>
>>>
>>> On Sat, Apr 22, 2017 at 7:46 PM, Jason Heo <jason.heo@gmail.com>
>>> wrote:
>>>
>>>> Hi David.
>>>>
>>>> Sorry for the insufficient information.
>>>>
>>>> There are 14 nodes in my test Kudu cluster. Only one tserver has died.
>>>> It has only the above two logs.
>>>>
>>>> The other 13 nodes have the "Error trying to read ahead of the log while
>>>> preparing peer request: Incomplete: Op with" error 7~10 times.
>>>>
>>>> >> *Would it be possible to also get the WAL with the corrupted entry?*
>>>>
>>>> Would you please explain how to get it in more detail?
>>>>
>>>> I tried what I did again and again to reproduce the same error, but it
>>>> didn't happen again.
>>>>
>>>> Please feel free to ask me for anything what you need to resolve.
>>>>
>>>> Regards,
>>>>
>>>> Jason
>>>>
>>>> 2017-04-23 1:56 GMT+09:00 <davidral...@gmail.com>:
>>>>
>>>>> Hi Jason
>>>>>
>>>>>   Anything else of interest in those logs? Can you share them (with
>>>>> just me, if you prefer)? Would it be possible to also get the WAL with
>>>>> the corrupted entry?
>>>>>   Did this happen on a single server?
>>>>>
>>>>> Best
>>>>> David
>>>>>
>>>>
>>>>
>>>
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


[ANNOUNCE] Apache Kudu 1.3.1 released

2017-04-19 Thread Todd Lipcon
The Apache Kudu team is happy to announce the release of Kudu 1.3.1.

Kudu is an open source storage engine for structured data which supports
low-latency random access together with efficient analytical access
patterns. It is designed within the context of the Apache Hadoop ecosystem
and supports many integrations with other data analytics projects both
inside and outside of the Apache Software Foundation.

Apache Kudu 1.3.1 is a bug fix release which fixes critical issues
discovered in Apache Kudu 1.3.0. In particular, this fixes a bug in which
data could be incorrectly deleted after certain sequences of node failures.
Several other bugs are also fixed. See the release notes for details.

Users of Kudu 1.3.0 are encouraged to upgrade to 1.3.1 immediately.

Download it here: http://kudu.apache.org/releases/1.3.1/
Full release notes:
http://kudu.apache.org/releases/1.3.1/docs/release_notes.html

Regards,
The Apache Kudu team


Re: Building from Source fails on my CentOS 7.2

2017-04-17 Thread Todd Lipcon
Hmm. I think kudu should work fine against your old version, but you may
need to explicitly make sure it picks up the right path for all the related
libraries at link time (libkrb5, k5crypto, and krb5support). If it mixed
versions across these you will probably see errors.

Todd

On Apr 17, 2017 8:01 PM, "Jason Heo" <jason.heo@gmail.com> wrote:

> Hi Todd.
>
> Good point!
>
> It turned out that the customized krb5 library was pre-installed by my
> infrastructure team, and it is an old version.
>
> $ ldd /usr/lib64/libkrb5.so | grep k5cry
> libk5crypto.so.3 => /usr/path/to/lib/libk5crypto.so.3 (0x7f4f17b23000)
>
> Thanks,
>
> Jason.
>
> 2017-04-18 4:00 GMT+09:00 Todd Lipcon <t...@cloudera.com>:
>
>> Hi Jason,
>>
>> This is interesting. It seems like for some reason your libkrb5.so isn't
>> properly linked against libkrb5support.so. On a fresh CentOS 7.3 system I
>> just booted, after installing krb5-devel packages, I see the symbols
>> defined in the expected libraries:
>>
>> [root@todd-el7 ~]# ldd /usr/lib64/libkrb5.so | grep k5cry
>> libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x7f8c68d34000)
>> [root@todd-el7 ~]# objdump -T /lib64/libk5crypto.so.3 | grep
>> enctype_to_name
>> 00019cd0 gDF .text  012b  k5crypto_3_MIT
>> krb5_enctype_to_name
>>
>> Do you have the MIT krb5 dev libraries installed, or is it possible you
>> have heimdal or some other krb5 implementation?
>>
>> -Todd
>>
>> On Thu, Apr 13, 2017 at 10:00 PM, Jason Heo <jason.heo@gmail.com>
>> wrote:
>>
>>> Hello.
>>>
>>> I'm using CentOS 7.2
>>>
>>> To build from Source Code, I followed the manual
>>> <https://kudu.apache.org/docs/installation.html#build_from_source> (except
>>> for Red Hat Developer Toolset because I use CentOS 7.2)
>>>
>>> Though I failed to compile :(
>>>
>>> ```
>>> ...
>>> [ 31%] Building CXX object src/kudu/master/CMakeFiles/mas
>>> ter.dir/sys_catalog.cc.o
>>> [ 31%] Building CXX object src/kudu/master/CMakeFiles/mas
>>> ter.dir/ts_descriptor.cc.o
>>> [ 31%] Building CXX object src/kudu/master/CMakeFiles/mas
>>> ter.dir/ts_manager.cc.o
>>> [ 31%] Linking CXX static library ../../../lib/libmaster.a
>>> [ 31%] Built target master
>>> [ 31%] Built target krb5_realm_override
>>> Scanning dependencies of target kudu-master
>>> [ 32%] Building CXX object src/kudu/master/CMakeFiles/kud
>>> u-master.dir/master_main.cc.o
>>> [ 32%] Linking CXX executable ../../../bin/kudu-master
>>> /usr/lib64/libkrb5.so: undefined reference to
>>> `krb5_enctype_to_name@k5crypto_3_MIT'
>>> /usr/lib64/libkrb5.so: undefined reference to
>>> `k5_buf_free@krb5support_0_MIT'
>>> /usr/lib64/libkrb5.so: undefined reference to
>>> `krb5int_utf8_to_ucs4@krb5support_0_MIT'
>>> ...
>>> ```
>>>
>>> BTW, I'm building from source code so that I can use `kudu fs check`. Can I
>>> use the "fs check" command if I check out the master branch?
>>>
>>> Versions
>>> ---
>>>
>>> $ rpm -qf /usr/lib64/libkrb5.so
>>> krb5-devel-1.14.1-27.el7_3.x86_64 <= Newest version for CentOS 7.2 yum
>>> repos.
>>>
>>> $ g++ --version
>>> g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
>>> Copyright (C) 2015 Free Software Foundation, Inc.
>>> This is free software; see the source for copying conditions.  There is
>>> NO
>>> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
>>> PURPOSE.
>>>
>>>
>>> Thanks,
>>>
>>> Jason.
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


Re: Building from Source fails on my CentOS 7.2

2017-04-17 Thread Todd Lipcon
Hi Jason,

This is interesting. It seems like for some reason your libkrb5.so isn't
properly linked against libkrb5support.so. On a fresh CentOS 7.3 system I
just booted, after installing krb5-devel packages, I see the symbols
defined in the expected libraries:

[root@todd-el7 ~]# ldd /usr/lib64/libkrb5.so | grep k5cry
libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x7f8c68d34000)
[root@todd-el7 ~]# objdump -T /lib64/libk5crypto.so.3 | grep enctype_to_name
00019cd0 gDF .text  012b  k5crypto_3_MIT
krb5_enctype_to_name

Do you have the MIT krb5 dev libraries installed, or is it possible you
have heimdal or some other krb5 implementation?

-Todd

On Thu, Apr 13, 2017 at 10:00 PM, Jason Heo <jason.heo@gmail.com> wrote:

> Hello.
>
> I'm using CentOS 7.2
>
> To build from Source Code, I followed the manual
> <https://kudu.apache.org/docs/installation.html#build_from_source> (except
> for Red Hat Developer Toolset because I use CentOS 7.2)
>
> Though I failed to compile :(
>
> ```
> ...
> [ 31%] Building CXX object src/kudu/master/CMakeFiles/
> master.dir/sys_catalog.cc.o
> [ 31%] Building CXX object src/kudu/master/CMakeFiles/
> master.dir/ts_descriptor.cc.o
> [ 31%] Building CXX object src/kudu/master/CMakeFiles/
> master.dir/ts_manager.cc.o
> [ 31%] Linking CXX static library ../../../lib/libmaster.a
> [ 31%] Built target master
> [ 31%] Built target krb5_realm_override
> Scanning dependencies of target kudu-master
> [ 32%] Building CXX object src/kudu/master/CMakeFiles/
> kudu-master.dir/master_main.cc.o
> [ 32%] Linking CXX executable ../../../bin/kudu-master
> /usr/lib64/libkrb5.so: undefined reference to `krb5_enctype_to_name@
> k5crypto_3_MIT'
> /usr/lib64/libkrb5.so: undefined reference to
> `k5_buf_free@krb5support_0_MIT'
> /usr/lib64/libkrb5.so: undefined reference to `krb5int_utf8_to_ucs4@
> krb5support_0_MIT'
> ...
> ```
>
> BTW, I'm building from source code so that I can use `kudu fs check`. Can I
> use the "fs check" command if I check out the master branch?
>
> Versions
> ---
>
> $ rpm -qf /usr/lib64/libkrb5.so
> krb5-devel-1.14.1-27.el7_3.x86_64 <= Newest version for CentOS 7.2 yum
> repos.
>
> $ g++ --version
> g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
> Copyright (C) 2015 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>
>
> Thanks,
>
> Jason.
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: How to flush `block_cache_capacity_mb` easily?

2017-04-17 Thread Todd Lipcon
Hey Jason,

Looks like approximately the right track. A few notes, though:

- In the RPC implementation (TabletServiceImpl::ClearCache) make sure to
call rpc->RespondSuccess(). Otherwise the RPC will never respond, and the
client will time out (plus you'll leak some memory on the server)
- the authorization should probably be SuperUser
- in ClearCache, it doesn't seem like you're actually removing the elements
from the LRU list itself, so I think you'd likely crash soon after calling
the RPC. Also, the DCHECK in ClearCache() doesn't apply in the case that
you're triggering it administratively.
- ClearCache needs to lock the cache's mutex

Aside from the above specific issues, I think it would be good to add some
integration testing -- eg calling the ClearCache RPC against a tablet
server while a mixed workload is going on, probably with the cache
configured to be small enough that there is cache churn. We'd also want to
have a command line tool action like 'kudu tserver clear_cache' to trigger
this administratively.

-Todd

On Mon, Apr 17, 2017 at 5:21 AM, Jason Heo <jason.heo@gmail.com> wrote:

> Hi, Todd.
>
> I've temporarily pushed this patch to my repository.
>
> https://github.com/jason-heo/kudu/commit/aff1fe181541671d2dc192ad9cb4ed
> 2172a51826
>
> Could you please check I'm on right track?
>
> It will take more time until I push to Cloudera's gerrit because I have
> yet to test whether my modification works well, and I'm not familiar with the
> contributing process <https://kudu.apache.org/docs/contributing.html>.
>
> Thanks,
>
> Jason
>
> 2017-04-11 12:55 GMT+09:00 Todd Lipcon <t...@cloudera.com>:
>
>> Sure. Here's a high-level overview of the approach:
>>
>> - in src/kudu/util/cache.h, you'll need to add a new method like
>> 'ClearCache'. In cache.cc and nvm_cache.cc you'll need to implement the
>> method. You could implement it for the NVM cache to just return
>> Status::NotSupported() if your main concern is the default (DRAM) cache.
>> - in tserver_service.proto, add a new RPC method called 'ClearCache'
>> - in tserver.proto, define its request/response protobufs. They can
>> probably be empty
>> - in tablet_service.h, tablet_service.cc implement the new method. It can
>> call through to BlockCache::GetInstance()->ClearCache() and then
>> RespondSuccess
>> - in tablet_server-test.cc add a test case which exercises this path
>>
>> Hope that helps
>>
>> -Todd
>>
>> On Mon, Apr 10, 2017 at 6:14 PM, Jason Heo <jason.heo@gmail.com>
>> wrote:
>>
>>> Great. I would appreciate it if you could guide me on how I can contribute.
>>> Then I'll try in my spare time.
>>>
>>> 2017-04-11 7:46 GMT+09:00 Todd Lipcon <t...@cloudera.com>:
>>>
>>>> On Sun, Apr 9, 2017 at 6:38 PM, Jason Heo <jason.heo@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Todd.
>>>>>
>>>>> I hope you had a good weekend.
>>>>>
>>>>> Exactly, I'm testing the latency of cold-cache reads from SATA disks
>>>>> and performance of difference schema designs as well.
>>>>>
>>>>> We currently using Elasticsearch for a analytic service. ES has a
>>>>> "clear cache API" feature, it makes me easy to test.
>>>>>
>>>>>
>>>> Makes sense. I don't think it would be particularly difficult to add
>>>> such an API. Any interest in contributing a patch? I'm happy to point you
>>>> in the right direction, if so.
>>>>
>>>> -Todd
>>>>
>>>>
>>>>> 2017-04-08 5:05 GMT+09:00 Todd Lipcon <t...@cloudera.com>:
>>>>>
>>>>>> Hey Jason,
>>>>>>
>>>>>> Can I ask what the purposes of the testing is?
>>>>>>
>>>>>> One thing to note is that we're currently leaving a fair bit of
>>>>>> performance on the table for cold-cache reads from spinning disks. So, if
>>>>>> you find that the performance is not satisfactory, it's worth being aware
>>>>>> that we will likely make some significant improvements in this area in 
>>>>>> the
>>>>>> future.
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/KUDU-1289 has some details.
>>>>>>
>>>>>> -Todd
>>>>>>
>>>>>> On Fri, Apr 7, 2017 at 8:44 AM, Dan Burkert <danburk...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> 

Re: Question about redistributing tablets on failure of a tserver.

2017-04-12 Thread Todd Lipcon
On Wed, Apr 12, 2017 at 9:45 PM, Jason Heo <jason.heo@gmail.com> wrote:

> Hi Dan.
>
> I'm very happy to hear from you. Kudu is REALLY GREAT!
>
>
Thanks for the excitement! It's always great to hear when people are happy
with the project.


> About Q2:
>
> There are 14 tservers on my test cluster; each node has 3TB before
> re-replication, evenly distributed. Network bandwidth is 1Gbps.
>
> I have another question.
>
> Is it possible to cancel re-replication if the failed tserver rejoins while
> re-replication is going on? The rejoined tserver already has all the data,
> so I think re-replication is unnecessary and a waste of time and resources.
> (This is how Elasticsearch behaves.)
>

Yes, that's definitely something we'd like to do in the near future.

Right now our design is that when the leader notices a bad replica, it
ejects it from the Raft configuration, so we have a 2-node configuration.
We then immediately add a new replica and start making a tablet copy to it,
which may take some time with large tablets. During that time, if the old
node comes back, it is no longer part of the configuration and can't rejoin.

Mike Percy has started looking into changing the design to do something
more like:

- Original 3 nodes: A, B, C = VOTER
- node C dies
- add node D as a NON_VOTER/PRE_VOTER, and start the tablet copy
- if node C comes back up, remove D and cancel the tablet copy
- if node C is still not up when 'D' is available, evict C and convert D to
VOTER

Implementation isn't begun yet, but hopefully we can get this done in the
next couple of months (eg 1.4 or 1.5 release time line)

-Todd


>
> 2017-04-13 3:47 GMT+09:00 Dan Burkert <danburk...@apache.org>:
>
>> Hi Jason, answers inline:
>>
>> On Wed, Apr 12, 2017 at 5:53 AM, Jason Heo <jason.heo@gmail.com>
>> wrote:
>>
>>>
>>> Q1. Can I disable redistributing tablets on failure of a tserver? The
>>> reason why I need this is described in Background.
>>>
>>
>> We don't have any kind of built-in maintenance mode that would prevent
>> this, but it can be achieved by setting a flag on each of the tablet
>> servers.  The goal is not to disable re-replicating tablets, but instead to
>> avoid kicking the failed replica out of the tablet groups to begin with.
>> There is a config flag to control exactly that: 'evict_failed_followers'.
>> This isn't considered a stable or supported flag, but it should have the
>> effect you are looking for, if you set it to false on each of the tablet
>> servers, by running:
>>
>> kudu tserver set-flag  evict_failed_followers false
>> --force
>>
>> for each tablet server.  When you are done, set it back to the default
>> 'true' value.  This isn't something we routinely test (especially setting
>> it without restarting the server), so please test before trying this on a
>> production cluster.
>>
>> Q2. Redistribution goes on even if the failed tserver reconnects to the
>>> cluster. In my test cluster, it took 2 hours to redistribute when a tserver
>>> which had 3TB of data was killed.
>>>
>>
>> This seems slow.  What's the speed of your network?  How many nodes?  How
>> many tablet replicas were on the failed tserver, and were the replica sizes
>> evenly balanced?  Next time this happens, you might try monitoring with
>> 'kudu ksck' to ensure there aren't additional problems in the cluster (admin 
>> guide
>> on the ksck tool
>> <https://github.com/apache/kudu/blob/master/docs/administration.adoc#ksck>
>> ).
>>
>>
>>> Q3. `--follower_unavailable_considered_failed_sec` can be changed
>>> without restarting cluster?
>>>
>>
>> The flag can be changed, but it comes with the same caveats as above:
>>
>> 'kudu tserver set-flag  
>> follower_unavailable_considered_failed_sec
>> 900 --force'
>>
>>
>> - Dan
>>
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: How to flush `block_cache_capacity_mb` easily?

2017-04-10 Thread Todd Lipcon
Sure. Here's a high-level overview of the approach:

- in src/kudu/util/cache.h, you'll need to add a new method like
'ClearCache'. In cache.cc and nvm_cache.cc you'll need to implement the
method. You could implement it for the NVM cache to just return
Status::NotSupported() if your main concern is the default (DRAM) cache.
- in tserver_service.proto, add a new RPC method called 'ClearCache'
- in tserver.proto, define its request/response protobufs. They can
probably be empty
- in tablet_service.h, tablet_service.cc implement the new method. It can
call through to BlockCache::GetInstance()->ClearCache() and then
RespondSuccess
- in tablet_server-test.cc add a test case which exercises this path

Hope that helps

-Todd

On Mon, Apr 10, 2017 at 6:14 PM, Jason Heo <jason.heo@gmail.com> wrote:

> Great. I would appreciate it if you could guide me on how I can contribute it.
> Then I'll try in my spare time.
>
> 2017-04-11 7:46 GMT+09:00 Todd Lipcon <t...@cloudera.com>:
>
>> On Sun, Apr 9, 2017 at 6:38 PM, Jason Heo <jason.heo@gmail.com>
>> wrote:
>>
>>> Hi Todd.
>>>
>>> I hope you had a good weekend.
>>>
>>> Exactly, I'm testing the latency of cold-cache reads from SATA disks and
>>> the performance of different schema designs as well.
>>>
>>> We are currently using Elasticsearch for an analytics service. ES has a
>>> "clear cache API" feature, which makes it easy to test.
>>>
>>>
>> Makes sense. I don't think it would be particularly difficult to add such
>> an API. Any interest in contributing a patch? I'm happy to point you in the
>> right direction, if so.
>>
>> -Todd
>>
>>
>>> 2017-04-08 5:05 GMT+09:00 Todd Lipcon <t...@cloudera.com>:
>>>
>>>> Hey Jason,
>>>>
>>>> Can I ask what the purpose of the testing is?
>>>>
>>>> One thing to note is that we're currently leaving a fair bit of
>>>> performance on the table for cold-cache reads from spinning disks. So, if
>>>> you find that the performance is not satisfactory, it's worth being aware
>>>> that we will likely make some significant improvements in this area in the
>>>> future.
>>>>
>>>> https://issues.apache.org/jira/browse/KUDU-1289 has some details.
>>>>
>>>> -Todd
>>>>
>>>> On Fri, Apr 7, 2017 at 8:44 AM, Dan Burkert <danburk...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Jason,
>>>>>
>>>>> There is no command to have Kudu evict its block cache, but restarting
>>>>> the tablet server process will have that effect.  Ideally all written data
>>>>> will be flushed before the restart, otherwise startup/bootstrap will take 
>>>>> a
>>>>> bit longer. Flushing typically happens within 60s of the last write.
>>>>> Waiting for flush and compaction is also a best-practice for read-only
>>>>> benchmarks.  I'm not sure if someone else on the list has an easier way of
>>>>> determining when a flush happens, but I typically look at the 'MemRowSet'
>>>>> memory usage for the tablet on the /mem-trackers HTTP endpoint; it should
>>>>> show something minimal like 256B if it's fully flushed and empty.  You can
>>>>> also see details about how much memory is in the block cache on that page,
>>>>> if that interests you.
>>>>>
>>>>> - Dan
>>>>>
>>>>> On Thu, Apr 6, 2017 at 11:23 PM, Jason Heo <jason.heo@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> I'm using Apache Kudu 1.2 on CDH 5.10.
>>>>>>
>>>>>> Currently, I'm doing a performance test of Kudu.
>>>>>>
>>>>>> Flushing OS Page Cache is easy, but I don't know how to flush
>>>>>> `block_cache_capacity_mb` easily.
>>>>>>
>>>>>> I currently execute a SELECT statement over an unrelated table to
>>>>>> evict the cached blocks of the table under test.
>>>>>>
>>>>>> It is cumbersome, so I'd like to know whether there is a command for
>>>>>> flushing block caches (or other Kudu caches which I don't know about yet).
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> Regards,
>>>>>> Jason
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: How to flush `block_cache_capacity_mb` easily?

2017-04-10 Thread Todd Lipcon
On Sun, Apr 9, 2017 at 6:38 PM, Jason Heo <jason.heo@gmail.com> wrote:

> Hi Todd.
>
> I hope you had a good weekend.
>
> Exactly, I'm testing the latency of cold-cache reads from SATA disks and
> the performance of different schema designs as well.
>
> We are currently using Elasticsearch for an analytics service. ES has a
> "clear cache API" feature, which makes it easy to test.
>
>
Makes sense. I don't think it would be particularly difficult to add such
an API. Any interest in contributing a patch? I'm happy to point you in the
right direction, if so.

-Todd


> 2017-04-08 5:05 GMT+09:00 Todd Lipcon <t...@cloudera.com>:
>
>> Hey Jason,
>>
>> Can I ask what the purpose of the testing is?
>>
>> One thing to note is that we're currently leaving a fair bit of
>> performance on the table for cold-cache reads from spinning disks. So, if
>> you find that the performance is not satisfactory, it's worth being aware
>> that we will likely make some significant improvements in this area in the
>> future.
>>
>> https://issues.apache.org/jira/browse/KUDU-1289 has some details.
>>
>> -Todd
>>
>> On Fri, Apr 7, 2017 at 8:44 AM, Dan Burkert <danburk...@apache.org>
>> wrote:
>>
>>> Hi Jason,
>>>
>>> There is no command to have Kudu evict its block cache, but restarting
>>> the tablet server process will have that effect.  Ideally all written data
>>> will be flushed before the restart, otherwise startup/bootstrap will take a
>>> bit longer. Flushing typically happens within 60s of the last write.
>>> Waiting for flush and compaction is also a best-practice for read-only
>>> benchmarks.  I'm not sure if someone else on the list has an easier way of
>>> determining when a flush happens, but I typically look at the 'MemRowSet'
>>> memory usage for the tablet on the /mem-trackers HTTP endpoint; it should
>>> show something minimal like 256B if it's fully flushed and empty.  You can
>>> also see details about how much memory is in the block cache on that page,
>>> if that interests you.
>>>
>>> - Dan
>>>
>>> On Thu, Apr 6, 2017 at 11:23 PM, Jason Heo <jason.heo@gmail.com>
>>> wrote:
>>>
>>>> Hi.
>>>>
>>>> I'm using Apache Kudu 1.2 on CDH 5.10.
>>>>
>>>> Currently, I'm doing a performance test of Kudu.
>>>>
>>>> Flushing OS Page Cache is easy, but I don't know how to flush
>>>> `block_cache_capacity_mb` easily.
>>>>
>>>> I currently execute a SELECT statement over an unrelated table to
>>>> evict the cached blocks of the table under test.
>>>>
>>>> It is cumbersome, so I'd like to know whether there is a command for
>>>> flushing block caches (or other Kudu caches which I don't know about yet).
>>>>
>>>> Thanks.
>>>>
>>>> Regards,
>>>> Jason
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: How to flush `block_cache_capacity_mb` easily?

2017-04-07 Thread Todd Lipcon
Hey Jason,

Can I ask what the purpose of the testing is?

One thing to note is that we're currently leaving a fair bit of performance
on the table for cold-cache reads from spinning disks. So, if you find that
the performance is not satisfactory, it's worth being aware that we will
likely make some significant improvements in this area in the future.

https://issues.apache.org/jira/browse/KUDU-1289 has some details.

-Todd

On Fri, Apr 7, 2017 at 8:44 AM, Dan Burkert <danburk...@apache.org> wrote:

> Hi Jason,
>
> There is no command to have Kudu evict its block cache, but restarting the
> tablet server process will have that effect.  Ideally all written data will
> be flushed before the restart, otherwise startup/bootstrap will take a bit
> longer. Flushing typically happens within 60s of the last write.  Waiting
> for flush and compaction is also a best-practice for read-only benchmarks.
> I'm not sure if someone else on the list has an easier way of determining
> when a flush happens, but I typically look at the 'MemRowSet' memory usage
> for the tablet on the /mem-trackers HTTP endpoint; it should show something
> minimal like 256B if it's fully flushed and empty.  You can also see
> details about how much memory is in the block cache on that page, if that
> interests you.
>
> - Dan
>
> On Thu, Apr 6, 2017 at 11:23 PM, Jason Heo <jason.heo@gmail.com>
> wrote:
>
>> Hi.
>>
>> I'm using Apache Kudu 1.2 on CDH 5.10.
>>
>> Currently, I'm doing a performance test of Kudu.
>>
>> Flushing OS Page Cache is easy, but I don't know how to flush
>> `block_cache_capacity_mb` easily.
>>
>> I currently execute a SELECT statement over an unrelated table to evict
>> the cached blocks of the table under test.
>>
>> It is cumbersome, so I'd like to know whether there is a command for
>> flushing block caches (or other Kudu caches which I don't know about yet).
>>
>> Thanks.
>>
>> Regards,
>> Jason
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: How to calculate the optimal value of `maintenance_manager_num_threads`

2017-03-27 Thread Todd Lipcon
Hi Jason,

On Fri, Mar 24, 2017 at 1:39 AM, Jason Heo <jason.heo@gmail.com> wrote:

> Hi,
>
> I'm using Apache Kudu 1.2 on CDH 5.10.
>
> Recently, after reading "Bulk write performance improvements for Kudu 1.4
> <https://docs.google.com/document/d/1U1IXS1XD2erZyq8_qG81A1gZaCeHcq2i0unea_eEf5c/edit>"
> I've noticed that `maintenance_manager_num_threads` is 4 for the 5
> spinning disks.
>
>
Yes, but I wouldn't take that as necessarily optimal. I'm now doing some
tests with 8 threads as a comparison point.


> In my cluster, each node has 10 SATA disks with RAID 1+0 (the WAL and data
> directory are located on the same partition). As Todd suggested, bulk loading
> is done in PK-sorted order. CPU usage and system load on my cluster are not
> high at the moment, so I think the thread count could be increased a bit more.
>
> Would someone please suggest a number for my environment?
>

Increasing the number of maintenance threads may help if you are falling
behind on compaction and flushes. For compaction, you can tell if you are
falling behind by looking at the "bloom_lookups_per_op" metric. For
flushes, you may be falling behind if you see a lot of "memory pressure
rejections". One area for improvement in our tooling is adding some more
scripts and tools to make this kind of diagnosis easier.

In general, it's a tradeoff: more MM threads means more resource
consumption, but possibly better performance. The tradeoff may be
non-linear, though (i.e. doubling MM threads won't double performance!)

As Kudu is still a young project, we're still gathering operational
experience from users around topics like this. It would be great if you could
share any results you find back with the community.

Thanks

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Kudu on top of Alluxio

2017-03-27 Thread Todd Lipcon
On Sat, Mar 25, 2017 at 2:54 PM, Mike Percy <mpe...@apache.org> wrote:

> Kudu currently relies on local storage on a POSIX file system. Right now
> there is no support for S3, which would be interesting but is non-trivial
> in certain ways (particularly if we wanted to rely on S3's replication and
> disable Kudu's app-level replication).
>
> I would suggest using only either EXT4 or XFS file systems for production
> deployments as of Kudu 1.3, in a JBOD configuration, with one SSD per
> machine for the WAL and with the data disks on either SATA or SSD drives
> depending on the workload. Anything else is untested AFAIK.
>

I would amend this and say that SSD for the WAL is nice to have, but not a
requirement. We do lots of testing on non-SSD test clusters and I'm aware
of many production clusters which also do not have SSD.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


[ANNOUNCE] Apache Kudu 1.3.0 released

2017-03-20 Thread Todd Lipcon
The Apache Kudu team is happy to announce the release of Kudu 1.3.0.

Kudu is an open source storage engine for structured data which supports
low-latency random access together with efficient analytical access
patterns. It is designed within the context of the Apache Hadoop ecosystem
and supports many integrations with other data analytics projects both
inside and outside of the Apache Software Foundation.

Apache Kudu 1.3 is a minor release which adds various new features,
improvements, bug fixes, and optimizations on top of Kudu 1.2. Highlights
include:

- significantly improved support for security, including Kerberos
authentication, TLS encryption, and coarse-grained (cluster-level)
authorization
- automatic garbage collection of historical versions of data
- lower space consumption and better performance in default configurations.

The above list of changes is non-exhaustive. Please refer to the release
notes below for an expanded list of important improvements, bug fixes, and
incompatible changes before upgrading:

Download it here: http://kudu.apache.org/releases/1.3.0/
Full release notes:
http://kudu.apache.org/releases/1.3.0/docs/release_notes.html

Thanks to the 25 developers who contributed code or documentation to this
release!

Regards,
The Apache Kudu team


Re: How to reuse tablet server UUID, or removing old one

2017-03-09 Thread Todd Lipcon
It should disappear when you next restart the masters. They don't persist
the list of tservers, but rather learn about them dynamically when they
come up.

-Todd

On Thu, Mar 9, 2017 at 11:05 AM, Alexandre Fouché <afou...@onfocus.io>
wrote:

> Oh OK, so since I need to do the same disk replacement/repartitioning on all
> the other nodes, I suppose tabletserver5 will get populated when I delete the
> other tablet servers one by one.
>
> And indeed, I now see from the web UI that all my tablets still have 3
> replicas, so Kudu must have re-established a replication factor of 3 when
> tabletserver5 was considered dead. (I wonder how long after, though.)
>
> Yet, is it possible to have the masters forget about the dead tablet
> server UUID, so that it does not show up in the web UI anymore? Or will it
> disappear after a while, or maybe after a Kudu restart?
>
>
> 2017-03-09 17:48 GMT+01:00 Jean-Daniel Cryans <jdcry...@apache.org>:
>
>> Hi Alexandre,
>>
>> Tablet replicas are not tied to a UUID, so removing or reusing one
>> wouldn't achieve what you want. The main thing missing here is that Kudu
>> doesn't do tablet re-balancing at runtime, so tabletserver5 will get
>> tablets the next time a node dies or if you create new tables.
>>
>> Obviously that's something we'd like to address but nobody has come
>> around to doing it so far.
>>
>> Sincèrement,
>>
>> J-D
>>
>> On Thu, Mar 9, 2017 at 6:34 AM, Alexandre Fouché <afou...@onfocus.io>
>> wrote:
>>
>>> Hi all
>>>
>>> I have searched and searched, but could not find how to indicate that a
>>> previous tablet server UUID is to be removed.
>>> I had to replace disks on tabletserver5, so I deleted all its data, since
>>> there were replicas on other servers. Now that I have restarted Kudu on
>>> tabletserver5 with empty data, it initialised fine, but since it has a new
>>> UUID (I saw this afterwards in the web UI), it is recognised as a new
>>> server, and it does not resync its tablet replicas with the other tablet
>>> servers. It has the same hostname as before but a different UUID.
>>>
>>> And in the web UI, I see the same tabletserver5 with the previous UUID
>>> marked as 'dead'.
>>>
>>> How can I tell Kudu to completely remove the dead tabletserver5 UUID and
>>> populate the new tabletserver5 UUID instead?
>>>
>>> The `kudu` command-line tool does not seem to allow deleting a tablet
>>> server UUID, or decommissioning one, so how can I do this?
>>>
>>> Or, the other way around: how can I recreate an empty Kudu tablet server
>>> reusing my old UUID?
>>>
>>
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: mixing range and hash partitioning

2017-02-28 Thread Todd Lipcon
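The conflict is expected: when a table is created with range partition columns
but no explicit splits or bounds, it gets a single range partition covering the
whole keyspace, so any range partition added later overlaps it. If you instead
define explicit range bounds at creation time, you can keep adding
non-overlapping ranges afterwards. Below is a rough, untested sketch using the
Java client (the C++ client exposes equivalent calls on the table creator and
alterer); the master address, table name, and bound values are just
placeholders:

import java.util.Arrays;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.AlterTableOptions;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.PartialRow;

public class RangeAndHashExample {
  public static void main(String[] args) throws Exception {
    // Placeholder master address -- adjust for your cluster.
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      Schema schema = new Schema(Arrays.asList(
          new ColumnSchema.ColumnSchemaBuilder("time", Type.INT64).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("key", Type.INT64).key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("value", Type.INT64).build()));

      // Hash-partition on "key", range-partition on "time", and define an
      // explicit initial range instead of relying on the default unbounded one.
      CreateTableOptions create = new CreateTableOptions()
          .addHashPartitions(Arrays.asList("key"), 2)
          .setRangePartitionColumns(Arrays.asList("time"))
          .setNumReplicas(1);
      PartialRow lower = schema.newPartialRow();
      lower.addLong("time", 0L);
      PartialRow upper = schema.newPartialRow();
      upper.addLong("time", 1000L);               // covers [0, 1000)
      create.addRangePartition(lower, upper);
      client.createTable("test_table", schema, create);

      // Later: add the next, non-overlapping range [1000, 2000).
      PartialRow lower2 = schema.newPartialRow();
      lower2.addLong("time", 1000L);
      PartialRow upper2 = schema.newPartialRow();
      upper2.addLong("time", 2000L);
      client.alterTable("test_table",
          new AlterTableOptions().addRangePartition(lower2, upper2));
    } finally {
      client.close();
    }
  }
}
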
> kudu/integration-tests/alter_table-test.cc#L1106.
>
> - Dan
>
> On Fri, Feb 24, 2017 at 10:53 AM, Paul Brannan <
> paul.bran...@thesystech.com> wrote:
>
> I'm trying to create a table with one-column range-partitioned and another
> column hash-partitioned.  Documentation for add_hash_partitions and
> set_range_partition_columns suggests this should be possible ("Tables must
> be created with either range, hash, or range and hash partitioning").
>
> I have a schema with three INT64 columns ("time", "key", and "value").
> When I create the table, I set up the partitioning:
>
> (*table_creator)
>   .table_name("test_table")
>   .schema()
>   .add_hash_partitions({"key"}, 2)
>   .set_range_partition_columns({"time"})
>   .num_replicas(1)
>   .Create()
>
> I later try to add a partition:
>
> auto timesplit(KuduSchema & schema, std::int64_t t) {
>   auto split = schema.NewRow();
>   check_ok(split->SetInt64("time", t));
>   return split;
> }
>
> alterer->AddRangePartition(
>   timesplit(schema, date_start),
>   timesplit(schema, next_date_start));
>
> check_ok(alterer->Alter());
>
> But I get an error "Invalid argument: New range partition conflicts with
> existing range partition".
>
> How are hash and range partitioning intended to be mixed?
>
>
>
>
>
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: File descriptor limit for WAL

2017-02-24 Thread Todd Lipcon
On Fri, Feb 24, 2017 at 12:39 PM, Adar Dembo <a...@cloudera.com> wrote:

> It's definitely safe to increase the ulimit for open files; we
> typically test with higher values (like 32K or 64K). We don't use
> select(2) directly; any fd polling in Kudu is done via libev which I
> believe uses epoll(2) under the hood. There's one other place where we
> use ppoll() (in RPC negotiation), but no select().
>
A bit of historical curiosity: we actually had this bug a few years back
and fixed it, see 82cf3724077a8fb639a44dd86f04d10ecbedabf4.

-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'

2017-02-16 Thread Todd Lipcon
On Tue, Feb 14, 2017 at 11:51 PM, Frank Heimerzheim <fh.or...@gmail.com>
wrote:

> Hello Todd,
>
> I was naive enough to assume decent automatic type assignment. With an
> explicit type assignment via the schema, everything works fine.
>
> From a Pythonic viewpoint, having to work out data types myself is not what I
> want to do all day. But this is a philosophical discussion, and I got caught
> up in that way of thinking just now.
>
> I expected that the attempt to store 42 in an int8 would work and the
> attempt to store 4242 would raise an error. But the driver is not
> "intelligent" enough to test every individual value; it checks the data type
> once. Not Pythonic, but now that I'm really aware of the topic: no
> further problem.
>

Yea, I see that the "static typing" we are doing isn't very pythonic.

Our thinking here (which really comes from the thinking on the C++ side) is
that, given the underlying Kudu data has static column typing, we didn't
want to have a scenario where someone uses a too-small column, and then
tests on data that happens to fit in range. Then, they get a surprise one
day when all of their inserts start failing with "data out of range for
int32" errors or whatever. Forcing people to evaluate the column sizes up
front avoids nasty surprises later.

But, maybe you can see my biases towards static-typed languages leaking
through here ;-)

-Todd


> 2017-02-14 19:44 GMT+01:00 Todd Lipcon <t...@cloudera.com>:
>
>> Hi Frank,
>>
>> Could you try something like:
>>
>> data = [(42, 2017, 'John')]
>> schema = StructType([
>> StructField("id", ByteType(), True),
>> StructField("year", ByteType(), True),
>> StructField("name", StringType(), True)])
>> df = sqlContext.createDataFrame(data, schema)
>>
>> That should explicitly set the types (based on my reading of the pyspark
>> docs for createDataFrame)
>>
>> -Todd
>>
>>
>> On Tue, Feb 14, 2017 at 1:11 AM, Frank Heimerzheim <fh.or...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> here a snippet which produces the error.
>>>
>>> Call from the shell:
>>> spark-submit --jars /opt/storage/data_nfs/cloudera/pyspark/libs/kudu-spark_2.10-1.2.0.jar test.py
>>>
>>>
>>> Snippet from the python-code test.py:
>>>
>>> (..)
>>> builder = kudu.schema_builder()
>>> builder.add_column('id', kudu.int64, nullable=False)
>>> builder.add_column('year', kudu.int32)
>>> builder.add_column('name', kudu.string)
>>> (..)
>>>
>>> (..)
>>> data = [(42, 2017, 'John')]
>>> df = sqlContext.createDataFrame(data, ['id', 'year', 'name'])
>>> df.write.format('org.apache.kudu.spark.kudu')\
>>>     .option('kudu.master', kudu_master)\
>>>     .option('kudu.table', kudu_table)\
>>>     .mode('append')\
>>>     .save()
>>> (..)
>>>
>>> Error:
>>> 17/02/13 12:59:24 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 
>>> 4.0 (TID 6, ls00152y.xxx.com, partition 1,PROCESS_LOCAL, 2096 bytes)
>>> 17/02/13 12:59:24 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 
>>> 4.0 (TID 5) in 113 ms on ls00152y.xxx.com (1/2)
>>> 17/02/13 12:59:24 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 
>>> (TID 6, ls00152y.xx.com): java.lang.IllegalArgumentException: year isn't 
>>> [Type: int64, size: 8, Type: unixtime_micros, size: 8], it's int32
>>> at org.apache.kudu.client.PartialRow.checkColumn(PartialRow.java:462)
>>> at org.apache.kudu.client.PartialRow.addLong(PartialRow.java:217)
>>> at 
>>> org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:215)
>>> at 
>>> org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:205)
>>> at 
>>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>> at 
>>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>> at 
>>> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>>> at 
>>> org.apache.kudu.spark.kud

Re: Fetch row on the basis of composite primary key from table using Java API

2017-02-15 Thread Todd Lipcon
Hi Devender,

Yes, that's right -- create an equality predicate for all three components
of the key, and the Kudu client will be smart enough to fetch just that one
row from the server that hosts it.
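
Something like the following rough, untested sketch should do it (the master
address and the key values here are just placeholders for your own):

import org.apache.kudu.Schema;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduPredicate;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResult;

public class FetchOneRow {
  public static void main(String[] args) throws Exception {
    // Placeholder master address -- adjust for your cluster.
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      KuduTable table = client.openTable("metrics");
      Schema schema = table.getSchema();

      // One equality predicate per primary key component.
      KuduScanner scanner = client.newScannerBuilder(table)
          .addPredicate(KuduPredicate.newComparisonPredicate(
              schema.getColumn("host"), KuduPredicate.ComparisonOp.EQUAL, "host01.example.com"))
          .addPredicate(KuduPredicate.newComparisonPredicate(
              schema.getColumn("metric"), KuduPredicate.ComparisonOp.EQUAL, "cpu_user"))
          .addPredicate(KuduPredicate.newComparisonPredicate(
              schema.getColumn("time"), KuduPredicate.ComparisonOp.EQUAL, 1487152800000000L))
          .build();

      while (scanner.hasMoreRows()) {
        for (RowResult row : scanner.nextRows()) {
          System.out.println(row.getDouble("value"));
        }
      }
      scanner.close();
    } finally {
      client.close();
    }
  }
}

Since all three key columns are constrained by equality predicates, the scan is
pruned down to the single tablet that can contain that row.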

-Todd

On Wed, Feb 15, 2017 at 2:44 AM, Devender Yadav <
devender.ya...@impetus.co.in> wrote:

> Hi All,
>
>
> I am using Kudu Java API.
>
>
>
> I have a table metrics.
>
> CREATE TABLE metrics (
> host STRING NOT NULL,
> metric STRING NOT NULL,
> time INT64 NOT NULL,
> value DOUBLE NOT NULL,
> PRIMARY KEY (host, metric, time)
> );
>
>
> I need to fetch a row from the metrics table on the basis of its primary key.
>
>
> Do I need to create 3 comparison predicates, one for each of the primary key
> columns, using KuduPredicate.newComparisonPredicate()?
>
>
>
>
> Regards,
> Devender
>
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Adding examples to docs?

2017-02-13 Thread Todd Lipcon
Also, we've been discussing with a couple of Impala contributors whether
these docs should continue to live in the Kudu project long term.
It seems a bit redundant for us to try to document Impala, rather than just
linking to the Impala docs.

You can find some docs on using Impala with Kudu here:
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_kudu.html
(sorry to link to a vendor doc site -- Impala's still in the process of
incubation, and part of that involves moving more of the documentation
upstream)

Thanks
-Todd

On Sun, Feb 12, 2017 at 6:39 PM, Dan Burkert <danburk...@apache.org> wrote:

> Hi Darren,
>
> Assuming you are asking about Impala syntax, you can find some examples
> here: https://kudu.apache.org/docs/kudu_impala_integration.html#advanced_partitioning
>
> - Dan
>
> On Sun, Feb 12, 2017 at 6:37 PM, Darren Hoo <darren@gmail.com> wrote:
>
>> Specifically, what is the SQL syntax for multi-level partitioning?
>>
>> On Mon, Feb 13, 2017 at 9:53 AM, Darren Hoo <darren@gmail.com> wrote:
>>
>>> The documentation at https://kudu.apache.org/docs/schema_design.html is
>>> missing SQL examples.
>>>
>>> I cannot find the exact SQL syntax for partition management.
>>>
>>> Can this be added?
>>>
>>> Thanks in advance.
>>>
>>>
>>>
>>>
>>>
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

