Re: [ANNOUNCE] Welcoming Abhishek Chennaka as Kudu committer and PMC member
Congrats on the recognition of your work, Abhishek! Todd On Wed, Feb 22, 2023, 6:15 PM 邓科 wrote: > Congrats Abhishek!!! > > Yingchun Lai wrote on Thu, Feb 23, 2023 at 08:05: > > > Congrats! > > > > Mahesh Reddy wrote on Thu, Feb 23, 2023 at 03:07: > > > > > Congrats Abhishek!!! Great work and well deserved! > > > > > > On Wed, Feb 22, 2023 at 9:12 AM Andrew Wong wrote: > > > > > > > Hi Kudu community, > > > > > > > > I'm happy to announce that the Kudu PMC has voted to add Abhishek > > > Chennaka > > > > as a > > > > new committer and PMC member. > > > > > > > > Some of Abhishek's contributions include: > > > > - Improving the usability of ksck and backup/restore tooling > > > > - Introducing bootstrapping metrics and webserver pages > > > > - Adding alter-table support for the dynamic hash ranges project > > > > - Driving the on-going effort for partition-local incrementing > columns > > > > > > > > Abhishek has also been a helpful participant in the project Slack > and a > > > > steady > > > > reviewer on Gerrit. > > > > > > > > Please join me in congratulating Abhishek! > > > > > > -- > > Best regards, > > Yingchun Lai > > >
Re: Implications/downside of increasing rpc_service_queue_length
Hi Mauricio, Sorry for the late reply on this one. Hope "better late than never" is the case here :) As you implied in your email, the main issue with increasing queue length to deal with queue overflows is that it only helps with momentary spikes. According to queueing theory (and intuition), if the rate of arrival of entries into a queue is faster than the rate of processing items in that queue, then the queue length will grow. If this is a transient phenomenon (eg a quick burst of requests) then having a larger queue capacity will prevent overflows, but if this is a persistent phenomenon, then there is no length of queue that is sufficient to prevent overflows. The one exception is when the number of potential concurrent queue entries is itself bounded (eg because there is a bounded number of clients). According to the above theory, the philosophy behind the default short queue is that longer queues aren't a real solution if the cluster is overloaded. That said, if you think that the issues are just transient spikes rather than a capacity overload, it's possible that bumping the queue length (eg to 100) can help here. In terms of things to be aware of: having a longer queue means that the amount of memory taken by entries in the queue is increased proportionally. Currently, that memory is not tracked as part of Kudu's Memtracker infrastructure, but it does get accounted for in the global heap and can push the server into "memory pressure" mode where requests will start getting rejected, rowsets will get flushed, etc. I would recommend that if you increase your queues you make sure that you have a relatively larger memory limit allocated to your tablet servers and watch out for log messages and metrics indicating persistent memory pressure (particularly in the 80%+ range where things start getting dropped a lot). Long queues are also potentially an issue in terms of low-latency requests. The longer the queue (in terms of items), the longer the latency of elements waiting in that queue. If you have some element of latency SLAs, you should monitor them closely as you change queue length configuration. Hope that helps -Todd
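To put a rough number on the memory point above (the figures are purely illustrative, not measured from any real cluster): with --rpc_service_queue_length bumped to 100 and write requests averaging, say, 1 MB each, a fully backed-up service queue can pin on the order of 100 entries * 1 MB/entry = ~100 MB of request data on a tablet server, and today none of that shows up in the memtracker accounting. The worst case scales linearly with the queue length, so it's worth budgeting that extra headroom into the tablet server memory limit before raising the flag.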
Re: Kudu - Dremio
For what it's worth, Dremio is based on Apache Drill, which does have some integration. I think the integration was originally authored by early employees of Dremio, in fact. Might be worth reaching out to them about support. -Todd On Sun, Mar 29, 2020 at 11:43 AM Adar Lieber-Dembo wrote: > I don't believe that's true; there's support for Kudu in other SQL engines: > - Presto: https://prestodb.io/docs/current/connector/kudu.html > - Apache Hive: > https://cwiki.apache.org/confluence/display/Hive/Kudu+Integration > - Apache Drill: > https://github.com/apache/drill/tree/master/contrib/storage-kudu > > The exact level of support varies, but we've found it to be fairly > straight-forward to add Kudu support to existing SQL engines. > > > On Sun, Mar 29, 2020 at 10:21 AM Boris Tyukin > wrote: > > > > when I was looking at Dremio some time ago (very interesting technology > and I love the idea of query rewrites and materialized viewed federation > from different sources), it did not support Impala which you have to use > currently to get SQL support with Kudu. > > > > On Sun, Mar 29, 2020 at 11:44 AM pino patera > wrote: > >> > >> Hi > >> anyone integrated Kudu into Dremio (/www.dremio.com) data lake? > >> Any alternative suggestion (i.e. Presto?) ? > >> > >> Thanks > -- Todd Lipcon Software Engineer, Cloudera
Re: Write transactions latency
Hi Dmitry, I would guess that deleting a range partition could affect latency of other operations mostly because the deletion of a bunch of data on disk will result in some IO and filesystem metadata traffic. This load may impact other operations on the same node, though I wouldn't expect it to be drastic. Could you share a bit more data about what you're seeing? -Todd On Mon, Feb 24, 2020 at 12:27 AM Дмитрий Павлов wrote: > Hi > > Does update schema transactions (in my case removing old range partition) > effect write transactions latency for other range partition? Am asking > because i see some correlation between update schema operation(deleting > range partition) time and number transactions in-light on Kudu tablet > servers. > > Regards Dmitry > > > -- Todd Lipcon Software Engineer, Cloudera
Re: it is a good idea to dockerize kudu
Hi Zhike, Typically when running software like Kudu in containers, the data volumes would be bind-mounted, not using aufs or overlayfs, etc. So, there should be no measurable overhead. -Todd On Sun, Feb 16, 2020 at 7:21 PM Zhike Chen wrote: > Hi, > > I am going to deploy a kudu cluster in a production environment. I found > out > that there are no binary builds of kudu; it can only build from source or > use > the experimental docker image. Considering build from source and then > install it > natively take a lot of time, I am wondering it is a good idea to dockerize > kudu. > The problem I concern about dockerizing kudu is storage performance loss. I > found out this Docker storage driver benchmarks (last updated October 2017) > https://github.com/chriskuehl/docker-storage-benchmark > > Best regards, > Kyle Zhike Chen > -- Todd Lipcon Software Engineer, Cloudera
Re: Incorta vs Kudu
Hi Boris, I hadn't heard of Incorta before. I'm surprised you mentioned they're getting traction and buzz. The site definitely seems light on technical details, so can't really answer a comparison to Kudu. It does seem like they're more focused on batch than real-time. -Todd On Tue, Oct 15, 2019 at 7:33 AM Boris Tyukin wrote: > Hi guys, I was just reading about incorta. They get a lot of traction and > buzz recently. While they do not explain how it actually works but I got a > feeling their "secret" technology is very similar to Kudu. Just curious if > you looked at it and compared to Kudu/Impala combo. They mentioned their > super duper secret and efficient compaction process. > > https://incortablog.wordpress.com/2017/03/25/incorta-parquet-and-compaction/ > > > One difference I've noticed though they say they can do real-time but I > just watched a demo and looks like it is classical batch/incremental > process. > https://community.incorta.com/t/18d8x2/data-hubmaterialized-view-question > > > https://community.incorta.com/t/18jndy/what-are-the-types-of-data-load-that-incorta-supports > > > -- Todd Lipcon Software Engineer, Cloudera
Re: Underutilization of hardware resources with smaller number of Tservers
Would be useful to capture top -H during the workload as well to see if any particular threads are at 100%. Could be the reactor thread acting as a bottleneck On Fri, Jul 12, 2019, 10:54 AM Adar Lieber-Dembo wrote: > Thanks for the detailed summary and analysis. I want to make sure I > understand the overall cluster topology. You have two physical hosts, > each running one VM, with each VM running either 4 or 8 tservers. Is > that correct? > > Here are some other questions: > - Have you verified that your _client_ machines (i.e. the ETL job > machine or the machine running loadgen) aren't fully saturated? If it > is, that would explain why the server machines aren't fully utilized. > You mention adding more and more tservers until the hardware is > saturated so I imagine the client machines aren't the bottleneck, but > it's good to check. > - How is the memory utilization? Are the VMs overcommitted? Is there > any swapping? I noticed you haven't configured a memory limit for the > tservers, which means each will try to utilize 80% of the available > RAM on the machine. That's significant overcommitment when there are > multiple tservers in each VM. > - The tserver configuration shows three data directories per NVMe. Why > not just one data directory per NVMe? > - The schema is helpful, but it's not clear how many hash buckets are > in use. I imagine you vary that based on the number of tservers, but > it'd be good to know the details here. > > On an unrelated note, you mentioned using one master but the > configurations show a cluster with two. That's probably not related to > the performance issue, but you should know that a 2-master > configuration can't tolerate any faults; you should use 3 masters if > your goal is tolerate the loss of a master. > > > On Fri, Jul 12, 2019 at 9:19 AM Dmitry Degrave wrote: > > > > Hi All, > > > > We have a small 1.9.0 Kudu cluster for pre-production testing with > > rather powerful nodes. Each node has: > > > > 28 cores with HT (total 56 logical cores) > > Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz > > > > RAM: 503GiB > > 10 NVME drives > > > > We run OpenStack VMs on top of them, 1 VM per physical host. Each VM > > has 50 cores, 500GiB RAM and pci passthrough access to 4 NVMe drives. > > > > It is expected to see the same performance with a single tserver per > > VM comparing with multiple tservers per VM (giving the total number of > > MM threads per VM is equal) - based on discussion we had on > > getkudu/kudu-general channel. > > > > This is not what we actually have seen during the last months. Based > > on our experience, more tservers utilize hardware resources better and > > give better overall ETL performance than less tservers. > > > > We are trying to understand the reasons. > > > > To demonstrate the difference, below are results and configuration for > > running cluster on 2 VMs, each on separate physical host, with 4 and 8 > > tservers per VM and 17 maintenance manager threads per VM total. We > > run a single master per VM in both cases with identical config. > > > > We see the same pattern with more VMs in cluster and with different > > number of tservers per VM (keeping total number of maintenance manager > > threads per VM equal) - the more tservers per VM, the better > > performance (till certain limit when we reach h/w capacity). 
> > > > Starting 4 tservers per VM: > > https://gist.github.com/dnafault/96bcac24974ea4e50384ecd22161510d > > > > Starting 8 tservers per VM: > > https://gist.github.com/dnafault/a16ae893b881b44d95378b5e4a056862 > > > > Starting 1 master per VM: > > https://gist.github.com/dnafault/87ec3565e36b0a88568f8b86f37f9521 > > > > In both cases we start ETL from Spark node on separate physical host > > with identical hardware resources as above. ETL is finished in 910 min > > with 4 tservers and in 830 min with 8 tservers. > > > > Looking at how hardware is utilized, we can see that all resources are > > utilized less with 4 tservers: > > > > - CPUs (50 cores on VM): > > 4 tservers load avg: > > > https://user-images.githubusercontent.com/20377386/60939960-21ebee80-a31d-11e9-911a-1e5cc2abcb26.png > > 8 tservers load avg: > > > https://user-images.githubusercontent.com/20377386/60940404-b3a82b80-a31e-11e9-8ce7-2f843259bf91.png > > > > - Network: > > 4 tservers, kb: > > > https://user-images.githubusercontent.com/20377386/60940071-96bf2880-a31d-11e9-8aa6-6decaa3f651a.png > > 8 tservers, kb: > > > https://user-images.githubusercontent.com/20377386/60940503-0bdf2d80-a31f-11e9-85de-f5e4b0959c1b.png > > > > 4 tservers, pkts: > > > https://user-images.githubusercontent.com/20377386/60940063-92930b00-a31d-11e9-81cf-b7e9f05e53ec.png > > 8 tservers, pkts: > > > https://user-images.githubusercontent.com/20377386/60940441-d0dcfa00-a31e-11e9-8c03-330024339a34.png > > > > - NVMe drives: > > 4 tservers: > > > https://user-images.githubusercontent.com/20377386/60940249-2238b980-a31e-11e9-8fff-2233ce18f74d.png > > 8
Re:
On Tue, Jul 2, 2019 at 1:25 AM Дмитрий Павлов wrote: > > Hi guys > > I'm encountered a strange behaviour about replication of 2 tablets in my > table in Kudu cluster > The table was in UNDER REPLICATED status. So i stoped all activity on > cluster to make it cold. > Do you have the original ksck output for this table while it was in UNDER_REPLICATED state? Do you have the tserver and master logs from the time during which it was under-replicated? If you grep for this tablet ID you can hopefully find a reason why it was unable to re-replicate. > But even in 2 hours table was in UNDER REPLICATED state so i checked > rows_updated/rows_inserted metric and found out what replication process is > very slow 1-2 K rows per second. I checked the logs of 2 tservers where > tablets were located i found following errors: > Note that rows_updated and rows_inserted metrics are unrelated to re-replication of under-replicated tablets. They only represent the number of rows inserted/updated by end users of the tablet. Re-replication of a missing tablet happens by physical data copies, not by row-level operations. > > W0701 12:10:07.12425310 kernel_stack_watchdog.cc:198] Thread 141396 > stuck at /tmp/apache-kudu-1.9 > Kernel stack: > > [] 0x > > > > User stack: > > @ 0x7f04971a45d0 (unknown) > > @ 0xb4b21c kudu::consensus::LogCache::EvictSomeUnlocked() > > @ 0xb4bec6 kudu::consensus::LogCache::EvictThroughOp() > > @ 0xb47d2f > kudu::consensus::PeerMessageQueue::ResponseFromPeer() > @ 0xb490b1 > kudu::consensus::PeerMessageQueue::LocalPeerAppendFinished() > @ 0xb4bbcc kudu::consensus::LogCache::LogCallback() > > @ 0xb972d2 kudu::log::Log::AppendThread::HandleGroup() > > @ 0xb97c2d kudu::log::Log::AppendThread::DoWork() > > @ 0x1e5bdff kudu::ThreadPool::DispatchThread() > > @ 0x1e51634 kudu::Thread::SuperviseThread() > > @ 0x7f049719cdd5 start_thread > > @ 0x7f0495473ead __clone > This is just a warning about a potential latency blip, and likely completely unrelated to the problem you're reporting. -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: WAL size estimation
Hey Pavel, I went back and looked at the source here. It appears that 24MB is the expected size for an index file -- each entry is 24 bytes and the index file should keep 1M entries. That said, for a "cold tablet" (in which you'd have only a small number of actual WAL files) I would expect only a single index file. The example you gave where you have 12 index files but only one WAL segment seems quite fishy to me. Having 12 index files indicates you have 12M separate WAL entries, but given you have only 8MB of WAL, that indicates each entry is less than one byte large, which doesn't make much sense at all. If you go back and look at that same tablet now, did it eventually GC those log index files? -Todd On Wed, Jun 19, 2019 at 1:53 AM Pavel Martynov wrote: > > Try adding the '-p' flag here? That should show preallocated extents. > Would be interesting to run it on some index file which is larger than 1MB, > for example. > > # du -h --apparent-size index.00108 > 23M index.00108 > > # du -h index.00108 > 23M index.00108 > > # xfs_bmap -v -p index.00108 > index.00108: > EXT: FILE-OFFSET BLOCK-RANGEAG AG-OFFSET TOTAL > FLAGS >0: [0..2719]: 1175815920..1175818639 2 (3704560..3707279) 2720 > 0 >1: [2720..5111]:1175828904..1175831295 2 (3717544..3719935) 2392 > 0 >2: [5112..7767]:1175835592..1175838247 2 (3724232..3726887) 2656 > 0 >3: [7768..10567]: 1175849896..1175852695 2 (3738536..3741335) 2800 > 0 >4: [10568..15751]: 1175877808..1175882991 2 (3766448..3771631) 5184 > 0 >5: [15752..18207]: 1175898864..1175901319 2 (3787504..3789959) 2456 > 0 >6: [18208..20759]: 1175909192..1175911743 2 (3797832..3800383) 2552 > 0 >7: [20760..23591]: 1175921616..1175924447 2 (3810256..3813087) 2832 > 0 >8: [23592..26207]: 1175974872..1175977487 2 (3863512..3866127) 2616 > 0 >9: [26208..28799]: 1175989496..1175992087 2 (3878136..3880727) 2592 > 0 > 10: [28800..31199]: 1175998552..1176000951 2 (3887192..3889591) 2400 > 0 > 11: [31200..33895]: 1176008336..1176011031 2 (3896976..3899671) 2696 > 0 > 12: [33896..36591]: 1176031696..1176034391 2 (3920336..3923031) 2696 > 0 > 13: [36592..39191]: 1176037440..1176040039 2 (3926080..3928679) 2600 > 0 > 14: [39192..41839]: 1176072008..1176074655 2 (3960648..3963295) 2648 > 0 > 15: [41840..44423]: 1176097752..1176100335 2 (3986392..3988975) 2584 > 0 > 16: [44424..46879]: 1176132144..1176134599 2 (4020784..4023239) 2456 > 0 > > > > > > ср, 19 июн. 2019 г. в 10:56, Todd Lipcon : > >> >> >> On Wed, Jun 19, 2019 at 12:49 AM Pavel Martynov >> wrote: >> >>> Hi Todd, thanks for the answer! >>> >>> > Any chance you've done something like copy the files away and back >>> that might cause them to lose their sparseness? >>> >>> No, I don't think so. Recently we experienced some problems with >>> stability with Kudu, and ran rebalance a couple of times, if this related. >>> But we never used fs commands like cp/mv against Kudu dirs. >>> >>> I ran du on all-WALs dir: >>> # du -sh /mnt/data01/kudu-tserver-wal/ >>> 12G /mnt/data01/kudu-tserver-wal/ >>> >>> # du -sh --apparent-size /mnt/data01/kudu-tserver-wal/ >>> 25G /mnt/data01/kudu-tserver-wal/ >>> >>> And on WAL with a many indexes: >>> # du -sh --apparent-size >>> /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f >>> 306M >>> /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f >>> >>> # du -sh >>> /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f >>> 296M >>> /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f >>> >>> >>> > Also, any chance you're using XFS here? 
>>> >>> Yes, exactly XFS. We use CentOS 7.6. >>> >>> What is interesting, there are no many holes in index files in >>> /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f (WAL dir >>> that I mention before). Only single hole in single index file (of 13 files): >>> # xfs_bmap -v index.00120 >>> >> >> Try adding the '-p' flag here? That should show preallocated extents. >> Would be interesting to run it on some index file which is larger than 1MB, >> for example. >> >> >>> index.00120: >>> EXT: FILE-OFFSET
Re: Need information about internals of InList predicate
Hi Sergey, The optimization you're looking for is essentially to realize that IN-list on a primary key prefix can be converted as follows: scan(PK in (1,2,3)) -> scan(PK = 1 OR PK = 2 OR PK = 3) -> scan(PK = 1) union all scan(pk = 2) union all scan(PK = 3) Currently, the tserver doesn't support conversion of a single user-facing scan into multiple internal scan ranges in the general case. Doing so would require a bit of surgery on the tablet server to understand the concept that a scan has a set of disjoint PK ranges rather than a single range associated. I filed a JIRA to support this here: https://issues.apache.org/jira/browse/KUDU-2875 That said, there's a separate optimization which is simpler to implement, which is to notice within a given DiskRowSet (small chunk of rows) that only a single value in the IN-list can be present. In that case the IN-list can convert, locally, to an equality predicate which may be satisfied by a range scan or skipped entirely. I added this note to https://issues.apache.org/jira/browse/KUDU-1644 Thanks Todd On Tue, Jun 25, 2019 at 9:24 PM Sergey Olontsev wrote: > Does anyone could help to find more about how InList predicates work? > > I have a bunch of values (sometimes just a couple, or a few, but > potentially it could be tens or hundreds), and I have a table with primary > key on column for the searching values with hash partitioning. > > And I've notices, that several separate searches by primary key with > Comparison predicate usually work faster that one with InList predicate. > I'm looking and Scanners information on gui and see, that by using > Comparison predicate my app is reading only 1 block and it takes > miliseconds, but with InList predicate it reads ~1.6 blocks several times > (scanning with a batch of 1 million rows) and each scanner takes about > 1-1.5 seconds to complete. > > So, really need more information about how exactly InList predicates are > implemented and behave. Anyone could provide any links? Unfortunately, I > was unable find any information, a few JIRA tasks only, but that didn't > helped. > > https://issues.apache.org/jira/browse/KUDU-2853 > https://issues.apache.org/jira/browse/KUDU-1644 > > Best regards, Sergey. > -- Todd Lipcon Software Engineer, Cloudera
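In case a concrete example helps anyone following this thread, below is a rough sketch, using the Java client, of the two approaches Sergey compared. The master address, table name, and the assumption of an INT32 key column named "id" are placeholders, not anything from his setup. Until the JIRAs above are implemented, the second form (one short equality scan per value) is the one that can exploit a tight primary-key range scan for each key:

import java.util.Arrays;
import java.util.List;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.client.*;

public class InListVsEqualityScans {
  public static void main(String[] args) throws KuduException {
    try (KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
      KuduTable table = client.openTable("my_table");
      ColumnSchema keyCol = table.getSchema().getColumn("id");  // assumed INT32 primary key
      List<Integer> keys = Arrays.asList(1, 2, 3);

      // Option A: a single scan carrying an IN-list predicate.
      KuduScanner inListScanner = client.newScannerBuilder(table)
          .addPredicate(KuduPredicate.newInListPredicate(keyCol, keys))
          .build();
      drain(inListScanner);

      // Option B: one equality-predicate scan per key, which is what Sergey
      // observed to be faster for a small set of primary key values.
      for (int key : keys) {
        KuduScanner eqScanner = client.newScannerBuilder(table)
            .addPredicate(KuduPredicate.newComparisonPredicate(
                keyCol, KuduPredicate.ComparisonOp.EQUAL, key))
            .build();
        drain(eqScanner);
      }
    }
  }

  // Read and discard all rows from a scanner.
  private static void drain(KuduScanner scanner) throws KuduException {
    while (scanner.hasMoreRows()) {
      for (RowResult row : scanner.nextRows()) {
        // A real application would process each row here.
      }
    }
    scanner.close();
  }
}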
Re: WAL size estimation
On Wed, Jun 19, 2019 at 12:49 AM Pavel Martynov wrote: > Hi Todd, thanks for the answer! > > > Any chance you've done something like copy the files away and back that > might cause them to lose their sparseness? > > No, I don't think so. Recently we experienced some problems with stability > with Kudu, and ran rebalance a couple of times, if this related. But we > never used fs commands like cp/mv against Kudu dirs. > > I ran du on all-WALs dir: > # du -sh /mnt/data01/kudu-tserver-wal/ > 12G /mnt/data01/kudu-tserver-wal/ > > # du -sh --apparent-size /mnt/data01/kudu-tserver-wal/ > 25G /mnt/data01/kudu-tserver-wal/ > > And on WAL with a many indexes: > # du -sh --apparent-size > /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f > 306M/mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f > > # du -sh /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f > 296M/mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f > > > > Also, any chance you're using XFS here? > > Yes, exactly XFS. We use CentOS 7.6. > > What is interesting, there are no many holes in index files in > /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f (WAL dir > that I mention before). Only single hole in single index file (of 13 files): > # xfs_bmap -v index.00120 > Try adding the '-p' flag here? That should show preallocated extents. Would be interesting to run it on some index file which is larger than 1MB, for example. > index.00120: > EXT: FILE-OFFSET BLOCK-RANGEAG AG-OFFSET TOTAL >0: [0..4231]: 1176541248..1176545479 2 (4429888..4434119) 4232 >1: [4232..9815]:1176546592..1176552175 2 (4435232..4440815) 5584 >2: [9816..11583]: 1176552832..1176554599 2 (4441472..4443239) 1768 >3: [11584..13319]: 1176558672..1176560407 2 (4447312..4449047) 1736 >4: [13320..15239]: 1176565336..1176567255 2 (4453976..4455895) 1920 >5: [15240..17183]: 1176570776..1176572719 2 (4459416..4461359) 1944 >6: [17184..18999]: 1176575856..1176577671 2 (4464496..4466311) 1816 >7: [19000..20927]: 1176593552..1176595479 2 (4482192..4484119) 1928 >8: [20928..22703]: 1176599128..1176600903 2 (4487768..4489543) 1776 >9: [22704..24575]: 1176602704..1176604575 2 (4491344..4493215) 1872 > 10: [24576..26495]: 1176611936..1176613855 2 (4500576..4502495) 1920 > 11: [26496..26655]: 1176615040..1176615199 2 (4503680..4503839) 160 > 12: [26656..46879]: hole 20224 > > But in some other WAL I see like this: > # xfs_bmap -v > /mnt/data01/kudu-tserver-wal/wals/508ecdfa8904bdb97a02078a91822af/index.0 > > /mnt/data01/kudu-tserver-wal/wals/508ecdfa89054bdb97a02078a91822af/index.0: > EXT: FILE-OFFSET BLOCK-RANGEAG AG-OFFSETTOTAL >0: [0..7]: 1758753776..1758753783 3 (586736..586743) 8 >1: [8..46879]: hole 46872 > > Looks like there actually used only 8 blocks and all other blocks are the > hole. > > > So looks like I can use formulas with confidence. > Normal case: 8 MB/segment * 80 max segments * 2000 tablets = 1,280,000 MB > = ~1.3 TB (+ some minor index overhead) > Worse case: 8 MB/segment * 1 segment * 2000 tablets = 1,280,000 MB = ~16 > GB (+ some minor index overhead) > > Right? > > > ср, 19 июн. 2019 г. в 09:35, Todd Lipcon : > >> Hi Pavel, >> >> That's not quite expected. For example, on one of our test clusters here, >> we have about 65GB of WALs and about 1GB of index files. If I recall >> correctly, the index files store 8 bytes per WAL entry, so typically a >> couple orders of magnitude smaller than the WALs themselves. >> >> One thing is that the index files are sparse. 
Any chance you've done >> something like copy the files away and back that might cause them to lose >> their sparseness? If I use du --apparent-size on mine, it's total of about >> 180GB vs the 1GB of actual size. >> >> Also, any chance you're using XFS here? XFS sometimes likes to >> preallocate large amounts of data into files while they're open, and only >> frees it up if disk space is contended. I think you can use 'xfs_bmap' on >> an index file to see the allocation status, which might be interesting. >> >> -Todd >> >> On Tue, Jun 18, 2019 at 11:12 PM Pavel Martynov >> wrote: >> >>> Hi guys! >>> >>> We want to buy SSDs for TServers WALs for our cluster. I'm working on >>> capacity estimation
Re: WAL size estimation
Hi Pavel, That's not quite expected. For example, on one of our test clusters here, we have about 65GB of WALs and about 1GB of index files. If I recall correctly, the index files store 8 bytes per WAL entry, so typically a couple orders of magnitude smaller than the WALs themselves. One thing is that the index files are sparse. Any chance you've done something like copy the files away and back that might cause them to lose their sparseness? If I use du --apparent-size on mine, it's total of about 180GB vs the 1GB of actual size. Also, any chance you're using XFS here? XFS sometimes likes to preallocate large amounts of data into files while they're open, and only frees it up if disk space is contended. I think you can use 'xfs_bmap' on an index file to see the allocation status, which might be interesting. -Todd On Tue, Jun 18, 2019 at 11:12 PM Pavel Martynov wrote: > Hi guys! > > We want to buy SSDs for TServers WALs for our cluster. I'm working on > capacity estimation for this SSDs using "Getting Started with Kudu" book, > Chapter 4, Write-Ahead Log ( > https://www.oreilly.com/library/view/getting-started-with/9781491980248/ch04.html > <https://www.oreilly.com/library/view/getting-started-with/9781491980248/ch04.html#idm139738927926240> > ). > > NB: we use default Kudu WAL configuration settings. > > There is a formula for worse-case: > 8 MB/segment * 80 max segments * 2000 tablets = 1,280,000 MB = ~1.3 TB > > So, this formula takes into account only segment files. But in our > cluster, I see that every segment file has >= 1 corresponding index files. > And every index file actually larger than segment file. > > Numbers from one of our nodes. > WALs count: > $ ls /mnt/data01/kudu-tserver-wal/wals/ | wc -l > 711 > > Overall WAL size: > $ du -d 0 -h /mnt/data01/kudu-tserver-wal/ > 13G /mnt/data01/kudu-tserver-wal/ > > Size of all segment files: > $ find /mnt/data01/kudu-tserver-wal/ -type f -name 'wal-*' -exec du -ch {} > + | grep total$ > 6.1Gtotal > > Size of all index files: > $ find /mnt/data01/kudu-tserver-wal/ -type f -name 'index*' -exec du -ch > {} + | grep total$ > 6.5Gtotal > > So I have questions. > > 1. How can I estimate the size of index files? > Looks like in our cluster size of index files approximately equal to size > segment files. > > 2. There is some WALs with more than one index files. For example: > $ ls -lh > /mnt/data01/kudu-tserver-wal/wals/779a382ea4e6464aa80ea398070a391f/ > total 296M > -rw-r--r-- 1 root root 23M Jun 18 21:31 index.00108 > -rw-r--r-- 1 root root 23M Jun 18 21:41 index.00109 > -rw-r--r-- 1 root root 23M Jun 18 21:52 index.00110 > -rw-r--r-- 1 root root 23M Jun 18 22:10 index.00111 > -rw-r--r-- 1 root root 23M Jun 18 22:22 index.00112 > -rw-r--r-- 1 root root 23M Jun 18 22:35 index.00113 > -rw-r--r-- 1 root root 23M Jun 18 22:48 index.00114 > -rw-r--r-- 1 root root 23M Jun 18 23:01 index.00115 > -rw-r--r-- 1 root root 23M Jun 18 23:14 index.00116 > -rw-r--r-- 1 root root 23M Jun 18 23:27 index.00117 > -rw-r--r-- 1 root root 23M Jun 18 23:40 index.00118 > -rw-r--r-- 1 root root 23M Jun 18 23:52 index.00119 > -rw-r--r-- 1 root root 23M Jun 19 01:13 index.00120 > -rw-r--r-- 1 root root 8.0M Jun 19 01:13 wal-07799 > > Is this a normal situation? > > 3. Not a question. Please, consider adding documentation about the > estimation of WAL storage. Also, I can't found any mentions about index > files, except here > https://kudu.apache.org/docs/scaling_guide.html#file_descriptors. > > Thanks! 
> > -- > with best regards, Pavel Martynov > -- Todd Lipcon Software Engineer, Cloudera
Re: Kudu CLI tool JSON format
I guess the issue is that we use rapidjson's 'String' support to write out C++ strings, which are binary data, not valid UTF8. That's somewhat incorrect of us, and we should be base64-encoding such binary data. Fixing this is a little bit incompatible, but for something like partition keys I think we probably should do it anyway and release note it, considering partition keys are quite likely to be invalid UTF8. -Todd On Tue, Jun 11, 2019 at 6:08 AM Pavel Martynov wrote: > Hi, guys! > > We trying to use an output of "kudu cluster ksck master -ksck_format > json_compact" for integration with our monitoring system and hit a little > strange. Some part of output can't be read as UTF-8 with Python 3: > $ kudu cluster ksck master -ksck_format json_compact > kudu.json > $ python > with open(' kudu.json', mode='rb') as file: > bs = file.read() > bs.decode('utf-8') > UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position > 705196: invalid start byte > > There how SublimeText shows this block of text: > https://yadi.sk/i/4zpWKZ37iP8OEA > As you can see kudu tool encodes zeros as \u, but don't encode some > other non-text bytes. > > What do you think about it? > > -- > with best regards, Pavel Martynov > -- Todd Lipcon Software Engineer, Cloudera
[ANNOUNCE] Welcoming Yingchun Lai as a Kudu committer and PMC member
Hi Kudu community, I'm happy to announce that the Kudu PMC has voted to add Yingchun Lai as a new committer and PMC member. Yingchun has been contributing to Kudu for the last 6-7 months and contributed a number of bug fixes, improvements, and features, including: - new CLI tools (eg 'kudu table scan', 'kudu table copy') - fixes for compilation warnings, code cleanup, and usability improvements on the web UI - support for prioritization of tables for maintenance manager tasks - CLI support for config files to make it easier to connect to multi-master clusters Yingchun has also been contributing by helping new users on Slack, and helps operate 6 production clusters at Xiaomi, one of our larger installations in China. Please join me in congratulating Yingchun! -Todd
Re: problems with impala+kudu
This sounds like an error coming from the Impala layer -- Kudu doesn't use Thrift for communication internally. At first glance it looks like one of your Impala daemons may have crashed while trying to execute the query, so I'd look around for impalad logs indicating that. If you need further assistance I think the impala user mailing list may be able to help more. Thanks Todd On Fri, May 17, 2019 at 8:51 AM 林锦明 wrote: > Dear friends, > > When I try to insert data into kudu table by impala sql, here comes the > exception: TSocket read 0 bytes (code THRIFTTRANSPORT): > TTransportException('TSocket read 0 bytes',). > Could you pls tell me how to deal with this problem? By the way, the kudu > is installed by rpm, the relatived url: > https://github.com/MartinWeindel/kudu-rpm. > > > Best wishes. > yours truly, > Jack Lin > -- Todd Lipcon Software Engineer, Cloudera
Re: "broadcast" tablet replication for kudu?
Hey Boris, Sorry to say that the situation is still the same. -Todd On Wed, Apr 24, 2019 at 9:02 AM Boris Tyukin wrote: > sorry to revive the old thread but curious if there is a better solution 1 > year after...We have a few small tables (under 300k rows) which are > practically used with every single query and to make things worse joined > more than once in the same query. > > Is there a way to replicate this table on every node to improve > performance and avoid broadcasting this table every time? > > On Mon, Jul 23, 2018 at 10:52 AM Todd Lipcon wrote: > >> >> >> On Mon, Jul 23, 2018, 7:21 AM Boris Tyukin wrote: >> >>> Hi Todd, >>> >>> Are you saying that your earlier comment below is not longer valid with >>> Impala 2.11 and if I replicate a table to all our Kudu nodes Impala can >>> benefit from this? >>> >> >> No, the earlier comment is still valid. Just saying that in some cases >> exchange can be faster in the new Impala version. >> >> >>> " >>> *It's worth noting that, even if your table is replicated, Impala's >>> planner is unaware of this fact and it will give the same plan regardless. >>> That is to say, rather than every node scanning its local copy, instead a >>> single node will perform the whole scan (assuming it's a small table) and >>> broadcast it from there within the scope of a single query. So, I don't >>> think you'll see any performance improvements on Impala queries by >>> attempting something like an extremely high replication count.* >>> >>> *I could see bumping the replication count to 5 for these tables since >>> the extra storage cost is low and it will ensure higher availability of the >>> important central tables, but I'd be surprised if there is any measurable >>> perf impact.* >>> " >>> >>> On Mon, Jul 23, 2018 at 9:46 AM Todd Lipcon wrote: >>> >>>> Are you on the latest release of Impala? It switched from using Thrift >>>> for RPC to a new implementation (actually borrowed from kudu) which might >>>> help broadcast performance a bit. >>>> >>>> Todd >>>> >>>> On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin >>>> wrote: >>>> >>>>> sorry to revive the old thread but I am curious if there is a good way >>>>> to speed up requests to frequently used tables in Kudu. >>>>> >>>>> On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin >>>>> wrote: >>>>> >>>>>> bummer..After reading your guys conversation, I wish there was an >>>>>> easier way...we will have the same issue as we have a few dozens of >>>>>> tables >>>>>> which are used very frequently in joins and I was hoping there was an >>>>>> easy >>>>>> way to replicate them on most of the nodes to avoid broadcasts every time >>>>>> >>>>>> On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick < >>>>>> cresn...@mediamath.com> wrote: >>>>>> >>>>>>> The table in our case is 12x hashed and ranged by month, so the >>>>>>> broadcasts were often to all (12) nodes. >>>>>>> >>>>>>> On Apr 12, 2018 12:58 AM, Mauricio Aristizabal >>>>>>> wrote: >>>>>>> Sorry I left that out Cliff, FWIW it does seem to have been >>>>>>> broadcast.. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Not sure though how a shuffle would be much different from a >>>>>>> broadcast if entire table is 1 file/block in 1 node. >>>>>>> >>>>>>> On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick >>>>>>> wrote: >>>>>>> >>>>>>>> From the screenshot it does not look like there was a broadcast of >>>>>>>> the dimension table(s), so it could be the case here that the multiple >>>>>>>> smaller sends helps. 
Our dim tables are generally in the single-digit >>>>>>>> millions and Impala chooses to broadcast them. Since the fact result >>>>>>>> cardinality is always much smaller, we've found that forcing a >>>>>>>> [shuffle] >>>>>>>> dimension join is actually faster since it only sends dims once rather
Re: close Kudu client on timeout
On Thu, Jan 17, 2019 at 1:46 PM Boris Tyukin wrote: > Hi Alexey, > > it was "single idle Kudu Java client that created so many threads". > 20,000 threads in a few days to be precise :) that code is running > non-stop and basically listens to kafka topics, then for every batch from > kafka, we create new kudu client instance, upsert data and close client. > > the part we missed was *client.close()* in the end of that loop in the > code - once we put it in there, problem was solved. > > So it is hard to tell if it was Java GC or something else. > > But ideally, it would be nice, if Kudu server itself would kill idle > connections from clients on a timeout. I think Impala has similar global > setting. > > --rpc_default_keepalive_time_ms maybe it - I will look into this. > I don't think that will help. The Kudu client is built around Netty, which is an async networking framework that decouples threads from connections. That is to say, regardless of the TCP connections, each Kudu client that you create will create N netty worker threads, even when it has no TCP connections open. I do think it would make sense to have some sort of LOG.warn() if the KuduClient detects that there are more than 10 live clients or something, so that would make this issue more obvious. As for your use case, creating a new client for each batch seems somewhat heavyweight. Why are you doing that vs just creating a new session? -Todd > On Thu, Jan 17, 2019 at 2:51 PM Alexey Serbin > wrote: > >> Hi Boris, >> >> Kudu servers have a setting for connection inactivity period: idle >> connections to the servers will be automatically closed after the specified >> time (--rpc_default_keepalive_time_ms is the flag). So, from that >> perspective idle clients is not a big concern to the Kudu server side. >> >> As for your question, right now Kudu doesn't have a way to initiate a >> shutdown of an idle client from the server side. >> >> BTW, I'm curious what it was in your case you reported: were there too >> many idle Kudu client objects around created by the same application? Or >> that was something else, like a single idle Kudu Java client that created >> so many threads? >> >> >> Thanks, >> >> Alexey >> >> On Wed, Jan 16, 2019 at 1:31 PM Boris Tyukin >> wrote: >> >>> sorry it is Java >>> >>> On Wed, Jan 16, 2019 at 3:32 PM Mike Percy wrote: >>> >>>> Java or C++ / Python client? >>>> >>>> Mike >>>> >>>> Sent from my iPhone >>>> >>>> > On Jan 16, 2019, at 12:27 PM, Boris Tyukin >>>> wrote: >>>> > >>>> > Hi guys, >>>> > >>>> > is there a setting on Kudu server to close/clean-up inactive Kudu >>>> clients? >>>> > >>>> > we just found some rogue code that did not close client on code >>>> completion and wondering if we can prevent this in future on Kudu server >>>> level rather than relying on good developers. >>>> > >>>> > That code caused 22,000 threads opened on our edge node over the last >>>> few days. >>>> > >>>> > Boris >>>> >>>> -- Todd Lipcon Software Engineer, Cloudera
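To make that last point concrete, here's a minimal sketch of the pattern I'd suggest: one KuduClient for the lifetime of the consumer process, and a cheap short-lived session per Kafka batch. The master address, table/column names, and the pollKafkaBatch() stub are placeholders for illustration, not from the actual code being discussed:

import org.apache.kudu.client.*;

public class KafkaBatchWriter {
  public static void main(String[] args) throws KuduException {
    // One client (and its Netty worker threads) for the whole process lifetime.
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      KuduTable table = client.openTable("events");
      while (true) {
        Iterable<String[]> batch = pollKafkaBatch();  // placeholder for the Kafka poll

        // A session per batch is lightweight; creating a client per batch is not.
        KuduSession session = client.newSession();
        session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
        try {
          for (String[] record : batch) {
            Upsert upsert = table.newUpsert();
            upsert.getRow().addInt("key", Integer.parseInt(record[0]));
            upsert.getRow().addString("value", record[1]);
            session.apply(upsert);
          }
          session.flush();
        } finally {
          session.close();
        }
      }
    } finally {
      client.close();  // only on consumer shutdown
    }
  }

  // Stub standing in for the real Kafka consumer poll.
  private static Iterable<String[]> pollKafkaBatch() {
    return java.util.Collections.emptyList();
  }
}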
Re: trying to install kudu from source
r/share/kuduClient/cmake/kuduClientConfig.cmake > -- Munging kudu client targets in > /usr/share/kuduClient/cmake/kuduClientTargets-release.cmake > -- Munging kudu client targets in > /usr/share/kuduClient/cmake/kuduClientTargets.cmake > . > > > this is my install/docker script so far: > ROM debian:latest > # Install dependencies > RUN apt-get update && export DEBIAN_FRONTEND=noninteractive && \ > apt-get -y install apt-utils \ > aptitude \ > autoconf \ > automake \ > curl \ > dstat \ > emacs24-nox \ > flex \ > g++ \ > gcc \ > gdb \ > git \ > krb5-admin-server \ > krb5-kdc \ > krb5-user \ > libkrb5-dev \ > libsasl2-dev \ > libsasl2-modules \ > libsasl2-modules-gssapi-mit \ > libssl-dev \ > libtool \ > lsb-release \ > make \ > ntp \ > net-tools \ > openjdk-8-jdk \ > openssl \ > patch \ > python-dev \ > python-pip \ > python3-dev \ > python3 \ > python3-pip \ > pkg-config \ > python \ > rsync \ > unzip \ > vim-common \ > wget > > #Install Kudu > #RUN git clone https://github.com/apache/kudu \ > user@kudu.apache.orgWORKDIR / > RUN wget > https://www-us.apache.org/dist/kudu/1.8.0/apache-kudu-1.8.0.tar.gz > RUN mkdir -p /kudu && tar -xzf apache-kudu-1.8.0.tar.gz -C /kudu > --strip-components=1 > RUN ls / > > RUN cd /kudu \ > && thirdparty/build-if-necessary.sh > RUN cd /kudu && mkdir -p build/release \ > && cd /kudu/build/release \ > && ../../thirdparty/installed/common/bin/cmake -DCMAKE_BUILD_TYPE=release > -DCMAKE_INSTALL_PREFIX:PATH=/usr ../.. \ > && make -j4 > > RUN cd /kudu/build/release \ > && make install > > > > > > > -- Todd Lipcon Software Engineer, Cloudera
Re: strange behavior of getPendingErrors
Hey Alexey, I think your explanation makes sense from an implementation perspective. But, I think we should treat this behavior as a bug. From the user perspective, such an error is a per-row data issue and should only affect the row with the problem, not some arbitrary subset of rows in the batch which happened to share a partition. Does anyone disagree? Todd On Fri, Nov 16, 2018, 9:28 PM Alexey Serbin Hi Boris, > > Kudu clients (both Java and C++ ones) send write operations to > corresponding tablet servers in batches when using the AUTO_FLUSH_BACKGROUND > and MANUAL_FLUSH modes. When a tablet server receives a Write RPC > (WriteRequestPB is the corresponding type of the parameter), it decodes the > operations from the batch: > https://github.com/apache/kudu/blob/master/src/kudu/tablet/local_tablet_writer.h#L97 > > While decoding operations from a batch, various constraints are being > checked. One of those is checking for nulls in non-nullable columns. If > there is a row in the batch that violates the non-nullable constraint, the > whole batch is rejected. > > That's exactly what happened in your example: a batch to one tablet > consisted of 3 rows one of which had a row with violation of the > non-nullable constraint for the dt_tm column, so the whole batch of 3 > operations was rejected. You can play with different partition schemes: > e.g., in case of 10 hashed partitions it might happen that only 2 > operations would be rejected, in case of 30 partitions -- just the single > key==2 row could be rejected. > > BTW, that might also happen if using the MANUAL_FLUSH mode. However, with > the AUTO_FLUSH_SYNC mode, the client sends operations in batches of size 1. > > > Kind regards, > > Alexey > > On Fri, Nov 16, 2018 at 7:24 PM Boris Tyukin > wrote: > >> Hi Todd, >> >> We are on Kudu 1.5 still and I used Kudu client 1.7 >> >> Thanks, >> Boris >> >> On Fri, Nov 16, 2018, 17:07 Todd Lipcon > >>> Hi Boris, >>> >>> This is interesting. Just so we're looking at the same code, what >>> version of the kudu-client dependency have you specified, and what version >>> of the server? >>> >>> -Todd >>> >>> On Fri, Nov 16, 2018 at 1:12 PM Boris Tyukin >>> wrote: >>> >>>> Hey guys, >>>> >>>> I am playing with Kudu Java client (wow it is fast), using mostly code >>>> from Kudu Java example. >>>> >>>> While learning about exceptions during rows inserts, I stumbled upon >>>> something I could not explain. >>>> >>>> If I insert 10 rows into a brand new Kudu table (AUTO_FLUSH_BACKGROUND >>>> mode) and I make one row to be "bad" intentionally (one column cannot be >>>> NULL), I actually get 3 rows that cannot be inserted into Kudu, not 1 as I >>>> was expected. >>>> >>>> But if I do session.flush() after every single insert, I get only one >>>> error row (but this ruins the purpose of AUTO_FLUSH_BACKGROUND mode). >>>> >>>> Any ideas one? We cannot afford losing data and need to track all rows >>>> which cannot be inserted. >>>> >>>> AUTO_FLUSH mode works much better and I do not have an issue like >>>> above, but then it is way slower than AUTO_FLUSH_BACKGROUND. >>>> >>>> My code is below. It is in Groovy, but I think you will get an idea :) >>>> https://gist.github.com/boristyukin/8703d2c6ec55d6787843aa133920bf01 >>>> >>>> Here is output from my test code that hopefully illustrates my >>>> confusion - out of 10 rows inserted, 9 should be good and 1 bad, but it >>>> turns out Kudu flagged 3 as bad: >>>> >>>> Created table kudu_groovy_example >>>> Inserting 10 rows in AUTO_FLUSH_BACKGROUND flush mode ... 
>>>> (int32 key=1, string value="value 1", unixtime_micros >>>> dt_tm=2018-11-16T20:57:03.469000Z) >>>> (int32 key=2, string value=NULL) BAD ROW >>>> (int32 key=3, string value="value 3", unixtime_micros >>>> dt_tm=2018-11-16T20:57:03.595000Z) >>>> (int32 key=4, string value=NULL, unixtime_micros >>>> dt_tm=2018-11-16T20:57:03.596000Z) >>>> (int32 key=5, string value="value 5", unixtime_micros >>>> dt_tm=2018-11-16T20:57:03.597000Z) >>>> (int32 key=6, string value=NULL, unixtime_micros >>>> dt_tm=2018-11-16T20:57:03.597000Z) >>>> (int32 key=7, string value="value 7", unixtime_micros >>>> dt_tm
Re: strange behavior of getPendingErrors
Hi Boris, This is interesting. Just so we're looking at the same code, what version of the kudu-client dependency have you specified, and what version of the server? -Todd On Fri, Nov 16, 2018 at 1:12 PM Boris Tyukin wrote: > Hey guys, > > I am playing with Kudu Java client (wow it is fast), using mostly code > from Kudu Java example. > > While learning about exceptions during rows inserts, I stumbled upon > something I could not explain. > > If I insert 10 rows into a brand new Kudu table (AUTO_FLUSH_BACKGROUND > mode) and I make one row to be "bad" intentionally (one column cannot be > NULL), I actually get 3 rows that cannot be inserted into Kudu, not 1 as I > was expected. > > But if I do session.flush() after every single insert, I get only one > error row (but this ruins the purpose of AUTO_FLUSH_BACKGROUND mode). > > Any ideas one? We cannot afford losing data and need to track all rows > which cannot be inserted. > > AUTO_FLUSH mode works much better and I do not have an issue like above, > but then it is way slower than AUTO_FLUSH_BACKGROUND. > > My code is below. It is in Groovy, but I think you will get an idea :) > https://gist.github.com/boristyukin/8703d2c6ec55d6787843aa133920bf01 > > Here is output from my test code that hopefully illustrates my confusion - > out of 10 rows inserted, 9 should be good and 1 bad, but it turns out Kudu > flagged 3 as bad: > > Created table kudu_groovy_example > Inserting 10 rows in AUTO_FLUSH_BACKGROUND flush mode ... > (int32 key=1, string value="value 1", unixtime_micros > dt_tm=2018-11-16T20:57:03.469000Z) > (int32 key=2, string value=NULL) BAD ROW > (int32 key=3, string value="value 3", unixtime_micros > dt_tm=2018-11-16T20:57:03.595000Z) > (int32 key=4, string value=NULL, unixtime_micros > dt_tm=2018-11-16T20:57:03.596000Z) > (int32 key=5, string value="value 5", unixtime_micros > dt_tm=2018-11-16T20:57:03.597000Z) > (int32 key=6, string value=NULL, unixtime_micros > dt_tm=2018-11-16T20:57:03.597000Z) > (int32 key=7, string value="value 7", unixtime_micros > dt_tm=2018-11-16T20:57:03.598000Z) > (int32 key=8, string value=NULL, unixtime_micros > dt_tm=2018-11-16T20:57:03.602000Z) > (int32 key=9, string value="value 9", unixtime_micros > dt_tm=2018-11-16T20:57:03.603000Z) > (int32 key=10, string value=NULL, unixtime_micros > dt_tm=2018-11-16T20:57:03.603000Z) > 3 errors inserting rows - why 3 only 1 expected to be bad... > there were errors inserting rows to Kudu > the first few errors follow: > ??? key 1 and 6 supposed to be fine! > Row error for primary key=[-128, 0, 0, 1], tablet=null, server=null, > status=Invalid argument: No value provided for required column: > dt_tm[unixtime_micros NOT NULL] (error 0) > Row error for primary key=[-128, 0, 0, 2], tablet=null, server=null, > status=Invalid argument: No value provided for required column: > dt_tm[unixtime_micros NOT NULL] (error 0) > Row error for primary key=[-128, 0, 0, 6], tablet=null, server=null, > status=Invalid argument: No value provided for required column: > dt_tm[unixtime_micros NOT NULL] (error 0) > Rows counted in 485 ms > Table has 7 rows - ??? supposed to be 9! 
> INT32 key=4, STRING value=NULL, UNIXTIME_MICROS > dt_tm=2018-11-16T20:57:03.596000Z > INT32 key=8, STRING value=NULL, UNIXTIME_MICROS > dt_tm=2018-11-16T20:57:03.602000Z > INT32 key=9, STRING value=value 9, UNIXTIME_MICROS > dt_tm=2018-11-16T20:57:03.603000Z > INT32 key=3, STRING value=value 3, UNIXTIME_MICROS > dt_tm=2018-11-16T20:57:03.595000Z > INT32 key=10, STRING value=NULL, UNIXTIME_MICROS > dt_tm=2018-11-16T20:57:03.603000Z > INT32 key=5, STRING value=value 5, UNIXTIME_MICROS > dt_tm=2018-11-16T20:57:03.597000Z > INT32 key=7, STRING value=value 7, UNIXTIME_MICROS > dt_tm=2018-11-16T20:57:03.598000Z > > > > > -- Todd Lipcon Software Engineer, Cloudera
Re: cannot import kudu.client
Do you happen to have a directory called 'kudu' in your working directory? Sometimes python gets confused and imports something you didn't expect. The output of 'kudu.__file__' might give you a clue. -Todd On Fri, Aug 31, 2018 at 3:27 PM, veto wrote: > i installed and compiled successfully kudo on jessie, stretch and used > dockers on centos and ubutu. > > on all i installed python2.7 and pip in kudu-pyton==1.7.1 and 1.2.0 > successfully. > > i could successfully import kudo but it fails to import kudo.client > > here is the log: > > > (env) root@boot2docker:~/kudu# python > Python 2.7.12 (default, Dec 4 2017, 14:50:18) > [GCC 5.4.0 20160609] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> import kudu > >>> import kudu.client > Traceback (most recent call last): > File "", line 1, in > ImportError: No module named client > -- Todd Lipcon Software Engineer, Cloudera
Re: Dictionary encoding
Hi Saeid, It's not based on the number of distinct values, but rather on the combined size of the values. I believe the default is 256kb, so assuming your strings are pretty short, a few thousand are likely to be able to be dict-encoded. Note that dictionaries are calculated per-rowset (small chunk of data) so even if your overall cardinality is much larger, if you have some spatial locality such that rows with nearby primary keys have fewer distinct values, then you're likely to get benefit here. -Todd On Sat, Aug 4, 2018 at 8:10 AM, Saeid Sattari wrote: > Hi Kudu community, > > Does any body know what is the maximum distinct values of a String column > that Kudu considers in order to set its encoding to Dictionary? Many thanks > :) > > br, > > -- Todd Lipcon Software Engineer, Cloudera
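As a small illustration with the Java client (column names made up): STRING columns already get dictionary encoding by default through AUTO_ENCODING, so this only makes the choice explicit in the schema, but it shows where the knob lives. My understanding is that if a rowset's dictionary outgrows the size limit mentioned above, Kudu falls back to plain encoding for that data rather than failing the write:

import java.util.Arrays;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;

public class DictEncodedSchema {
  // Build a schema whose string column is explicitly dictionary-encoded.
  static Schema makeSchema() {
    ColumnSchema id = new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64)
        .key(true)
        .build();
    ColumnSchema city = new ColumnSchema.ColumnSchemaBuilder("city", Type.STRING)
        .nullable(true)
        .encoding(ColumnSchema.Encoding.DICT_ENCODING)
        .build();
    return new Schema(Arrays.asList(id, city));
  }
}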
Re: Re: Recommended maximum amount of stored data per tablet server
On Thu, Aug 2, 2018 at 4:54 PM, Quanlong Huang wrote: > Thank Adar and Todd! We'd like to contribute when we could. > > Are there any concerns if we share the machines with HDFS DataNodes and > Yarn NodeManagers? The network bandwidth is 10Gbps. I think it's ok if they > don't share the same disks, e.g. 4 disks for kudu and the other 11 disks > for DataNode and NodeManager, and leave enough CPU & mem for kudu. Is that > right? > That should be fine. Typically we actually recommend sharing all the disks for all of the services. There is a trade-off between static partitioning (exclusive access to a smaller number of disks) vs dynamic sharing (potential contention but more available resources). Unless your workload is very latency sensitive I usually think it's better to have the bigger pool of resources available even if it needs to share with other systems. One recommendation, though, is to consider using a dedicated disk for the Kudu WAL and metadata, which can help performance, since the WAL can be sensitive to other heavy workloads monopolizing bandwidth on the same spindle. -Todd > > At 2018-08-03 02:26:37, "Todd Lipcon" wrote: > > +1 to what Adar said. > > One tension we have currently for scaling is that we don't want to scale > individual tablets too large, because of problems like the superblock that > Adar mentioned. However, the solution of just having more tablets is also > not a great one, since many of our startup time problems are primarily > affected by the number of tablets more than their size (see KUDU-38 as the > prime, ancient, example). Additionally, having lots of tablets increases > raft heartbeat traffic and may need to dial back those heartbeat intervals > to keep things stable. > > All of these things can be addressed in time and with some work. If you > are interested in working on these areas to improve density that would be a > great contribution. > > -Todd > > > > On Thu, Aug 2, 2018 at 11:17 AM, Adar Lieber-Dembo > wrote: > >> The 8TB limit isn't a hard one, it's just a reflection of the scale >> that Kudu developers commonly test. Beyond 8TB we can't vouch for >> Kudu's stability and performance. For example, we know that as the >> amount of on-disk data grows, node restart times get longer and longer >> (see KUDU-2014 for some ideas on how to improve that). Furthermore, as >> tablets accrue more data blocks, their superblocks become larger, >> raising the minimum amount of I/O for any operation that rewrites a >> superblock (such as a flush or compaction). Lastly, the tablet copy >> protocol used in rereplication tries to copy the entire superblock in >> one RPC message; if the superblock is too large, it'll run up against >> the default 50 MB RPC transfer size (see src/kudu/rpc/transfer.cc). >> >> These examples are just off the top of my head; there may be others >> lurking. So this goes back to what I led with: beyond the recommended >> limit we aren't quite sure how Kudu's performance and stability are >> affected. >> >> All that said, you're welcome to try it out and report back with your >> findings. >> >> >> On Thu, Aug 2, 2018 at 7:23 AM Quanlong Huang >> wrote: >> > >> > Hi all, >> > >> > In the document of "Known Issues and Limitations", it's recommended >> that "maximum amount of stored data, post-replication and post-compression, >> per tablet server is 8TB". How is the 8TB calculated? >> > >> > We have some machines each with 15 * 4TB spinning disk drives and 256GB >> RAM, 48 cpu cores.
Does it mean the other 52(= 15 * 4 - 8) TB space is >> recommended to leave for other systems? We prefer to make the machine >> dedicated to Kudu. Can tablet server leverage the whole space efficiently? >> > >> > Thanks, >> > Quanlong >> > > > > -- > Todd Lipcon > Software Engineer, Cloudera > > -- Todd Lipcon Software Engineer, Cloudera
Re: Recommended maximum amount of stored data per tablet server
+1 to what Adar said. One tension we have currently for scaling is that we don't want to scale individual tablets too large, because of problems like the superblock that Adar mentioned. However, the solution of just having more tablets is also not a great one, since many of our startup time problems are primarily affected by the number of tablets more than their size (see KUDU-38 as the prime, ancient, example). Additionally, having lots of tablets increases raft heartbeat traffic and may need to dial back those heartbeat intervals to keep things stable. All of these things can be addressed in time and with some work. If you are interested in working on these areas to improve density that would be a great contribution. -Todd On Thu, Aug 2, 2018 at 11:17 AM, Adar Lieber-Dembo wrote: > The 8TB limit isn't a hard one, it's just a reflection of the scale > that Kudu developers commonly test. Beyond 8TB we can't vouch for > Kudu's stability and performance. For example, we know that as the > amount of on-disk data grows, node restart times get longer and longer > (see KUDU-2014 for some ideas on how to improve that). Furthermore, as > tablets accrue more data blocks, their superblocks become larger, > raising the minimum amount of I/O for any operation that rewrites a > superblock (such as a flush or compaction). Lastly, the tablet copy > protocol used in rereplication tries to copy the entire superblock in > one RPC message; if the superblock is too large, it'll run up against > the default 50 MB RPC transfer size (see src/kudu/rpc/transfer.cc). > > These examples are just off the top of my head; there may be others > lurking. So this goes back to what I led with: beyond the recommended > limit we aren't quite sure how Kudu's performance and stability are > affected. > > All that said, you're welcome to try it out and report back with your > findings. > > > On Thu, Aug 2, 2018 at 7:23 AM Quanlong Huang > wrote: > > > > Hi all, > > > > In the document of "Known Issues and Limitations", it's recommended that > "maximum amount of stored data, post-replication and post-compression, per > tablet server is 8TB". How is the 8TB calculated? > > > > We have some machines each with 15 * 4TB spinning disk drives and 256GB > RAM, 48 cpu cores. Does it mean the other 52(= 15 * 4 - 8) TB space is > recommended to leave for other systems? We prefer to make the machine > dedicated to Kudu. Can tablet server leverage the whole space efficiently? > > > > Thanks, > > Quanlong > -- Todd Lipcon Software Engineer, Cloudera
Re: Re: Re: Why RowSet size is much smaller than flush_threshold_mb
On Wed, Aug 1, 2018 at 4:52 PM, Quanlong Huang wrote: > In my experience, when I found the performance is below my expectation, > I'd like to tune flags listed in https://kudu.apache.org/ > docs/configuration_reference.html , which needs a clear understanding of > kudu internals. Maybe we can add the link there? > > Any particular flags that you found you had to tune? I almost never advise tuning anything other than the number of maintenance threads. If you have some good guidance on how tuning those flags can improve performance, maybe we can consider changing the defaults or giving some more prescriptive advice? I'm a little nervous that saying "here are all the internals, and here are 100 config flags to study" will scare users more than help them :) -Todd > > At 2018-08-02 01:06:40,"Todd Lipcon" wrote: > > On Wed, Aug 1, 2018 at 6:28 AM, Quanlong Huang > wrote: > >> Hi Todd and William, >> >> I'm really appreciated for your help and sorry for my late reply. I was >> going to reply with some follow-up questions but was assigned to focus some >> other works... Now I'm back to this work. >> >> The design docs are really helpful. Now I understand the flush and >> compaction. I think we can add a link to these design docs in the kudu >> documentation page, so users who want to dig deeper can know more about >> kudu internal. >> > > Personally, since starting the project, I have had the philosophy that the > user-facing documentation should remain simple and not discuss internals > too much. I found in some other open source projects that there isn't a > clear difference between user documentation and developer documentation, > and users can easily get confused by all of the internal details. Or, users > may start to believe that Kudu is very complex and they need to understand > knapsack problem approximation algorithms in order to operate it. So, > normally we try to avoid exposing too much of the details. > > That said, I think it is a good idea to add a small note in the > documentation somewhere that links to the design docs, maybe with some > sentence explaining that understanding internals is not necessary to > operate Kudu, but that expert users may find the internal design useful as > a reference? I would be curious to hear what other users think about how > best to make this trade-off. > > -Todd > > >> At 2018-06-15 23:41:17, "Todd Lipcon" wrote: >> >> Also, keep in mind that when the MRS flushes, it flushes into a bunch of >> separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by >> default). This is set by --budgeted_compaction_target_rowset_size >> >> However, increasing this size isn't likely to decrease the number of >> compactions, because each of these 32MB rowsets is non-overlapping. In >> other words, if your MRS contains rows A-Z, the output RowSets will include >> [A-C], [D-G], [H-P], [Q-Z]. Since these ranges do not overlap, they will >> never need to be compacted with each other. The net result, here, is that >> compaction becomes more fine-grained and only needs to operate on >> sub-ranges of the tablet where there is a lot of overlap. >> >> You can read more about this in docs/design-docs/compaction-policy.md, >> in particular the section "Limiting RowSet Sizes" >> >> Hope that helps >> -Todd >> >> On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley >> wrote: >> >>> The op seen in the logs is a rowset compaction, which takes existing >>> diskrowsets and rewrites them. 
It's not a flush, which writes data in >>> memory to disk, so I don't think the flush_threshold_mb is relevant. Rowset >>> compaction is done to reduce the amount of overlap of rowsets in primary >>> key space, i.e. reduce the number of rowsets that might need to be checked >>> to enforce the primary key constraint or find a row. Having lots of rowset >>> compaction indicates that rows are being written in a somewhat random order >>> w.r.t the primary key order. Kudu will perform much better as writes scale >>> when rows are inserted roughly in increasing order per tablet. >>> >>> Also, because you are using the log block manager (the default and only >>> one suitable for production deployments), there isn't a 1-1 relationship >>> between cfiles or diskrowsets and files on the filesystem. Many cfiles and >>> diskrowsets will be put together in a container file. >>> >>> Config parameters that might be relevant here: >>> --maintenance_manager_num_threads
Re: Re: Why RowSet size is much smaller than flush_threshold_mb
On Wed, Aug 1, 2018 at 6:28 AM, Quanlong Huang wrote: > Hi Todd and William, > > I'm really appreciated for your help and sorry for my late reply. I was > going to reply with some follow-up questions but was assigned to focus some > other works... Now I'm back to this work. > > The design docs are really helpful. Now I understand the flush and > compaction. I think we can add a link to these design docs in the kudu > documentation page, so users who want to dig deeper can know more about > kudu internal. > Personally, since starting the project, I have had the philosophy that the user-facing documentation should remain simple and not discuss internals too much. I found in some other open source projects that there isn't a clear difference between user documentation and developer documentation, and users can easily get confused by all of the internal details. Or, users may start to believe that Kudu is very complex and they need to understand knapsack problem approximation algorithms in order to operate it. So, normally we try to avoid exposing too much of the details. That said, I think it is a good idea to add a small note in the documentation somewhere that links to the design docs, maybe with some sentence explaining that understanding internals is not necessary to operate Kudu, but that expert users may find the internal design useful as a reference? I would be curious to hear what other users think about how best to make this trade-off. -Todd > At 2018-06-15 23:41:17, "Todd Lipcon" wrote: > > Also, keep in mind that when the MRS flushes, it flushes into a bunch of > separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by > default). This is set by --budgeted_compaction_target_rowset_size > > However, increasing this size isn't likely to decrease the number of > compactions, because each of these 32MB rowsets is non-overlapping. In > other words, if your MRS contains rows A-Z, the output RowSets will include > [A-C], [D-G], [H-P], [Q-Z]. Since these ranges do not overlap, they will > never need to be compacted with each other. The net result, here, is that > compaction becomes more fine-grained and only needs to operate on > sub-ranges of the tablet where there is a lot of overlap. > > You can read more about this in docs/design-docs/compaction-policy.md, in > particular the section "Limiting RowSet Sizes" > > Hope that helps > -Todd > > On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley > wrote: > >> The op seen in the logs is a rowset compaction, which takes existing >> diskrowsets and rewrites them. It's not a flush, which writes data in >> memory to disk, so I don't think the flush_threshold_mb is relevant. Rowset >> compaction is done to reduce the amount of overlap of rowsets in primary >> key space, i.e. reduce the number of rowsets that might need to be checked >> to enforce the primary key constraint or find a row. Having lots of rowset >> compaction indicates that rows are being written in a somewhat random order >> w.r.t the primary key order. Kudu will perform much better as writes scale >> when rows are inserted roughly in increasing order per tablet. >> >> Also, because you are using the log block manager (the default and only >> one suitable for production deployments), there isn't a 1-1 relationship >> between cfiles or diskrowsets and files on the filesystem. Many cfiles and >> diskrowsets will be put together in a container file. 
>> >> Config parameters that might be relevant here: >> --maintenance_manager_num_threads >> --fs_data_dirs (how many) >> --fs_wal_dir (is it shared on a device with the data dir?) >> >> The metrics from the compact row sets op indicates the time is spent in >> fdatasync and in reading (likely reading the original rowsets). The overall >> compaction time is kinda long but not crazy long. What's the performance >> you are seeing and what is the performance you would like to see? >> >> -Will >> >> On Fri, Jun 15, 2018 at 7:52 AM, Quanlong Huang >> wrote: >> >>> Hi all, >>> >>> I'm running kudu 1.6.0-cdh5.14.2. When looking into the logs of tablet >>> server, I find most of the compactions are compacting small files (~40MB >>> for each). For example: >>> >>> I0615 07:22:42.637351 30614 tablet.cc:1661] T >>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: >>> Compaction: stage 1 complete, picked 4 rowsets to compact >>> I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to >>> compact: >>> I0615 07:22:42.637393 30614 compaction.cc:906]
Re: "broadcast" tablet replication for kudu?
On Mon, Jul 23, 2018, 7:21 AM Boris Tyukin wrote: > Hi Todd, > > Are you saying that your earlier comment below is not longer valid with > Impala 2.11 and if I replicate a table to all our Kudu nodes Impala can > benefit from this? > No, the earlier comment is still valid. Just saying that in some cases exchange can be faster in the new Impala version. > " > *It's worth noting that, even if your table is replicated, Impala's > planner is unaware of this fact and it will give the same plan regardless. > That is to say, rather than every node scanning its local copy, instead a > single node will perform the whole scan (assuming it's a small table) and > broadcast it from there within the scope of a single query. So, I don't > think you'll see any performance improvements on Impala queries by > attempting something like an extremely high replication count.* > > *I could see bumping the replication count to 5 for these tables since the > extra storage cost is low and it will ensure higher availability of the > important central tables, but I'd be surprised if there is any measurable > perf impact.* > " > > On Mon, Jul 23, 2018 at 9:46 AM Todd Lipcon wrote: > >> Are you on the latest release of Impala? It switched from using Thrift >> for RPC to a new implementation (actually borrowed from kudu) which might >> help broadcast performance a bit. >> >> Todd >> >> On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin wrote: >> >>> sorry to revive the old thread but I am curious if there is a good way >>> to speed up requests to frequently used tables in Kudu. >>> >>> On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin >>> wrote: >>> >>>> bummer..After reading your guys conversation, I wish there was an >>>> easier way...we will have the same issue as we have a few dozens of tables >>>> which are used very frequently in joins and I was hoping there was an easy >>>> way to replicate them on most of the nodes to avoid broadcasts every time >>>> >>>> On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick < >>>> cresn...@mediamath.com> wrote: >>>> >>>>> The table in our case is 12x hashed and ranged by month, so the >>>>> broadcasts were often to all (12) nodes. >>>>> >>>>> On Apr 12, 2018 12:58 AM, Mauricio Aristizabal >>>>> wrote: >>>>> Sorry I left that out Cliff, FWIW it does seem to have been >>>>> broadcast.. >>>>> >>>>> >>>>> >>>>> Not sure though how a shuffle would be much different from a broadcast >>>>> if entire table is 1 file/block in 1 node. >>>>> >>>>> On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick >>>>> wrote: >>>>> >>>>>> From the screenshot it does not look like there was a broadcast of >>>>>> the dimension table(s), so it could be the case here that the multiple >>>>>> smaller sends helps. Our dim tables are generally in the single-digit >>>>>> millions and Impala chooses to broadcast them. Since the fact result >>>>>> cardinality is always much smaller, we've found that forcing a [shuffle] >>>>>> dimension join is actually faster since it only sends dims once rather >>>>>> than >>>>>> all to all nodes. The degenerative performance of broadcast is especially >>>>>> obvious when the query returns zero results. I don't have much experience >>>>>> here, but it does seem that Kudu's efficient predicate scans can >>>>>> sometimes >>>>>> "break" Impala's query plan. 
>>>>>> >>>>>> -Cliff >>>>>> >>>>>> On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal < >>>>>> mauri...@impact.com> wrote: >>>>>> >>>>>>> @Todd not to belabor the point, but when I suggested breaking up >>>>>>> small dim tables into multiple parquet files (and in this thread's >>>>>>> context >>>>>>> perhaps partition kudu table, even if small, into multiple tablets), it >>>>>>> was >>>>>>> to speed up joins/exchanges, not to parallelize the scan. >>>>>>> >>>>>>> For example recently we ran into this slow query where the 14M >>>>>>> record dimension fit into a si
Re: "broadcast" tablet replication for kudu?
Impala 2.12. The external RPC protocol is still Thrift. Todd On Mon, Jul 23, 2018, 7:02 AM Clifford Resnick wrote: > Is this impala 3.0? I’m concerned about breaking changes and our RPC to > Impala is thrift-based. > > From: Todd Lipcon > Reply-To: "user@kudu.apache.org" > Date: Monday, July 23, 2018 at 9:46 AM > To: "user@kudu.apache.org" > Subject: Re: "broadcast" tablet replication for kudu? > > Are you on the latest release of Impala? It switched from using Thrift for > RPC to a new implementation (actually borrowed from kudu) which might help > broadcast performance a bit. > > Todd > > On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin wrote: > >> sorry to revive the old thread but I am curious if there is a good way to >> speed up requests to frequently used tables in Kudu. >> >> On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin >> wrote: >> >>> bummer..After reading your guys conversation, I wish there was an easier >>> way...we will have the same issue as we have a few dozens of tables which >>> are used very frequently in joins and I was hoping there was an easy way to >>> replicate them on most of the nodes to avoid broadcasts every time >>> >>> On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick < >>> cresn...@mediamath.com> wrote: >>> >>>> The table in our case is 12x hashed and ranged by month, so the >>>> broadcasts were often to all (12) nodes. >>>> >>>> On Apr 12, 2018 12:58 AM, Mauricio Aristizabal >>>> wrote: >>>> Sorry I left that out Cliff, FWIW it does seem to have been broadcast.. >>>> >>>> >>>> >>>> Not sure though how a shuffle would be much different from a broadcast >>>> if entire table is 1 file/block in 1 node. >>>> >>>> On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick >>>> wrote: >>>> >>>>> From the screenshot it does not look like there was a broadcast of the >>>>> dimension table(s), so it could be the case here that the multiple smaller >>>>> sends helps. Our dim tables are generally in the single-digit millions and >>>>> Impala chooses to broadcast them. Since the fact result cardinality is >>>>> always much smaller, we've found that forcing a [shuffle] dimension join >>>>> is >>>>> actually faster since it only sends dims once rather than all to all >>>>> nodes. >>>>> The degenerative performance of broadcast is especially obvious when the >>>>> query returns zero results. I don't have much experience here, but it does >>>>> seem that Kudu's efficient predicate scans can sometimes "break" Impala's >>>>> query plan. >>>>> >>>>> -Cliff >>>>> >>>>> On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal < >>>>> mauri...@impact.com> wrote: >>>>> >>>>>> @Todd not to belabor the point, but when I suggested breaking up >>>>>> small dim tables into multiple parquet files (and in this thread's >>>>>> context >>>>>> perhaps partition kudu table, even if small, into multiple tablets), it >>>>>> was >>>>>> to speed up joins/exchanges, not to parallelize the scan. >>>>>> >>>>>> For example recently we ran into this slow query where the 14M record >>>>>> dimension fit into a single file & block, so it got scanned on a single >>>>>> node though still pretty quickly (300ms), however it caused the join to >>>>>> take 25+ seconds and bogged down the entire query. See highlighted >>>>>> fragment and its parent. >>>>>> >>>>>> So we broke it into several small files the way I described in my >>>>>> previous post, and now join and query are fast (6s). 
>>>>>> >>>>>> -m >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Fri, Mar 16, 2018 at 3:55 PM, Todd Lipcon >>>>>> wrote: >>>>>> >>>>>>> I suppose in the case that the dimension table scan makes a >>>>>>> non-trivial portion of your workload time, then yea, parallelizing the >>>>>>> scan >>>>>>> as you suggest would be beneficial. That said,
Re: "broadcast" tablet replication for kudu?
Are you on the latest release of Impala? It switched from using Thrift for RPC to a new implementation (actually borrowed from kudu) which might help broadcast performance a bit. Todd On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin wrote: > sorry to revive the old thread but I am curious if there is a good way to > speed up requests to frequently used tables in Kudu. > > On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin > wrote: > >> bummer..After reading your guys conversation, I wish there was an easier >> way...we will have the same issue as we have a few dozens of tables which >> are used very frequently in joins and I was hoping there was an easy way to >> replicate them on most of the nodes to avoid broadcasts every time >> >> On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick > > wrote: >> >>> The table in our case is 12x hashed and ranged by month, so the >>> broadcasts were often to all (12) nodes. >>> >>> On Apr 12, 2018 12:58 AM, Mauricio Aristizabal >>> wrote: >>> Sorry I left that out Cliff, FWIW it does seem to have been broadcast.. >>> >>> >>> >>> Not sure though how a shuffle would be much different from a broadcast >>> if entire table is 1 file/block in 1 node. >>> >>> On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick wrote: >>> >>>> From the screenshot it does not look like there was a broadcast of the >>>> dimension table(s), so it could be the case here that the multiple smaller >>>> sends helps. Our dim tables are generally in the single-digit millions and >>>> Impala chooses to broadcast them. Since the fact result cardinality is >>>> always much smaller, we've found that forcing a [shuffle] dimension join is >>>> actually faster since it only sends dims once rather than all to all nodes. >>>> The degenerative performance of broadcast is especially obvious when the >>>> query returns zero results. I don't have much experience here, but it does >>>> seem that Kudu's efficient predicate scans can sometimes "break" Impala's >>>> query plan. >>>> >>>> -Cliff >>>> >>>> On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal < >>>> mauri...@impact.com> wrote: >>>> >>>>> @Todd not to belabor the point, but when I suggested breaking up small >>>>> dim tables into multiple parquet files (and in this thread's context >>>>> perhaps partition kudu table, even if small, into multiple tablets), it >>>>> was >>>>> to speed up joins/exchanges, not to parallelize the scan. >>>>> >>>>> For example recently we ran into this slow query where the 14M record >>>>> dimension fit into a single file & block, so it got scanned on a single >>>>> node though still pretty quickly (300ms), however it caused the join to >>>>> take 25+ seconds and bogged down the entire query. See highlighted >>>>> fragment and its parent. >>>>> >>>>> So we broke it into several small files the way I described in my >>>>> previous post, and now join and query are fast (6s). >>>>> >>>>> -m >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Fri, Mar 16, 2018 at 3:55 PM, Todd Lipcon >>>>> wrote: >>>>> >>>>>> I suppose in the case that the dimension table scan makes a >>>>>> non-trivial portion of your workload time, then yea, parallelizing the >>>>>> scan >>>>>> as you suggest would be beneficial. That said, in typical analytic >>>>>> queries, >>>>>> scanning the dimension tables is very quick compared to scanning the >>>>>> much-larger fact tables, so the extra parallelism on the dim table scan >>>>>> isn't worth too much. 
>>>>>> >>>>>> -Todd >>>>>> >>>>>> On Fri, Mar 16, 2018 at 2:56 PM, Mauricio Aristizabal < >>>>>> mauri...@impactradius.com> wrote: >>>>>> >>>>>>> @Todd I know working with parquet in the past I've seen small >>>>>>> dimensions that fit in 1 single file/block limit parallelism of >>>>>>> join/exchange/aggregation nodes, and I've forced those dims to spread >>>>>>> across 20 or so blocks by leveraging SET PA
Re: Deleting from kudu table issue
Are you sure that the primary keys of the data that you are inserting are actually unique? You will see this error if your dataset has duplicates for that column. Perhaps you can share "show create table odata_uat.od_policy_fee;"? On Wed, Jul 11, 2018 at 8:16 PM, wangji...@ehuatai.com < wangji...@ehuatai.com> wrote: > Hi there ! > I'm facing a problem while using kudu with impala .I run a SQL script on > impala which includes these operations : > 1.Delete from odata_uat.od_policy_fee ; > 2.insert into odata_uat.od_policy_fee select * > from odata_uat.od_policy_fee_his; > When i'm doing the second step to insert data into the table which is > supposed to be empty,a warning comes : > > WARNINGS: Key already present in Kudu table 'impala::odata_uat.od_policy_fee'. > (1 of 538 similar) > > which means I lost 1 record from odata_uat.od_policy_fee_his .This > happens occasionally and I dunno what cause . > > environment: CDH 5.12.0kudu 1.4.0 > > I have 1 kudu master with 16 cores and 128g RAM each, 4 TServers with 16 > cores with 256g RAM each. > > how can I fix this or avoid it . > > Thanks ! > > Best regards . > > > -- > > wang jiaxi > -- Todd Lipcon Software Engineer, Cloudera
Re: spark on kudu performance!
On Mon, Jun 11, 2018 at 5:52 AM, fengba...@uce.cn wrote: > Hi: > > I use kudu official website development documents, use > spark analysis kudu data(kudu's version is 1.6.0): > > the official code is : > *val df = sqlContext.read.options(Map("kudu.master" -> > "kudu.master:7051","kudu.table" -> "kudu_table")).kudu // Query using the > Spark API... df.select("id").filter("id" >= 5).show()* > > > My question is : > (1)If I use the official website code, when creating > data collection of df, the data of my table is about 1.8 > billion, and then the filter of df is performed. This is > equivalent to loading 1.8 billion data into memory each > time, and the performance is very poor. > That's not correct. Data frames are lazy-evaluated, so when you use a filter like the above, it does not fully materialize the whole data frame into memory before it begins to filter. You can also use ".explain()" to see whether the filter you are specifying is getting pushed down properly to Kudu. > > (2)Create a time-based range partition on the 1.8 billion > table, and then directly use the underlying java api,scan > partition to analyze, this is not the amount of data each > time loading is the specified number of partitions instead > of 1.8 billion data? > > Please give me some suggestions, thanks! > > The above should happen automatically so long as the filter predicate has been pushed down. Using 'explain()' and showing us the results, along with the code you used to create your table, will help understand what might be the problem with performance. -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: Adding a new kudu master
On Mon, Jul 2, 2018 at 4:36 AM, Sergejs Andrejevs wrote: > Hi there, > > > > I am very glad Kudu is evolving so rapidly. Thanks for your contributions! > > > > I have faced with a challenge: > > I need to upgrade (reinstall) prod servers, where 3 Kudu masters are > running. What would be the best way to do it from Kudu perspective? > Will you be changing the hostnames of the servers or just reinstalling the OS but otherwise keeping the same configuration? Have you partitioned the servers with separate OS and data disks? If so, you can likely just reinstall the OS without reformatting the data disks. When the OS has been reinstalled, simply install Kudu again, use the same configuration as before to point at the existing data directories, and everything should be fine. > If it is not officially supported yet, could you advise a way, which > minimizes the risks? > > > > Environment/conditions: > > Cloudera 5.14 > > Kudu 1.6 > > High-level procedure: remove 1 server from cluster, upgrade, return back > to CM cluster, check, proceed with the next server. > > Some downtime is possible (let’s say < 1h) > I can't give any particular advice on how this might interact with Cloudera Manager. I think the Cloudera community forum probably is a more appropriate spot for that. But, from a Kudu-only perspective, it should be fine to have a mixed-OS cluster where one master has been upgraded before the others. Just keep the data around when you reinstall. > > > Approach: > > I have already tried out at test cluster the steps, which were used to > migrate from a single-master to multi-master cluster (see the plan below). > However, there was a remark not to use it in order to add new nodes for 3+ > master cluster. > Therefore, what could be an alternative way? If no alternatives, what > could be the extra steps to pay additional attention to check the status if > Kudu cluster is in a good shape? > Any comments/suggestions are extremely appreciated as well. > > > > Current plan: > > 0. Cluster check > > 1. Stop all masters (let’s call them master-1, master-2, master-3). > > 2. Remove from CM one Kudu master, e.g. master-3. > > 3. Update raft meta by removing “master-3” from Kudu cluster (to be > able to restart Kudu): > *sudo -u kudu kudu local_replica cmeta rewrite_raft_config > 1234567890:master-1:7051 > 0987654321:master-2:7051* > By the way, do I understand right that tablet_id > is a special, containing cluster meta > info? > > 4. Start all masters. From now Kudu temporary consists of 2 masters. > Why bother removing the master that's down? If you can keep its data around, and it will come back with the same hostname, there's no need to remove it. You could simply shut down the node and be running with 2/3 servers up, which would give you the same reliability as using 2/2 without the extra steps. > 5. Cluster check. > > 6. Upgrade the excluded server > > 7. Stop all masters. > > 8. Prepare “master-3” as Kudu master: > > *sudo -u kudu kudu fs format --fs_wal_dir=… --fs_data_dirs=… sudo -u kudu > kudu fs dump uuid --fs_wal_dir=… --fs_data_dirs=… 2>/dev/null* > Let’s say obtained id is 77. > Add master-3 to CM. > > 9. Run metainfo update at existing masters, i.e. master-1 and > master-2: > > *sudo -u kudu kudu local_replica cmeta rewrite_raft_config > 1234567890:master-1:7051 > 0987654321:master-2:7051 77:master-3:7051* > > 10. Start one master, e.g. master-1. 
> > Copy the current cluster state from master-1 to master-3: > *sudo -u kudu kudu local_replica copy_from_remote --fs_wal_dir=… > --fs_data_dirs=… 1234567890:master-1:7051* > > 11. Start remaining Kudu masters: master-2 and master-3. > > 12. Cluster check. > > > > * Optionally, at first there may be added 1 extra node (to increase from 3 > to 4 the initial number of Kudu masters, so that after removal of 1 node > there are still HA with quorum of 3 masters). In this case steps 7-12 > should be repeated and additionally HiveMetaStore update should be executed: > > *UPDATE hive_meta_store_database.TABLE_PARAMS* > > *SET PARAM_VALUE = 'master-1,master-2,master-3'* > > *WHERE PARAM_KEY = 'kudu.master_addresses' AND PARAM_VALUE = > 'master-1,master-2,master-3,master-4';* > > After upgrades, the master-4 node to be removed by running > steps 1-5. > > > > Thanks! > > > > Best regards, > > *Sergejs Andrejevs* > > Information about how we process personal data > <http://www.intrum.com/privacy> > > > -- Todd Lipcon Software Engineer, Cloudera
Re: kudu Insert、Update、Delete operating data lost
Hi, I'm having trouble understanding your question. Can you give an example of the operations you are trying and why you believe data is being lost? -Todd On Thu, Jun 14, 2018 at 8:24 PM, 秦坤 wrote: > hello: > I use java scan api to operate kudu in large batches > If a session contains Insert, Update, Delete operations, if > the database does not exist in the data there will be > some new data loss, how to avoid such problems. > -- Todd Lipcon Software Engineer, Cloudera
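For anyone trying to put together the kind of reproducible example Todd is asking for, below is a minimal Java sketch (the table and column names are made up) that applies an insert, an update, and a delete through a single session and then inspects every per-operation response. Row errors that are never checked at this layer are one common way writes can appear to be silently lost.

import java.util.List;
import org.apache.kudu.client.*;

public class MixedOpsWithErrorChecking {
  public static void main(String[] args) throws KuduException {
    try (KuduClient client =
             new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
      KuduTable table = client.openTable("example_table");   // hypothetical table
      KuduSession session = client.newSession();
      session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);

      Insert insert = table.newInsert();
      insert.getRow().addLong("id", 1L);                      // hypothetical columns
      insert.getRow().addString("val", "a");
      session.apply(insert);

      Update update = table.newUpdate();
      update.getRow().addLong("id", 2L);
      update.getRow().addString("val", "b");
      session.apply(update);

      Delete delete = table.newDelete();
      delete.getRow().addLong("id", 3L);
      session.apply(delete);

      // Flush and inspect every per-row result; errors dropped here are a
      // common reason writes seem to go missing.
      List<OperationResponse> responses = session.flush();
      for (OperationResponse resp : responses) {
        if (resp.hasRowError()) {
          System.err.println("Row error: " + resp.getRowError());
        }
      }
      session.close();
    }
  }
}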
Re: Why RowSet size is much smaller than flush_threshold_mb
1556616208384})}] >> I0615 07:22:55.189757 30614 tablet.cc:1631] T >> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: >> Compaction successful on 82987 rows (123387929 bytes) >> I0615 07:22:55.191426 30614 maintenance_manager.cc:491] Time spent >> running CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba): real 12.628s user >> 1.460s sys 0.410s >> I0615 07:22:55.191484 30614 maintenance_manager.cc:497] P >> 70f3e54fe0f3490cbf0371a6830a33a7: >> CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba) >> metrics: {"cfile_cache_hit":812,"cfile_cache_hit_bytes":16840376,"cfi >> le_cache_miss":2730,"cfile_cache_miss_bytes":251298442,"cfile_init":496,"data >> dirs.queue_time_us":6646,"data dirs.run_cpu_time_us":2188,"data >> dirs.run_wall_time_us":101717,"fdatasync":315,"fdatasync_us" >> :9617174,"lbm_read_time_us":1288971,"lbm_reads_1-10_ms >> <https://maps.google.com/?q=1-10_ms+:+32&entry=gmail&source=g>":32," >> lbm_reads_10-100_ms":41,"lbm_reads_lt_1ms":4641,"lbm_write_ >> time_us":122520,"lbm_writes_lt_1ms":2799,"mutex_wait_us": >> 25,"spinlock_wait_cycles":155264,"tcmalloc_contention_ >> cycles":768,"thread_start_us":677,"threads_started":14,"wal- >> append.queue_time_us":300} >> >> The flush_threshold_mb is set in the default value (1024). Wouldn't the >> flushed file size be ~1GB? >> >> I think increasing the initial RowSet size can reduce compactions and >> then reduce the impact of other ongoing operations. It may also improve the >> flush performance. Is that right? If so, how can I increase the RowSet size? >> >> I'd be grateful if someone can make me clear about these! >> >> Thanks, >> Quanlong >> > > -- Todd Lipcon Software Engineer, Cloudera
Re: Java KuduClient table references
Hi Philipp, What you're seeing is expected -- internally, after you open a table, the client will use that table's internal UUID rather than its name to perform operations. This is similar semantics to opening a file and then renaming it on a POSIX filesystem. Creating a new KuduClient instance should definitely result in new Table instances. There is no singleton or cross-client-instance caching going on. In fact, even re-opening the table from an existing client instance should return a new KuduTable instance, since we don't cache the table->tableId mapping as far as I can recall. Are you sure that you properly called 'openTable' to get a new KuduTable instance, and are using that new KuduTable to create the new scanners? -Todd On Tue, Jun 5, 2018 at 1:50 AM, Philipp-A Hoffmann < philipp-a.hoffm...@postbank.de> wrote: > Hi there, > > I'm experiencing the following issue with the Java Kudu API: > > If I'm re-creating my tables (no schema changes) in Impala while my Kudu > application is running, Kudu keeps the old table references. That results > in a warning ("org.apache.kudu.client.AsyncKuduScanner: Can not open > scanner") followed by an exception ("NonRecoverableException: The table was > deleted: Table deleted at ...") while accessing the table. I've tried to > create a new KuduClient instance using the KuduClientBuilder, but it's > still getting the old table references. Is there any way to clear these > references? > > Kind regards, > Philipp > > -- Todd Lipcon Software Engineer, Cloudera
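A short sketch of the pattern Todd describes, using a hypothetical table name: after the table has been dropped and re-created, call openTable again and build scanners only from the freshly returned KuduTable handle, since any handle obtained earlier still refers to the deleted table's internal UUID.

import java.util.Arrays;
import org.apache.kudu.client.*;

public class ReopenTable {
  public static void main(String[] args) throws KuduException {
    try (KuduClient client =
             new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
      // Fetch a fresh handle after the table was re-created (e.g. via Impala);
      // a KuduTable obtained before the re-create still points at the old UUID.
      KuduTable table = client.openTable("impala::default.my_table");

      // Build scanners only from the new handle.
      KuduScanner scanner = client.newScannerBuilder(table)
          .setProjectedColumnNames(Arrays.asList("id"))        // hypothetical column
          .build();
      while (scanner.hasMoreRows()) {
        for (RowResult row : scanner.nextRows()) {
          System.out.println(row.getLong("id"));
        }
      }
      scanner.close();
    }
  }
}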
Re: How to install the latest kudu release and find the compatible Impala versions?
On Mon, May 21, 2018 at 4:37 PM, Quanlong Huang wrote: > Hi friends, > > We're trying to benchmark Impala+kudu to compare with other lambda > architectures like Druid. So we hope we can install the latest release > version of Impala (2.12.0) and kudu (1.7.0). However, when following the > installation guide in https://kudu.apache.org/docs/installation.html, we > can only install kudu-1.6.0-cdh5.14.2. Is it possible to install kudu-1.7 > without manual compilation? > That's right -- the installation guide there is just provided as a convenience link to a vendor who provides some binary artifacts. The Apache Kudu project itself only releases source artifacts at this point in time. You'll need to compile manually if you want a binary artifact for your particular operating system. It looks like someone on github has made RPMs available here: https://github.com/MartinWeindel/kudu-rpm . Perhaps this would work for your system? However, note that, per my email on the Impala list, impalad needs to have a libkudu_client from the 'native-toolchain' project so that it is built with the same toolchain as Impala. So, you'll want to use the kudu client bundled with your Impala build and point it at the Kudu server from your own build or the above RPM. > > Besides, I notice that Impala-2.5 is not compatible with kudu-1.6.0 since > the CREATE TABLE syntax for kudu is not recognized. Here's the error: > > Query: create TABLE my_first_table > ( > id BIGINT, > name STRING, > PRIMARY KEY(id) > ) > PARTITION BY HASH PARTITIONS 16 > STORED AS KUDU > TBLPROPERTIES ( > 'kudu.master_addresses' = 'lascorehadoop-15d26' > ) > ERROR: AnalysisException: Syntax error in line 5: > PRIMARY KEY(id) > ^ > Encountered: IDENTIFIER > Expected: ARRAY, BIGINT, BINARY, BOOLEAN, CHAR, DATE, DATETIME, DECIMAL, > REAL, FLOAT, INTEGER, MAP, SMALLINT, STRING, STRUCT, TIMESTAMP, TINYINT, > VARCHAR > > CAUSED BY: Exception: Syntax error > > > Right, I don't recall whether Impala 2.5 supported Kudu at all. If it did, it was a very early version, and the syntax has since changed. For the purpose of benchmarks I would definitely recommend using the latest versions available. > My further questions are > >- Is there a compatibility matrix for Impala and Kudu? > > We don't maintain any such matrix as part of the Apache projects. Doing so would require a lot of testing of multiple versions and it's enough of a time commitment that I don't think anyone has put in the work outside of commercial "downstream" vendors. > >- Is Impala-2.12 compatible with Kudu-1.6.0 and Kudu-1.7.0? > > Kudu itself has maintained wire compatibility, so you should be able to point an Impala 2.12 cluster at either Kudu 1.6 or Kudu 1.7 clusters with success. As above, you'll need to make sure you're using the libkudu_client.so that's built with Impala's toolchain to avoid ABI-related crashes, but that's not a compatibility issue so much as a quirk of how Impala's build works. -Todd
Re: will upsert have bad effect on scan performace?
Hi Andy, An upsert of a row that does not exist is exactly the same as an insert. You can think of upsert as: try { insert the row } catch (Already Exists) { update the row } In reality, the conversion from insert to update is a bit more efficient compared to doing the above yourself (and it's atomic). But, in terms of performance, once the row has been inserted, it is the same as any other row. -Todd On Mon, May 21, 2018 at 3:14 AM, Andy Liu wrote: > Thanks in advance. > hi, i have used java upsert api to load data instead of insert api. > will it have a bad effect even though these data were firstly loaded. > i do not know compaction mechanism of kudu, will it lead to many > compaction, thus lead to bad scan performance. > > Best regards. > -- Todd Lipcon Software Engineer, Cloudera
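A minimal Java sketch of that pattern, with a hypothetical table and columns. The same code path covers both the initial load and later reloads, since an upsert of a new key behaves as an insert and an upsert of an existing key behaves as an update:

import org.apache.kudu.client.*;

public class UpsertSketch {
  public static void main(String[] args) throws KuduException {
    try (KuduClient client =
             new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
      KuduTable table = client.openTable("example_table");   // hypothetical table
      KuduSession session = client.newSession();             // default AUTO_FLUSH_SYNC

      Upsert upsert = table.newUpsert();
      PartialRow row = upsert.getRow();
      row.addLong("id", 42L);                                 // hypothetical key column
      row.addString("value", "hello");                        // hypothetical value column

      // In AUTO_FLUSH_SYNC mode apply() returns the response for this operation.
      OperationResponse resp = session.apply(upsert);
      if (resp.hasRowError()) {
        System.err.println("Upsert failed: " + resp.getRowError());
      }
      session.close();
    }
  }
}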
Re: scan performance super bad
t;, > PARTITION "61" <= VALUES < "615000", > PARTITION "615000" <= VALUES < "62", > PARTITION "62" <= VALUES < "625000", > PARTITION "625000" <= VALUES < "63", > PARTITION "63" <= VALUES < "635000", > PARTITION "635000" <= VALUES < "64", > PARTITION "64" <= VALUES < "645000", > PARTITION "645000" <= VALUES < "65", > PARTITION "65" <= VALUES < "655000", > PARTITION "655000" <= VALUES < "66", > PARTITION "66" <= VALUES < "665000", > PARTITION "665000" <= VALUES < "67", > PARTITION "67" <= VALUES < "675000", > PARTITION "675000" <= VALUES < "68", > PARTITION "68" <= VALUES < "685000", > PARTITION "685000" <= VALUES < "69", > PARTITION "69" <= VALUES < "695000", > PARTITION "695000" <= VALUES < "70", > PARTITION "70" <= VALUES < "705000", > PARTITION "705000" <= VALUES < "71", > PARTITION "71" <= VALUES < "715000", > PARTITION "715000" <= VALUES < "72", > PARTITION "72" <= VALUES < "725000", > PARTITION "725000" <= VALUES < "73", > PARTITION "73" <= VALUES < "735000", > PARTITION "735000" <= VALUES < "74", > PARTITION "74" <= VALUES < "745000", > PARTITION "745000" <= VALUES < "75", > PARTITION "75" <= VALUES < "755000", > PARTITION "755000" <= VALUES < "76", > PARTITION "76" <= VALUES < "765000", > PARTITION "765000" <= VALUES < "77", > PARTITION "77" <= VALUES < "775000", > PARTITION "775000" <= VALUES < "78", > PARTITION "78" <= VALUES < "785000", > PARTITION "785000" <= VALUES < "79", > PARTITION "79" <= VALUES < "795000", > PARTITION "795000" <= VALUES < "80", > PARTITION "80" <= VALUES < "805000", > PARTITION "805000" <= VALUES < "81", > PARTITION "81" <= VALUES < "815000", > PARTITION "815000" <= VALUES < "82", > PARTITION "82" <= VALUES < "825000", > PARTITION "825000" <= VALUES < "83", > PARTITION "83" <= VALUES < "835000", > PARTITION "835000" <= VALUES < "84", > PARTITION "84" <= VALUES < "845000", > PARTITION "845000" <= VALUES < "85", > PARTITION "85" <= VALUES < "855000", > PARTITION "855000" <= VALUES < "86", > PARTITION "86" <= VALUES < "865000", > PARTITION "865000" <= VALUES < "87", > PARTITION "87" <= VALUES < "875000", > PARTITION "875000" <= VALUES < "88", > PARTITION "88" <= VALUES < "885000", > PARTITION "885000" <= VALUES < "89", > PARTITION "89" <= VALUES < "895000", > PARTITION "895000" <= VALUES < "90", > PARTITION "90" <= VALUES < "905000", > PARTITION "905000" <= VALUES < "91", > PARTITION "91" <= VALUES < "915000", > PARTITION "915000" <= VALUES < "92", > PARTITION "92" <= VALUES < "925000", > PARTITION "925000" <= VALUES < "93", > PARTITION "93" <= VALUES < "935000", > PARTITION "935000" <= VALUES < "94", > PARTITION "94" <= VALUES < "945000", > PARTITION "945000" <= VALUES < "95", > PARTITION "95" <= VALUES < "955000", > PARTITION "955000" <= VALUES < "96", > PARTITION "96" <= VALUES < "965000", > PARTITION "965000" <= VALUES < "97", > PARTITION "97" <= VALUES < "975000", > PARTITION "975000" <= VALUES < "98", > PARTITION "98" <= VALUES < "985000", > PARTITION "985000" <= VALUES < "99", > PARTITION "99" <= VALUES < "995000", > PARTITION VALUES >= "995000" > ) > > > So it looks like you have a numeric value being stored here in the string column. Are you sure that you are properly zero-padding when creating your key? For example if you accidentally scan from "50_..." to "80_..." you will end up scanning a huge portion of your table. > i did not delete rows in this table ever. 
> > my scanner code is below: > buildKey method will build the lower bound and the upper bound, the unique > id is same, the startRow offset(third part) is 0, and the endRow offset is > , startRow and endRow only differs from time. > though the max offset is big(999), generally it is less than 100. > > private KuduScanner buildScanner(Metric startRow, Metric endRow, > List dimensionIds, List dimensionFilterList) { > KuduTable kuduTable = > kuduService.getKuduTable(BizConfig.parseFrom(startRow.getBizId())); > > PartialRow lower = kuduTable.getSchema().newPartialRow(); > lower.addString("key", buildKey(startRow)); > PartialRow upper = kuduTable.getSchema().newPartialRow(); > upper.addString("key", buildKey(endRow)); > > LOG.info("build scanner. lower = {}, upper = {}", buildKey(startRow), > buildKey(endRow)); > > KuduScanner.KuduScannerBuilder builder = > kuduService.getKuduClient().newScannerBuilder(kuduTable); > builder.setProjectedColumnNames(COLUMNS); > builder.lowerBound(lower); > builder.exclusiveUpperBound(upper); > builder.prefetching(true); > builder.batchSizeBytes(MAX_BATCH_SIZE); > > if (CollectionUtils.isNotEmpty(dimensionFilterList)) { > for (int i = 0; i < dimensionIds.size() && i < MAX_DIMENSION_NUM; > i++) { > for (DimensionFilter dimensionFilter : dimensionFilterList) { > if (!Objects.equals(dimensionFilter.getDimensionId(), > dimensionIds.get(i))) { > continue; > } > ColumnSchema columnSchema = > kuduTable.getSchema().getColumn(String.format("dimension_%02d", i)); > KuduPredicate predicate = buildKuduPredicate(columnSchema, > dimensionFilter); > if (predicate != null) { > builder.addPredicate(predicate); > LOG.info("add predicate. predicate = {}", > predicate.toString()); > } > } > } > } > return builder.build(); > } > > What client version are you using? 1.7.0? > i checked the metrics, only get content below, it seems no relationship > with my table. > Looks like you got the metrics from the kudu master, not a tablet server. You need to figure out which tablet server you are scanning and grab the metrics from that one. -Todd -- Todd Lipcon Software Engineer, Cloudera
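To illustrate the zero-padding point, here is a small sketch of key construction; the field widths and separator are assumptions based on the example key earlier in the thread. Padding keeps the string ordering of the key column consistent with the numeric ordering of its parts, so range scan bounds stay tight:

public class KeyPadding {
  // Builds keys like "001320_201803220420_0001"; string comparison on the key
  // column then matches the intended id/time/offset ordering. The widths are
  // assumptions based on the example key in this thread.
  static String buildKey(long uniqueId, String timeBucket, int offset) {
    return String.format("%06d_%s_%04d", uniqueId, timeBucket, offset);
  }

  public static void main(String[] args) {
    // Without padding, "50..." sorts after "100..." and a range scan can
    // cover far more of the table than intended.
    System.out.println(buildKey(1320, "201803220420", 1));    // 001320_201803220420_0001
    System.out.println(buildKey(1320, "201803230420", 9999)); // 001320_201803230420_9999
  }
}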
Re: Re: Issue in data loading in Impala + Kudu
gt;> @ 0x1bbcb31 impala::PrintErrorMap() >> >> @ 0x1bbcd07 impala::PrintErrorMapToString() >> >> @ 0x2decbd7 impala::Coordinator::GetErrorLog() >> >> @ 0x1a8d634 impala::ImpalaServer::UnregisterQuery() >> >> @ 0x1b29264 impala::ImpalaServer::CloseOperation() >> >> @ 0x2c5ce86 apache::hive::service::cli::th >> rift::TCLIServiceProcessor::process_CloseOperation() >> >> @ 0x2c56b8c apache::hive::service::cli::th >> rift::TCLIServiceProcessor::dispatchCall() >> >> @ 0x2c2fcb1 impala::ImpalaHiveServer2Servi >> ceProcessor::dispatchCall() >> >> @ 0x16fdb20 apache::thrift::TDispatchProcessor::process() >> >> @ 0x18ea6b3 apache::thrift::server::TAccep >> tQueueServer::Task::run() >> >> @ 0x18e2181 impala::ThriftThread::RunRunnable() >> >> @ 0x18e3885 boost::_mfi::mf2<>::operator()() >> >> @ 0x18e371b boost::_bi::list3<>::operator()<>() >> >> @ 0x18e3467 boost::_bi::bind_t<>::operator()() >> >> @ 0x18e337a boost::detail::function::void_ >> function_obj_invoker0<>::invoke() >> >> @ 0x192761c boost::function0<>::operator()() >> >> @ 0x1c3ebf7 impala::Thread::SuperviseThread() >> >> @ 0x1c470cd boost::_bi::list5<>::operator()<>() >> >> @ 0x1c46ff1 boost::_bi::bind_t<>::operator()() >> >> @ 0x1c46fb4 boost::detail::thread_data<>::run() >> >> @ 0x2eedb4a thread_proxy >> >> @ 0x7fda1dbb16ba start_thread >> >> @ 0x7fda1d8e741d clone >> >> Wrote minidump to /tmp/minidumps/impalad/a9113d9 >> b-bc3d-488a-1feebf9b-47b42022.dmp >> >> >> >> *impalad.FATAL* >> >> >> >> Log file created at: 2018/05/07 09:46:12 >> >> Running on machine: slave2 >> >> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg >> >> F0507 09:46:12.673912 29258 error-util.cc:148] Check failed: >> log_entry.count > 0 (-1831809966 vs. 0) >> >> >> >> *Impalad.INFO* >> >> edentials={real_user=root}} blocked reactor thread for 34288.6us >> >> I0507 09:38:14.943245 29882 outbound_call.cc:288] RPC callback for RPC >> call kudu.tserver.TabletServerService.Write -> {remote=136.243.74.42:7050 >> (slave5), user_credentials={real_user=root}} blocked reactor thread for >> 35859.8us >> >> I0507 09:38:15.942150 29882 outbound_call.cc:288] RPC callback for RPC >> call kudu.tserver.TabletServerService.Write -> {remote=136.243.74.42:7050 >> (slave5), user_credentials={real_user=root}} blocked reactor thread for >> 40664.9us >> >> I0507 09:38:17.495046 29882 outbound_call.cc:288] RPC callback for RPC >> call kudu.tserver.TabletServerService.Write -> {remote=136.243.74.42:7050 >> (slave5), user_credentials={real_user=root}} blocked reactor thread for >> 49514.6us >> >> I0507 09:46:12.664149 4507 coordinator.cc:783] Release admission control >> resources for query_id=3e4a4c646800e1d9:c859bb7f >> >> F0507 09:46:12.673912 29258 error-util.cc:148] Check failed: >> log_entry.count > 0 (-1831809966 vs. 0) >> >> Wrote minidump to /tmp/minidumps/impalad/a9113d9 >> b-bc3d-488a-1feebf9b-47b42022.dmp >> >> >> >> *Note*: >> >> We are executing the queries on 8 node cluster with the following >> configuration >> >> Cluster : 8 Node Cluster (48 GB RAM , 8 CPU Core and 2 TB hard-disk each, >> Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz >> >> >> >> >> >> -- >> >> Regards, >> >> Geetika Gupta >> > > > > -- > Regards, > Geetika Gupta > -- Todd Lipcon Software Engineer, Cloudera
Re: scan performance super bad
Can you share the code you are using to create the scanner and call nextRows? Can you also copy-paste the info provided on the web UI of the kudu master for this table? It will show the schema and partitioning information. Is it possible that your table includes a lot of deleted rows? i.e did you load the table, then delete all the rows, then load again? This can cause some performance issues in current versions of Kudu as the scanner needs to "skip over" the deleted rows before it finds any to return. Based on your description I would expect this to be doing a simple range scan for the returned rows, and return in just a few milliseconds. The fact that it is taking 500ms implies that the server is scanning a lot of non-matching rows before finding a few that match. You can also check the metrics: http://my-tablet-server:8050/metrics?metrics=scanner_rows and compare the 'rows scanned' vs 'rows returned' metric. Capture the values both before and after you run the query, and you should see if 'rows_scanned' is much larger than 'rows_returned'. -Todd On Sun, May 13, 2018 at 12:56 AM, 一米阳光 <710339...@qq.com> wrote: > hi, i have faced a difficult problem when using kudu 1.6. > > my kudu table schema is generally like this: > column name:key, type:string, prefix encoding, lz4 compression, primary key > column name:value, type:string, lz4 compression > > the primary key is built from several parts: > 001320_201803220420_0001 > the first part is a unique id, > the second part is time format string, > the third part is incremental integer(for a unique id and an fixed time, > there may exist multi value, so i used this part to distinguish > <http://dict.youdao.com/w/distinguish/#keyfrom=E2Ctranslation>) > > the table range partition use the first part, split it like below > range<005000 > 005000<= range <01 > 01<= range <015000 > 015000<= range <02 > . > . > 995000<= range > > when i want to scan data for a unique id and range of time, the lower > bound like 001320_201803220420_0001 and the higher bound like > 001320_201803230420_, it takes about 500ms to call > kuduScanner.nextRows() and the number of rows it returns is between 20~50. > All size of data between the bound is about 8000, so i should call hundreds > times nextRows() to fetch all data, and it finally cost several minutes. > > i don't know why this happened and how to resolve itmaybe the final > solution is that i should giving up kudu, using hbase instead... > -- Todd Lipcon Software Engineer, Cloudera
Re: Kudu read - performance issue
On Fri, May 11, 2018 at 12:05 PM, Todor Petrov wrote: > Hi there, > > > > I have an interesting performance issue reading from Kudu. Hopefully there > is a good explanation for it because the difference in the performance is > quite significant and it puzzles me a lot. > > > > Basically we have a table with the following schema: > > > > *Column1, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION* > > *Column2, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION* > > *…. (a bunch of int32 and int16 columns)* > > > > *PK is (Column1, Column2)* > > *HASH(Column1) PARTITIONS 4* > > > > The number of records is *~60M*. *~5K* distinct Column1 values. *~1.4M* > distinct values for Column2. > > > > All tests are made on one core. I think the hardware specs are not > important. > > > > 1) If we query all data using > > > > * val scanner = * > > *kuduClient.getAsyncScannerBuilder(table)* > > * > .addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema, > ComparisonOp.EQUAL, column1Value)).build()* > > > > We use 3 scanners in parallel (one query for each unique value of column1). > > > > All fields from the returned rows are read and some internal structures > are built. > > > > In this case, it takes *~40 sec* to load all the data. > > > > 2) If we query using “InListPredicate”, then the performance is > super slow. > > > > * val scanner = * > > *kuduClient.getAsyncScannerBuilder(table)* > > * > .addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema, > ComparisonOp.EQUAL, column1Value))* > > * .addPredicate(KuduPredicate.newInListPredicate(Column2Schema, > column2Values.asJava)).build()* > > > > Same as in 1), 3 scanners in parallel, all records are read and some > in-memory structures are built. This time column2 values are split into a > bunch of chunks and we send a request for each unique value of column1 and > each chunk of column2 values. > Are you sorting the values of 'column2' before doing the chunking? Kudu doesn't use indexes for evaluating IN-list predicates except for using the min(in-list-values) and max(in-list-values). So, if you had for example: pre-chunk in-list: 1,2,3,4,5,6 chunk 1: col2 IN (1,6) chunk 2: col2 IN (2,5) chunk 3: col2 IN (3,4) then you will actually scan over the middle portion of that table 3 times. If you sort the in-list before chunking you'll avoid the multiple-scan effect here. -Todd -- Todd Lipcon Software Engineer, Cloudera
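A sketch of the sorted chunking suggested above, written against the Java client (the chunk size and the INT32 column type are assumptions). Sorting first means each chunk covers a contiguous, non-overlapping slice of Column2 values, so the min/max of one chunk's IN-list no longer spans the whole table:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.client.KuduPredicate;

public class SortedInListChunks {
  // Sort once, then slice contiguously; each chunk's IN-list then covers a
  // disjoint range of column2 instead of straddling the whole value space.
  static List<KuduPredicate> chunkedPredicates(ColumnSchema column2Schema,
                                               List<Integer> column2Values,
                                               int chunkSize) {
    List<Integer> sorted = new ArrayList<>(column2Values);
    Collections.sort(sorted);
    List<KuduPredicate> predicates = new ArrayList<>();
    for (int i = 0; i < sorted.size(); i += chunkSize) {
      List<Integer> chunk =
          sorted.subList(i, Math.min(i + chunkSize, sorted.size()));
      predicates.add(KuduPredicate.newInListPredicate(column2Schema, chunk));
    }
    return predicates;
  }
}

Each resulting predicate would then be paired with the Column1 equality predicate on its own scanner, as in the original code.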
Re: Column Compression and Encoding
On Tue, May 8, 2018 at 9:25 AM, Saeid Sattari wrote: > Hi Todd, > > Thanks for these tips. Does compressing (LZ4,..) primary key's columns > cause performance loss? > If you have a composite primary key, Kudu already creates an internal combined column for their encoded concatenation. That internal column is already automatically compressed using PREFIX_ENCODING (because it's stored sorted, this is almost always a win) and using LZ4 (because there may be compressible patterns in non-prefix components of the composite key). So, if a column is part of the PK but not the entire PK, it will only be used on the read path when that actual column is selected, and it has the same performance impact (positive or negative) as any other column in the row. -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: Column Compression and Encoding
Hi Saeid, We've tried to make the default compression/encoding a reasonable tradeoff of performance for most common workloads. A couple quick tips I've found from my experiments: - high-cardinality strings won't be automatically compressed by dictionaries. So, if you have such a large string that might have repeated substrings (eg a set of URLs) then enabling LZ4 compression is a good idea. - if you have strings with a lot of common prefixes, you might consider PREFIX_ENCODING - for integer types, choose the smallest size that fits your intended range. eg don't use int64 for storing a customer's age. On disk it will compress to about the same size, but in memory it will use a lot more space with the larger type. Perhaps others can jump in with further recommendations based on experience. -Todd On Mon, May 7, 2018 at 1:45 AM, Saeid Sattari wrote: > Hi all, > > Folks who have used the column compression and encoding in Kudu tables: > can you share your experiences with the performance? What type of fields > are worse/better (IO bottleneck vs query return time,..) to compress. We > can collect a knowledge base regarding these subjects that users can use in > the future. Thanks. > > Regards, > > -- Todd Lipcon Software Engineer, Cloudera
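To make those tips concrete, a minimal sketch of how they map onto the Java client's column builder; the column names and the specific choices are illustrative rather than a recommendation for any particular schema:

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.ColumnSchema.CompressionAlgorithm;
import org.apache.kudu.ColumnSchema.Encoding;
import org.apache.kudu.Type;

public class ColumnOptionsSketch {
  public static void main(String[] args) {
    // High-cardinality strings with repeated substrings (e.g. URLs): dictionary
    // encoding won't help much, so add LZ4 block compression.
    ColumnSchema url = new ColumnSchema.ColumnSchemaBuilder("url", Type.STRING)
        .compressionAlgorithm(CompressionAlgorithm.LZ4)
        .build();

    // Strings sharing long common prefixes can use prefix encoding.
    ColumnSchema path = new ColumnSchema.ColumnSchemaBuilder("path", Type.STRING)
        .encoding(Encoding.PREFIX_ENCODING)
        .build();

    // Pick the smallest integer type that fits the range, e.g. an age column.
    ColumnSchema age = new ColumnSchema.ColumnSchemaBuilder("age", Type.INT8)
        .build();

    System.out.println(url);
    System.out.println(path);
    System.out.println(age);
  }
}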
Re: Kudu Exception -Couldnot find any valid location.
Hi Pranab, I see you got some help on Slack as well, but just to bring the answer here in case anyone else hits this issue: it sounds like your client is able to reach the kudu master process but unable to resolve the address of the kudu tablet servers. This might be because the tablet servers do not have a consistent DNS setup with your client -- eg they are using EC2-internal hostnames which are not resolvable from outside EC2. You can likely resolve this by setting up some kind of VPN into your EC2 VPC which includes DNS forwarding, but I'm not particularly knowledgeable about that aspect of AWS. -Todd On Sun, Apr 15, 2018 at 11:35 AM, Pranab Batsa wrote: > Hai, Can someone help me out. I am trying to connect my local system to kudu > on EC2, I am able to connect to the master and do operations like creating > table and all, but i am not able to do any operation on the table like > insert in to the table, It throws exception like :Couldnot find any valid > location. Unkown host exception. Thanks in advance for your valuable time. -- Todd Lipcon Software Engineer, Cloudera
Re: question about kudu performance!
> On Tue, Apr 3, 2018 at 7:38 PM, fengba...@uce.cn wrote: > > > > (1)version > > The component version is: > > > >- CDH 5.14 > >- Kudu 1.6 > > > > (2)Framework > > > > The size of the kudu cluster(total 10 machines,256G mem,16*1.2T sas disk): > > > > -3 master node > > > > -7 tablet server node > > > > Independent deployment of master and tablet server,but Yarn nodemanger and > > tablet server are deployed together > > > > (3) kudu parameter : > > > > maintenance_manager_num_threads=24(The number of data directories for each > > machine disk is 8) If you have 16 disks, why only 8 directories? I would recommend reducing this significantly. We usually recommend one thread for every 3 disks. > > > > memory_limit_hard_bytes=150G > > > > I have a performance problem: every 2-3 weeks, clusters start to make > > MajorDeltaCompactionOp, when kudu performs insert and update performance > > decreases, when the data is written, the update operation almost stops. > The major delta compactions should actually improve update performance, not decrease it. Do you have any more detailed metrics to explain the performance drop? If you upgrade to Kudu 1.7 the tservers will start to produce a diagnostics log. If you can send a diagnostics log segment from the point in time when the performance problem is occurring we can try to understand this behavior better. > > > Is it possible to adjust the --memory_limit_hard_bytes parameter to > > 256G*80%(my yarn nm and TS are deployed together)? If YARN is also scheduling work on these nodes, then you may end up swapping and that would really kill performance. I usually don't see improvements in Kudu performance by providing such huge amounts of memory. The one exception would be that you might get some improvement using a large block cache if your existing cache is showing a low hit rate. The metrics would help determine that. > > > > Can we adjust the parameter --tablet_history_max_age_sec to shorten the > > MajorDeltaCompactionOp interval? Nope, that won't affect the major delta compaction frequency. The one undocumented tunable that is relevant is --tablet_delta_store_major_compact_min_ratio (default 0.1). Raising this would decrease the frequency of major delta compaction, but I think there is likely something else going on here. -Todd > > > > Can you give me some suggestions to optimize this performance problem? Usually the best way to improve performance is by thinking carefully about schema design, partitioning, and workload, rather than tuning configuration. Maybe you can share more about your workload, schema, and partitioning. -Todd -- Todd Lipcon Software Engineer, Cloudera
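Applying the rule of thumb above to this hardware (16 data disks per tablet server), the maintenance thread count would come down to roughly 16 / 3 ≈ 5 rather than 24, i.e. something like:

--maintenance_manager_num_threads=5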
Re: Limitations on total amount of data stored in one kudu table
On Tue, Mar 20, 2018 at 2:15 AM, Кравец Владимир Александрович < krav...@kamatech.ru> wrote: > Hi, I'm new to Kudu and I'm trying to understand the applicability for our > purposes. So I came across the following article about the kudu limitations - > https://www.cloudera.com/documentation/enterprise/latest/topics/kudu_limitations.html#concept_cws_n4n_5z. Do I understand correctly that this > means that the maximum total amount of useful compressed stored data in > one Kudu table is 8TB? Here are my calcs: > I think there are a few mistakes below. Comments inline. > 1. Amount of stored data per tablet = Recommended maximum amount of stored > data / Recommended maximum number of tablets per tablet server = 8 000 / 2 > 000 = 4 GB per tablet > That assumes that every tablet is equally sized and that you have hit the limit on number of tablets. Even though you _can_ have 2000 tablets per server, you might want fewer. In addition, you don't need to have every tablet be the same size -- some might be 10GB while others might be 1GB or smaller. > 2. Maximum number of tablets per table for each tablet server > pre-replication = Maximum number of tablets per table for each tablet > server is 60, post-replication / number of replicas = 60 / 3 = 20 tablets > per table per tablet server > The key word that you didn't copy here is "at table-creation time". This limitation has to do with avoiding some issues we have seen when trying to create too many tablets at the same time on the cluster. With range partitioning, you can always add more partitions later. For example it's very common to add a new partition for each day. So, a single table can, after some days, have more than 20 tablets on a given server. > 3. Total amount of stored data per table, pre-replication = Amount of > stored data per tablet * Maximum number of tablets per table for each > tablet server pre-replication * Maximum number of tablet servers = 4 GB * > 20 * 100 = 8TB > Per above, this isn't really the case. For example, on one cluster at Cloudera which runs an internal workload, we have one table that is 82TB and another which is 46TB. I've seen much larger tables in some user installations as well. > And I also would like to understand how fundamental the nature of the > limitation "Maximum number of tablets per table for each tablet server is > 60, post-replication" is? Is it possible that this restriction will be removed? > See above. -Todd -- Todd Lipcon Software Engineer, Cloudera
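To make the "add more partitions later" point concrete, here is a minimal sketch in Impala SQL. The table and column names are made up, and it assumes the table was created with RANGE partitioning on a string column holding a date such as '2018-03-21':

-- Adds a new range partition (and its tablets) for the next day without
-- touching existing data; the at-table-creation-time guidance above does
-- not apply to partitions added later.
ALTER TABLE metrics ADD RANGE PARTITION '2018-03-21' <= VALUES < '2018-03-22';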
Re: 答复: A few questions for using Kudu
On Thu, Mar 15, 2018 at 8:32 PM, 张晓宁 wrote: > Thank you Dan! My follow-up comments with XiaoNing. > > > > *发件人:* Dan Burkert [mailto:danburk...@apache.org] > *发送时间:* 2018年3月16日 1:06 > *收件人:* user@kudu.apache.org > *主题:* Re: A few questions for using Kudu > > > > Hi, answers inline: > > On Thu, Mar 15, 2018 at 3:12 AM, 张晓宁 wrote: > > I have a few questions for using kudu: > > 1. As more and more data inserted to kudu, the performance > decrease. After continuous data insertion for about 30 minutes, the TPS > performance decreased with 20%, and after 1-hour data insertion, the > performance decreased with 40%. Is this a known issue? > > This is expected if you are inserting data in random order. If you try > another benchmark where you insert data in primary key sorted order, you'll > see that the performance will be much higher, and more consistent. If you > have a heavy insert workload, this kind of optimization is critical. The > table's partitioning and primary key can often be designed to make this > happen naturally, but it's a dataset dependent thing, so without more > specifics about your data it's difficult to give more precise advice. > > XiaoNing: Our table has 2 partitions,the first level partition is by > date range(using the column timestamp),one partition for one single day, > and the second partition is by a hash on 2 column(key + host).These 3 > columns(timestamp,key,host) are the primary key of the table.For you > comment “insert data in primary key sorted order”,do you mean we need to > sort the data on the 3 primary-key columns before insertion? > If timestamp is the first column then it should probably be somewhat naturally-sorted by the primary key, right? It doesn't need to be perfectly sorted, but if the inserts are in roughly PK order, we will avoid unnecessary compaction. > 2. When setting the replica number to be 1, totally I will have 2 > copy of data(1 master data + 1 replica data), is this true? > > That's incorrect. The master node does not hold any table data. If you > set the number of replicas to be 1, you will lose data if you lose the > tablet server which holds the replica. We always recommend production > workloads set number of replicas to 3 in order to have fault tolerance. > > XiaoNing: So if we want to have fault tolerance, we should at least set > the replica number to be 3, right? > That's right. -Todd -- Todd Lipcon Software Engineer, Cloudera
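To tie the above together, here is a sketch of the partitioning scheme being discussed; all names, partition counts, and bounds are made up, and the timestamp is stored as a BIGINT of microseconds since the Unix epoch. Because the timestamp leads the primary key, inserts that arrive in rough time order are also in rough primary-key order:

CREATE TABLE events (
  ts BIGINT,            -- microseconds since the Unix epoch
  event_key STRING,
  host STRING,
  val DOUBLE,
  PRIMARY KEY (ts, event_key, host)
)
PARTITION BY HASH (event_key, host) PARTITIONS 6,
             RANGE (ts) (
  PARTITION 1521072000000000 <= VALUES < 1521158400000000,  -- 2018-03-15 UTC
  PARTITION 1521158400000000 <= VALUES < 1521244800000000   -- 2018-03-16 UTC
)
STORED AS KUDU;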
Re: Kudu client close exception
t; at > org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) > at > org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) > at > org.apache.kudu.client.shaded.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) > at > org.apache.kudu.client.shaded.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > > > > Thanks > > Rainerdun > > > > > -- Todd Lipcon Software Engineer, Cloudera
Re: "broadcast" tablet replication for kudu?
I suppose in the case that the dimension table scan makes a non-trivial portion of your workload time, then yea, parallelizing the scan as you suggest would be beneficial. That said, in typical analytic queries, scanning the dimension tables is very quick compared to scanning the much-larger fact tables, so the extra parallelism on the dim table scan isn't worth too much. -Todd On Fri, Mar 16, 2018 at 2:56 PM, Mauricio Aristizabal < mauri...@impactradius.com> wrote: > @Todd I know working with parquet in the past I've seen small dimensions > that fit in 1 single file/block limit parallelism of > join/exchange/aggregation nodes, and I've forced those dims to spread > across 20 or so blocks by leveraging SET PARQUET_FILE_SIZE=8m; or similar > when doing INSERT OVERWRITE to load them, which then allows these > operations to parallelize across that many nodes. > > Wouldn't it be useful here for Cliff's small dims to be partitioned into a > couple tablets to similarly improve parallelism? > > -m > > On Fri, Mar 16, 2018 at 2:29 PM, Todd Lipcon wrote: > >> On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick wrote: >> >>> Hey Todd, >>> >>> Thanks for that explanation, as well as all the great work you're doing >>> -- it's much appreciated! I just have one last follow-up question. Reading >>> about BROADCAST operations (Kudu, Spark, Flink, etc. ) it seems the smaller >>> table is always copied in its entirety BEFORE the predicate is evaluated. >>> >> >> That's not quite true. If you have a predicate on a joined column, or on >> one of the columns in the joined table, it will be pushed down to the >> "scan" operator, which happens before the "exchange". In addition, there is >> a feature called "runtime filters" that can push dynamically-generated >> filters from one side of the exchange to the other. >> >> >>> But since the Kudu client provides a serialized scanner as part of the >>> ScanToken API, why wouldn't Impala use that instead if it knows that the >>> table is Kudu and the query has any type of predicate? Perhaps if I >>> hash-partition the table I could maybe force this (because that complicates >>> a BROADCAST)? I guess this is really a question for Impala but perhaps >>> there is a more basic reason. >>> >> >> Impala could definitely be smarter, just a matter of programming >> Kudu-specific join strategies into the optimizer. Today, the optimizer >> isn't aware of the unique properties of Kudu scans vs other storage >> mechanisms. >> >> -Todd >> >> >>> >>> -Cliff >>> >>> On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon wrote: >>> >>>> On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick < >>>> cresn...@mediamath.com> wrote: >>>> >>>>> I thought I had read that the Kudu client can configure a scan for >>>>> CLOSEST_REPLICA and assumed this was a way to take advantage of data >>>>> collocation. >>>>> >>>> >>>> Yea, when a client uses CLOSEST_REPLICA it will read a local one if >>>> available. However, that doesn't influence the higher level operation of >>>> the Impala (or Spark) planner. The planner isn't aware of the replication >>>> policy, so it will use one of the existing supported JOIN strategies. Given >>>> statistics, it will choose to broadcast the small table, which means that >>>> it will create a plan that looks like: >>>> >>>> >>>>+-+ >>>>| | >>>> +-->build JOIN | >>>> | | | >>>> | | probe | >>>> +--+ +-+ >>>> | | | >>>> | Exchange | | >>>> ++ (broadcast | | >>>> || | | >>>> |+--+ | >>>> | | >>>> +-+ | >>>> | |+---+ >>>
Re: "broadcast" tablet replication for kudu?
On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick wrote: > Hey Todd, > > Thanks for that explanation, as well as all the great work you're doing > -- it's much appreciated! I just have one last follow-up question. Reading > about BROADCAST operations (Kudu, Spark, Flink, etc. ) it seems the smaller > table is always copied in its entirety BEFORE the predicate is evaluated. > That's not quite true. If you have a predicate on a joined column, or on one of the columns in the joined table, it will be pushed down to the "scan" operator, which happens before the "exchange". In addition, there is a feature called "runtime filters" that can push dynamically-generated filters from one side of the exchange to the other. > But since the Kudu client provides a serialized scanner as part of the > ScanToken API, why wouldn't Impala use that instead if it knows that the > table is Kudu and the query has any type of predicate? Perhaps if I > hash-partition the table I could maybe force this (because that complicates > a BROADCAST)? I guess this is really a question for Impala but perhaps > there is a more basic reason. > Impala could definitely be smarter, just a matter of programming Kudu-specific join strategies into the optimizer. Today, the optimizer isn't aware of the unique properties of Kudu scans vs other storage mechanisms. -Todd > > -Cliff > > On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon wrote: > >> On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick < >> cresn...@mediamath.com> wrote: >> >>> I thought I had read that the Kudu client can configure a scan for >>> CLOSEST_REPLICA and assumed this was a way to take advantage of data >>> collocation. >>> >> >> Yea, when a client uses CLOSEST_REPLICA it will read a local one if >> available. However, that doesn't influence the higher level operation of >> the Impala (or Spark) planner. The planner isn't aware of the replication >> policy, so it will use one of the existing supported JOIN strategies. Given >> statistics, it will choose to broadcast the small table, which means that >> it will create a plan that looks like: >> >> >>+-+ >>| | >> +-->build JOIN | >> | | | >> | | probe | >> +--+ +-+ >> | | | >> | Exchange | | >> ++ (broadcast | | >> || | | >> |+--+ | >> | | >> +-+ | >> | |+---+ >> | SCAN || | >> | KUDU || SCAN (other side) | >> | || | >> +-++---+ >> >> (hopefully the ASCII art comes through) >> >> In other words, the "scan kudu" operator scans the table once, and then >> replicates the results of that scan into the JOIN operator. The "scan kudu" >> operator of course will read its local copy, but it will still go through >> the exchange process. >> >> For the use case you're talking about, where the join is just looking up >> a single row by PK in a dimension table, ideally we'd be using an >> altogether different join strategy such as nested-loop join, with the inner >> "loop" actually being a Kudu PK lookup, but that strategy isn't implemented >> by Impala. >> >> -Todd >> >> >> >>> If this exists then how far out of context is my understanding of it? >>> Reading about HDFS cache replication, I do know that Impala will choose a >>> random replica there to more evenly distribute load. But especially >>> compared to Kudu upsert, managing mutable data using Parquet is painful. >>> So, perhaps to sum thing up, if nearly 100% of my metadata scan are single >>> Primary Key lookups followed by a tiny broadcast then am I really just >>> splitting hairs performance-wise between Kudu and HDFS-cached parquet? >>
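To make the pushdown point a bit more concrete, here is a sketch in Impala SQL with made-up table names. The filter on the dimension table is evaluated at its scan, before the broadcast exchange, and the join distribution can be forced with a hint if the planner's default choice isn't what you want:

-- /* +BROADCAST */ forces the dimension side to be broadcast;
-- /* +SHUFFLE */ would force a partitioned (hash) exchange instead.
SELECT f.event_day, d.region_name, SUM(f.amount) AS total
FROM fact_events f
JOIN /* +BROADCAST */ dim_region d ON f.region_id = d.region_id
WHERE d.region_name = 'EMEA'
GROUP BY f.event_day, d.region_name;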
Re: "broadcast" tablet replication for kudu?
On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick wrote: > I thought I had read that the Kudu client can configure a scan for > CLOSEST_REPLICA and assumed this was a way to take advantage of data > collocation. > Yea, when a client uses CLOSEST_REPLICA it will read a local one if available. However, that doesn't influence the higher level operation of the Impala (or Spark) planner. The planner isn't aware of the replication policy, so it will use one of the existing supported JOIN strategies. Given statistics, it will choose to broadcast the small table, which means that it will create a plan that looks like: +-+ | | +-->build JOIN | | | | | | probe | +--+ +-+ | | | | Exchange | | ++ (broadcast | | || | | |+--+ | | | +-+ | | |+---+ | SCAN || | | KUDU || SCAN (other side) | | || | +-++---+ (hopefully the ASCII art comes through) In other words, the "scan kudu" operator scans the table once, and then replicates the results of that scan into the JOIN operator. The "scan kudu" operator of course will read its local copy, but it will still go through the exchange process. For the use case you're talking about, where the join is just looking up a single row by PK in a dimension table, ideally we'd be using an altogether different join strategy such as nested-loop join, with the inner "loop" actually being a Kudu PK lookup, but that strategy isn't implemented by Impala. -Todd > If this exists then how far out of context is my understanding of it? > Reading about HDFS cache replication, I do know that Impala will choose a > random replica there to more evenly distribute load. But especially > compared to Kudu upsert, managing mutable data using Parquet is painful. > So, perhaps to sum thing up, if nearly 100% of my metadata scan are single > Primary Key lookups followed by a tiny broadcast then am I really just > splitting hairs performance-wise between Kudu and HDFS-cached parquet? > > From: Todd Lipcon > Reply-To: "user@kudu.apache.org" > Date: Friday, March 16, 2018 at 2:51 PM > > To: "user@kudu.apache.org" > Subject: Re: "broadcast" tablet replication for kudu? > > It's worth noting that, even if your table is replicated, Impala's planner > is unaware of this fact and it will give the same plan regardless. That is > to say, rather than every node scanning its local copy, instead a single > node will perform the whole scan (assuming it's a small table) and > broadcast it from there within the scope of a single query. So, I don't > think you'll see any performance improvements on Impala queries by > attempting something like an extremely high replication count. > > I could see bumping the replication count to 5 for these tables since the > extra storage cost is low and it will ensure higher availability of the > important central tables, but I'd be surprised if there is any measurable > perf impact. > > -Todd > > On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick > wrote: > >> Thanks for that, glad I was wrong there! Aside from replication >> considerations, is it also recommended the number of tablet servers be odd? >> >> I will check forums as you suggested, but from what I read after >> searching is that Impala relies on user configured caching strategies using >> HDFS cache. The workload for these tables is very light write, maybe a >> dozen or so records per hour across 6 or 7 tables. The size of the tables >> ranges from thousands to low millions of rows so so sub-partitioning would >> not be required. 
So perhaps this is not a typical use-case but I think it >> could work quite well with kudu. >> >> From: Dan Burkert >> Reply-To: "user@kudu.apache.org" >> Date: Friday, March 16, 2018 at 2:09 PM >> To: "user@kudu.apache.org" >> Subject: Re: "broadcast&
Re: "broadcast" tablet replication for kudu?
Also should mention that we currently limit the number of replicas of a table to 7 due to the '--max-num-replicas' flag. In order to change this you'd have to enable --unlock-unsafe-flags, which means you're going into untested territory. Your mileage may vary but I wouldn't try it on a production system. -Todd On Fri, Mar 16, 2018 at 12:00 PM, Dan Burkert wrote: > > On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick > wrote: > >> Thanks for that, glad I was wrong there! Aside from replication >> considerations, is it also recommended the number of tablet servers be odd? >> > > No, so long as you have enough tablet servers to host your desired > replication factor you should be fine. In production scenarios we > typically recommend at least 4, since if you are 3x replicated and suffer a > permanent node failure, the 4th node comes in handy as a fail-over target > (Kudu will do this automatically). But above and beyond that you don't > need to worry about odd/even WRT number of tablet servers. > > - Dan > > >> >> From: Dan Burkert >> Reply-To: "user@kudu.apache.org" >> Date: Friday, March 16, 2018 at 2:09 PM >> To: "user@kudu.apache.org" >> Subject: Re: "broadcast" tablet replication for kudu? >> >> The replication count is the number of tablet servers which Kudu will >> host copies on. So if you set the replication level to 5, Kudu will put >> the data on 5 separate tablet servers. There's no built-in broadcast table >> feature; upping the replication factor is the closest thing. A couple of >> things to keep in mind: >> >> - Always use an odd replication count. This is important due to how the >> Raft algorithm works. Recent versions of Kudu won't even let you specify >> an even number without flipping some flags. >> - We don't test much much beyond 5 replicas. It *should* work, but you >> may run in to issues since it's a relatively rare configuration. With a >> heavy write workload and many replicas you are even more likely to >> encounter issues. >> >> It's also worth checking in an Impala forum whether it has features that >> make joins against small broadcast tables better? Perhaps Impala can cache >> small tables locally when doing joins. >> >> - Dan >> >> On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick < >> cresn...@mediamath.com> wrote: >> >>> The problem is, AFIK, that replication count is not necessarily the >>> distribution count, so you can't guarantee all tablet servers will have a >>> copy. >>> >>> On Mar 16, 2018 1:41 PM, Boris Tyukin wrote: >>> I'm new to Kudu but we are also going to use Impala mostly with Kudu. We >>> have a few tables that are small but used a lot. My plan is replicate them >>> more than 3 times. When you create a kudu table, you can specify number of >>> replicated copies (3 by default) and I guess you can put there a number, >>> corresponding to your node count in cluster. The downside, you cannot >>> change that number unless you recreate a table. >>> >>> On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick >>> wrote: >>> >>>> We will soon be moving our analytics from AWS Redshift to Impala/Kudu. >>>> One Redshift feature that we will miss is its ALL Distribution, where a >>>> copy of a table is maintained on each server. We define a number of >>>> metadata tables this way since they are used in nearly every query. We are >>>> considering using parquet in HDFS cache for these, and Kudu would be a much >>>> better fit for the update semantics but we are worried about the additional >>>> contention. 
I'm wondering if having a Broadcast, or ALL, tablet >>>> replication might be an easy feature to add to Kudu? >>>> >>>> -Cliff >>>> >>> >>> >> > -- Todd Lipcon Software Engineer, Cloudera
Re: "broadcast" tablet replication for kudu?
It's worth noting that, even if your table is replicated, Impala's planner is unaware of this fact and it will give the same plan regardless. That is to say, rather than every node scanning its local copy, instead a single node will perform the whole scan (assuming it's a small table) and broadcast it from there within the scope of a single query. So, I don't think you'll see any performance improvements on Impala queries by attempting something like an extremely high replication count. I could see bumping the replication count to 5 for these tables since the extra storage cost is low and it will ensure higher availability of the important central tables, but I'd be surprised if there is any measurable perf impact. -Todd On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick wrote: > Thanks for that, glad I was wrong there! Aside from replication > considerations, is it also recommended the number of tablet servers be odd? > > I will check forums as you suggested, but from what I read after searching > is that Impala relies on user configured caching strategies using HDFS > cache. The workload for these tables is very light write, maybe a dozen or > so records per hour across 6 or 7 tables. The size of the tables ranges > from thousands to low millions of rows so so sub-partitioning would not be > required. So perhaps this is not a typical use-case but I think it could > work quite well with kudu. > > From: Dan Burkert > Reply-To: "user@kudu.apache.org" > Date: Friday, March 16, 2018 at 2:09 PM > To: "user@kudu.apache.org" > Subject: Re: "broadcast" tablet replication for kudu? > > The replication count is the number of tablet servers which Kudu will host > copies on. So if you set the replication level to 5, Kudu will put the > data on 5 separate tablet servers. There's no built-in broadcast table > feature; upping the replication factor is the closest thing. A couple of > things to keep in mind: > > - Always use an odd replication count. This is important due to how the > Raft algorithm works. Recent versions of Kudu won't even let you specify > an even number without flipping some flags. > - We don't test much much beyond 5 replicas. It *should* work, but you > may run in to issues since it's a relatively rare configuration. With a > heavy write workload and many replicas you are even more likely to > encounter issues. > > It's also worth checking in an Impala forum whether it has features that > make joins against small broadcast tables better? Perhaps Impala can cache > small tables locally when doing joins. > > - Dan > > On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick > wrote: > >> The problem is, AFIK, that replication count is not necessarily the >> distribution count, so you can't guarantee all tablet servers will have a >> copy. >> >> On Mar 16, 2018 1:41 PM, Boris Tyukin wrote: >> I'm new to Kudu but we are also going to use Impala mostly with Kudu. We >> have a few tables that are small but used a lot. My plan is replicate them >> more than 3 times. When you create a kudu table, you can specify number of >> replicated copies (3 by default) and I guess you can put there a number, >> corresponding to your node count in cluster. The downside, you cannot >> change that number unless you recreate a table. >> >> On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick wrote: >> >>> We will soon be moving our analytics from AWS Redshift to Impala/Kudu. >>> One Redshift feature that we will miss is its ALL Distribution, where a >>> copy of a table is maintained on each server. 
We define a number of >>> metadata tables this way since they are used in nearly every query. We are >>> considering using parquet in HDFS cache for these, and Kudu would be a much >>> better fit for the update semantics but we are worried about the additional >>> contention. I'm wondering if having a Broadcast, or ALL, tablet >>> replication might be an easy feature to add to Kudu? >>> >>> -Cliff >>> >> >> > -- Todd Lipcon Software Engineer, Cloudera
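If you do decide to bump the replication factor of a small, frequently joined table as discussed above, it can be set from Impala at table-creation time through a table property. A sketch with made-up names (per the thread, the factor should be odd and cannot be changed without recreating the table):

CREATE TABLE dim_campaign (
  campaign_id BIGINT,
  campaign_name STRING,
  PRIMARY KEY (campaign_id)
)
PARTITION BY HASH (campaign_id) PARTITIONS 2
STORED AS KUDU
TBLPROPERTIES ('kudu.num_tablet_replicas' = '5');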
Re: Follow-up for "Kudu cluster performance cannot grow up with machines added"
On Mon, Mar 12, 2018 at 7:08 PM, 张晓宁 wrote: > To your and Brock’s questions, my answers are as below. > > What client are you using to benchmark? You might also be bound by the client > performance. > > My Answer: we are using many force testing machines to test the highest > TPS on kudu. Our testing client should have enough ability. > But, specifically, what client? Is it something you build directly using the Java client? The C++ client? How many threads are you using? Which flush mode are you using to write? What buffer sizes are you using? > I'd verify that the new nodes are assigned tablets? Along with considering > an increase the number of partitions on the table being tested. > > My Answer: Yes, with machines added each time, I created a new table for > testing so that tablets can be assigned to new machines. For the partition > strategy, I am using 2-level partitions: the first level is a range > partition by date(I use 3 partitions here, meaning 3-days data), and the > second level is a hash partition(I use 3, 6, and 9 respectively for the > clusters with 3, 6, and 9 tservers). > > Did you delete the original table and wait some time before creating the new table? Otherwise, you will see a skewed distribution where the new table will have most of its replicas placed on the new empty machines. For example: 1) with 6 servers, create table with 18 partitions -- it will evenly spread replicas on those 6 nodes (probably 9 each) 2) add 3 empty servers, create a new table with 27 partitions -- the new table will probably have about 18 partitions on the new nodes and 3 on the existing nodes (6:1 skew) 3) same again -- the new table will likely have most of its partitions on those 3 empty nodes again Of course with skew like that, you'll probably see that those new tables do not perform well since most of the work would be on a smaller subset of nodes. If you delete the tables in between the steps you should see a more even distribution. Another possibility that you may be hitting is that our buffering in the clients is currently cluster-wide. In other words, each time you apply an operation, it checks if the total buffer limit has been reached, and if it has, it flushes the pending writes to all tablets. Only once all of those writes are complete is the batch considered "completed", freeing up space for the next batch of writes to be buffered. This means that, as the number of tablets and tablet servers grow, the completion time for the batch is increasingly dominated by the high-percentile latencies of the writes rather than the average, causing per-client throughput to drop. This is tracked by KUDU-1693. I believe there was another JIRA somewhere related as well, but can't seem to find it. Unfortunately fixing it is not straightforward, though would have good impact for these cases where a single writer is fanning out to tens or hundreds of tablets. -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: Kudu cluster performance cannot grow up with machines added
What client are you using to benchmark? You might also be bound by the client performance. On Mar 11, 2018 2:04 PM, "Brock Noland" wrote: > Hi, > > I'd verify that the new nodes are assigned tablets? Along with > considering an increase the number of partitions on the table being tested. > > On Sun, Mar 11, 2018 at 4:44 AM, 张晓宁 wrote: > >> Hi All: >> >> >> >> I am testing kudu cluster performance these days, and found that the >> performance cannot grow up with machines added. With 1-master and 6-slaves, >> I got 3.5M TPS(350 million rows inserted per second), while when I add 3 >> more machines to the cluster, I only got 3.8M TPS. It seems the performance >> cannot grow up linearly with cluster extension. Why is that? Can anyone >> give some help? Thanks! >> >> >> >> [image: mail-signature-20171122] >> >> >> > >
Re: Kudu as a Graphite backend
Hey Mark, Yea, I wrote the original Graphite integration in the samples repo several years ago (prior to Kudu 0.5 even), but it was more of a quick prototype in order to have a demo of the Python client rather than something meant to be used in a production scenario. Of course with some work it could probably be updated and made more "real". You may also be interested in the 'kudu-ts' project that Dan Burkert started: https://github.com/danburkert/kudu-ts It provides an OpenTSDB-compatible interface on top of Kudu. Unfortunately it's also somewhat incomplete but could provide a decent starting point for a time series workload. It would be great if you wanted to contribute to either the graphite-kudu integration or kudu-ts. Neither is getting the love they deserve right now. -Todd On Mon, Mar 5, 2018 at 7:38 AM, Paul Brannan wrote: > Do you want to use kudu as a backend for carbon (i.e. have graphite/carbon > receive metrics and write them to kudu), or do you want to use graphite-web > as a frontend for timeseries you already have in kudu? Both are mostly > straightforward; see e.g. https://github.com/criteo/ > biggraphite/tree/master/biggraphite/plugins for an example of each. > AFAICT the examples (https://github.com/cloudera/ > kudu-examples/tree/master/python/graphite-kudu) are just a graphite-web > finder; you'd still need a carbon backend for storage, unless you insert > the data through a different mechanism. > > On Mon, Mar 5, 2018 at 6:15 AM, Mark Meyer wrote: > >> Hi List, >> has anybody experiences running Kudu as a Graphite backend in production? >> I've been looking at the samples repository, but have been unsure, >> primarily because of the 'samples' tag associated with the code. >> >> Best, Mark >> >> -- >> Mark Meyer >> Systems Engineer >> mark.me...@smaato.com >> Smaato Inc. >> San Francisco – New York – Hamburg – Singapore >> www.smaato.com >> >> Valentinskamp 70 >> <https://maps.google.com/?q=Valentinskamp+70&entry=gmail&source=g>, >> Emporio, 19th Floor >> 20355 Hamburg >> T: 0049 (40) 3480 949 0 >> F: 0049 (40) 492 19 055 >> >> The information contained in this communication may be CONFIDENTIAL and >> is intended only for the use of the recipient(s) named above. If you are >> not the intended recipient, you are hereby notified that any dissemination, >> distribution, or copying of this communication, or any of its contents, is >> strictly prohibited. If you have received this communication in error, >> please notify the sender and delete/destroy the original message and any >> copy of it from your computer or paper files. >> > > -- Todd Lipcon Software Engineer, Cloudera
Re: Impala Parquet to Kudu 1.5 - severe ingest performance degradation
Hi Boris, Thanks for the feedback and sharing your experience. Like you said, this is more of an issue with downstream documentation so it's probably not appropriate to continue discussing this in the context of the Apache Kudu open source project. For feedback on CDH it's best to direct that towards the associated vendor (Cloudera). Even though I happen to be employed by them, I and other employees participate here in the Apache Kudu project as individuals and it's important to keep the distinction separate. Kudu is a product of the ASF non-profit organization, not a product of any commercial vendor. -Todd On Wed, Feb 28, 2018 at 6:17 AM, Boris Tyukin wrote: > first of all, thanks for a very quick response. It means a lot to know you > guys stand behind Kudu and work with users like myself. In fact, we get a > better support here than through the official channels :) > > That said, I needed to vent a bit :) Granted this should be probably > directed at Impala devs but I think Kudu has been impacted the most. > > we ran a cluster health check command and saw some warning in there. But > we proceeded with running insert statement as Todd suggested with hints and > now we are back to our time we've used to get with Kudu 1.3 / Impala 2.8 > > it is a bit frustrating I must say that these changes impacted performance > so dramatically. In my opinion, Kudu and CDH release notes must have stuff > like that in BOLD RED colors so it does not catch users by surprise. And > not everyone so organized like us - we knew exactly how much time it took > with Kudu 1.3 /Impala 2.8. > > I've tracked down all the related JIRAs and I did not see a word about a > dramatic performance degradation. I did see words like 'optimized' and > 'improved'. > > Since it is part of the official CDH distro, I would expect a little bit > more proactive warning. > > If you want to open a JIRA on this, would be happy to do it...We see this > degradation all across our tables and as you can see from my example query, > it is a really straight select from a table - no joins, no predicates and > no complex calculations. > > Thanks again, > Boris > > On Thu, Feb 22, 2018 at 2:44 PM, Todd Lipcon wrote: > >> In addition to what Hao suggests, I think it's worth noting that the >> insert query plan created by Impala changed a bit over time. >> >> It sounds like you upgraded from Impala 2.8 (in CDH 5.11), which used a >> very straightforward insert plan - each node separately inserted rows in >> whatever order the rows were consumed. This plan worked well for smaller >> inserts but could cause timeouts with larger workloads. >> >> In Impala 2.9, the plan was changed so that Impala performs some >> shuffling and sorting before inserting into Kudu. This makes the Kudu >> insert pattern more reliable and efficient, but could cause a degradation >> for some workloads since Impala's sorts are single-threaded. >> >> Impala 2.10 (which I guess you are running) improved a bit over 2.9 in >> ensuring that the sorts can be "partial" which resolved some of the >> performance degradation, but it's possible your workload is still affected >> negatively. >> >> To disable the new behavior you can use the insert hints 'noshuffle' >> and/or 'noclustered', such as: >> >> upsert into my_table /* +noclustered,noshuffle */ select * from >> my_other_table; >> >> >> Hope that helps >> -Todd >> >> On Thu, Feb 22, 2018 at 11:02 AM, Hao Hao wrote: >> >>> Did you happen to check the health of the cluster after the upgrade by 'kudu >>> cluster ksck'? 
>>> >>> Best, >>> Hao >>> >>> On Thu, Feb 22, 2018 at 6:31 AM, Boris Tyukin >>> wrote: >>> >>>> Hello, >>>> >>>> we just upgraded our dev cluster from Kudu 1.3 to kudu 1.5.0-cdh5.13.1 >>>> and noticed quite severe performance degradation. We did CTAS from Impala >>>> parquet table which has not changed a bit since the upgrade (even the same >>>> # of rows) to Kudu using the follow query below. >>>> >>>> It used to take 11-11.5 hours on Kudu 1.3 and now taking 50-55 hours. >>>> >>>> Of course Impala version was also bumped with CDH 5.13. >>>> >>>> Any clue why it takes so much time now? >>>> >>>> Table has 5.5B rows.. >>>> >>>> create TABLE kudutest_ts.clinical_event_nots >>>> >>>> PRIMARY KEY (clinical_event_id) >>>> >>>> PARTI
Re: swap data in Kudu table
A couple other ideas from the Impala side: - could you use a view and alter the view to point to a different table? Then all readers would be pointed at the view, and security permissions could be on that view rather than the underlying tables? - I think if you use an external table in Impala you could use an ALTER TABLE TBLPROPERTIES ... statement to change kudu.table_name to point to a different table. Then issue a 'refresh' on the impalads so that they load the new metadata. Subsequent queries would hit the new underlying Kudu table, but permissions and stats would be unchanged. -Todd On Fri, Feb 23, 2018 at 1:16 PM, Mike Percy wrote: > Hi Boris, those are good ideas. Currently Kudu does not have atomic bulk > load capabilities or staging abilities. Theoretically renaming a partition > atomically shouldn't be that hard to implement, since it's just a master > metadata operation which can be done atomically, but it's not yet > implemented. > > There is a JIRA to track a generic bulk load API here: > https://issues.apache.org/jira/browse/KUDU-1370 > > Since I couldn't find anything to track the specific features you > mentioned, I just filed the following improvement JIRAs so we can track it: > >- KUDU-2326: Support atomic bulk load operation ><https://issues.apache.org/jira/browse/KUDU-2326> >- KUDU-2327: Support atomic swap of tables or partitions ><https://issues.apache.org/jira/browse/KUDU-2327> > > Mike > > On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin > wrote: > >> Hello, >> >> I am trying to figure out the best and safest way to swap data in a >> production Kudu table with data from a staging table. >> >> Basically, once in a while we need to perform a full reload of some >> tables (once in a few months). These tables are pretty large with billions >> of rows and we want to minimize the risk and downtime for users if >> something bad happens in the middle of that process. >> >> With Hive and Impala on HDFS, we can use a very cool handy command LOAD >> DATA INPATH. We can prepare data for reload in a staging table upfront and >> this process might take many hours. Once staging table is ready, we can >> issue LOAD DATA INPATH command which will move underlying HDFS files to a >> production table - this operation is almost instant and the very last step >> in our pipeline. >> >> Alternatively, we can swap partitions using ALTER TABLE EXCHANGE >> PARTITION command. >> >> Now with Kudu, I cannot seem to find a good strategy. The only thing came >> to my mind is to drop the production table and rename a staging table to >> production table as the last step of the job, but in this case we are going >> to lose statistics and security permissions. >> >> Any other ideas? >> >> Thanks! >> Boris >> > > -- Todd Lipcon Software Engineer, Cloudera
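A sketch of what those two options could look like in Impala SQL; every database, table, and Kudu table name below is made up:

-- Option 1: readers query the view; repointing it at the freshly loaded
-- staging table is a quick metadata-only change.
ALTER VIEW analytics.clinical_event AS
SELECT * FROM analytics.clinical_event_staging;

-- Option 2: for an external Impala table backed by Kudu, swap the
-- underlying Kudu table it points at, then reload the metadata.
ALTER TABLE analytics.clinical_event_ext
SET TBLPROPERTIES ('kudu.table_name' = 'impala::analytics.clinical_event_staging');
REFRESH analytics.clinical_event_ext;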
Re: Impala Parquet to Kudu 1.5 - severe ingest performance degradation
In addition to what Hao suggests, I think it's worth noting that the insert query plan created by Impala changed a bit over time. It sounds like you upgraded from Impala 2.8 (in CDH 5.11), which used a very straightforward insert plan - each node separately inserted rows in whatever order the rows were consumed. This plan worked well for smaller inserts but could cause timeouts with larger workloads. In Impala 2.9, the plan was changed so that Impala performs some shuffling and sorting before inserting into Kudu. This makes the Kudu insert pattern more reliable and efficient, but could cause a degradation for some workloads since Impala's sorts are single-threaded. Impala 2.10 (which I guess you are running) improved a bit over 2.9 in ensuring that the sorts can be "partial" which resolved some of the performance degradation, but it's possible your workload is still affected negatively. To disable the new behavior you can use the insert hints 'noshuffle' and/or 'noclustered', such as: upsert into my_table /* +noclustered,noshuffle */ select * from my_other_table; Hope that helps -Todd On Thu, Feb 22, 2018 at 11:02 AM, Hao Hao wrote: > Did you happen to check the health of the cluster after the upgrade by 'kudu > cluster ksck'? > > Best, > Hao > > On Thu, Feb 22, 2018 at 6:31 AM, Boris Tyukin > wrote: > >> Hello, >> >> we just upgraded our dev cluster from Kudu 1.3 to kudu 1.5.0-cdh5.13.1 >> and noticed quite severe performance degradation. We did CTAS from Impala >> parquet table which has not changed a bit since the upgrade (even the same >> # of rows) to Kudu using the follow query below. >> >> It used to take 11-11.5 hours on Kudu 1.3 and now taking 50-55 hours. >> >> Of course Impala version was also bumped with CDH 5.13. >> >> Any clue why it takes so much time now? >> >> Table has 5.5B rows.. 
>> >> create TABLE kudutest_ts.clinical_event_nots >> >> PRIMARY KEY (clinical_event_id) >> >> PARTITION BY HASH(clinical_event_id) PARTITIONS 120 >> >> STORED AS KUDU >> >> AS >> >> SELECT >> >> clinical_event_id, >> >> encntr_id, >> >> person_id, >> >> encntr_financial_id, >> >> event_id, >> >> event_title_text, >> >> CAST(view_level as string) as view_level, >> >> order_id, >> >> catalog_cd, >> >> series_ref_nbr, >> >> accession_nbr, >> >> contributor_system_cd, >> >> reference_nbr, >> >> parent_event_id, >> >> event_reltn_cd, >> >> event_class_cd, >> >> event_cd, >> >> event_tag, >> >> CAST(event_end_dt_tm_os as BIGINT) as event_end_dt_tm_os, >> >> result_val, >> >> result_units_cd, >> >> result_time_units_cd, >> >> task_assay_cd, >> >> record_status_cd, >> >> result_status_cd, >> >> CAST(authentic_flag as STRING) authentic_flag, >> >> CAST(publish_flag as STRING) publish_flag, >> >> qc_review_cd, >> >> normalcy_cd, >> >> normalcy_method_cd, >> >> inquire_security_cd, >> >> resource_group_cd, >> >> resource_cd, >> >> CAST(subtable_bit_map as STRING) subtable_bit_map, >> >> collating_seq, >> >> verified_prsnl_id, >> >> performed_prsnl_id, >> >> updt_id, >> >> CAST(updt_task as STRING) updt_task, >> >> updt_cnt, >> >> CAST(updt_applctx as STRING) updt_applctx, >> >> normal_low, >> >> normal_high, >> >> critical_low, >> >> critical_high, >> >> CAST(event_tag_set_flag as STRING) event_tag_set_flag, >> >> CAST(note_importance_bit_map as STRING) note_importance_bit_map, >> >> CAST(order_action_sequence as STRING) order_action_sequence, >> >> entry_mode_cd, >> >> source_cd, >> >> clinical_seq, >> >> CAST(event_end_tz as STRING) event_end_tz, >> >> CAST(event_start_tz as STRING) event_start_tz, >> >> CAST(performed_tz as STRING) performed_tz, >> >> CAST(verified_tz as STRING) verified_tz, >> >> task_assay_version_nbr, >> >> modifier_long_text_id, >> >> ce_dynamic_label_id, >> >> CAST(nomen_string_flag as STRING) nomen_string_flag, >> >> src_event_id, >> >> CAST(last_utc_ts as BIGINT) last_utc_ts, >> >> device_free_txt, >> >> CAST(trait_bit_map as STRING) trait_bit_map, >> >> CAST(clu_subkey1_flag as STRING) clu_subkey1_flag, >> >> CAST(clinsig_updt_dt_tm as BIGINT) clinsig_updt_dt_tm, >> >> CAST(event_end_dt_tm as BIGINT) event_end_dt_tm, >> >> CAST(event_start_dt_tm as BIGINT) event_start_dt_tm, >> >> CAST(expiration_dt_tm as BIGINT) expiration_dt_tm, >> >> CAST(verified_dt_tm as BIGINT) verified_dt_tm, >> >> CAST(src_clinsig_updt_dt_tm as BIGINT) src_clinsig_updt_dt_tm, >> >> CAST(updt_dt_tm as BIGINT) updt_dt_tm, >> >> CAST(valid_from_dt_tm as BIGINT) valid_from_dt_tm, >> >> CAST(valid_until_dt_tm as BIGINT) valid_until_dt_tm, >> >> CAST(performed_dt_tm as BIGINT) performed_dt_tm, >> >> txn_id_text, >> >> CAST(ingest_dt_tm as BIGINT) ingest_dt_tm >> >> FROM v500.clinical_event >> > > -- Todd Lipcon Software Engineer, Cloudera
Re: Renaming hostname of tserver/master
Hey Pavel, Did you find a workaround for this? As you found it's not something we support easily today, because each server will "remember" the old host names of all of its replication peers within its consensus metadata, and continue to try to talk to them after a restart. Renaming hosts would involve writing the consensus metadata files on all of the servers with the updated names. We're soon facing this same issue with one of the test clusters at Cloudera, so I'm guessing we'll have to write some kind of tool to get it back online as well :) Would be curious to hear what approach you took. -Todd On Tue, Jan 30, 2018 at 11:08 PM, Pavel Martynov wrote: > Ok, I found ticket https://issues.apache.org/jira/browse/KUDU-418, which > fired at me. > -- Todd Lipcon Software Engineer, Cloudera
Re: Using Kudu to Handle Huge amount of Data
Hi JP, Answers inline... On Thu, Feb 1, 2018 at 9:45 PM, Jp Gupta wrote: > Hi, > As an existing HBase user, we handle close to 20TB of data everyday. > What does "handle" mean in this case? You are inserting 20TB of new data each day, so that your total dataset grows by that amount? How much data do you retain? How many nodes is your cluster? (I would guess many hundred?) > > While we are contemplating on moving to Kudu to take advantage of the new > technology, I am yet to hear of an real industry use case where Kudu is > being to used to handle of huge amount of data. > If you are seeing Kudu as an "improved HBase" that isn't really accurate. Of course there are some things we can do better than HBase, but there are some things HBase can do better than Kudu. As for Kudu data sizes, I am aware of some organizations storing several hundred TB in a Kudu cluster, but I have not yet heard of a use case with 1PB+. If you are looking to run at that scale you may hit some issues, but we are standing ready to help you overcome them. I don't see any fundamental problems that would prevent it, and I have run some basic smoke tests of Kudu on ~800 nodes before. > > Looking forward to your inputs on any organisation using Kudu where data > volumes of more than 10 TB is ingested everyday. > Hope some other users can chime in. -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: Bulk / Initial load of large tables into Kudu using Spark
On Mon, Jan 29, 2018 at 1:19 PM, Boris Tyukin wrote: > thank you both. Does it make a difference from performance perspective > though if I do a bulk load through Impala versus Spark? is the Kudu client > with Spark will be faster than Impala? > Impala in recent versions has some tricks it does to pre-sort and pre-shuffle the data to avoid compactions in Kudu during the insert. Spark does not currently have these optimizations. So I would guess that Impala would be able to bulk load large datasets more efficiently than Spark for the time being. -Todd > > On Mon, Jan 29, 2018 at 2:22 PM, Todd Lipcon wrote: > >> On Mon, Jan 29, 2018 at 11:18 AM, Patrick Angeles >> wrote: >> >>> Hi Boris. >>> >>> 1) I would like to bypass Impala as data for my bulk load coming from >>>> sqoop and avro files are stored on HDFS. >>>> >>> What's the objection to Impala? In the example below, Impala reads from >>> an HDFS-resident table, and writes to the Kudu table. >>> >>> >>>> 2) we do not want to deal with MapReduce. >>>> >>> >>> You can still use Spark... the MR reference is in regards to the >>> Input/OutputFormat classes, which are defined in Hadoop MR. Spark can use >>> these. See, for example: >>> >>> https://dzone.com/articles/implementing-hadoops-input-format >>> -and-output-forma >>> >> >> While that's possible I'd recommend using the dataframes API instead. eg >> see https://kudu.apache.org/docs/developing.html#_kudu_integ >> ration_with_spark >> >> That should work as well (or better) than the MR outputformat. >> >> -Todd >> >> >> >>> However, you'll have to write (simple) Spark code, whereas with method >>> #1 you do effectively the same thing under the covers using SQL statements >>> via Impala. >>> >>> >>>> >>>> Thanks! >>>> What’s the most efficient way to bulk load data into Kudu? >>>> <https://kudu.apache.org/faq.html#whats-the-most-efficient-way-to-bulk-load-data-into-kudu> >>>> >>>> The easiest way to load data into Kudu is if the data is already >>>> managed by Impala. In this case, a simple INSERT INTO TABLE >>>> some_kudu_table SELECT * FROM some_csv_tabledoes the trick. >>>> >>>> You can also use Kudu’s MapReduce OutputFormat to load data from HDFS, >>>> HBase, or any other data store that has an InputFormat. >>>> >>>> No tool is provided to load data directly into Kudu’s on-disk data >>>> format. We have found that for many workloads, the insert performance of >>>> Kudu is comparable to bulk load performance of other systems. >>>> >>> >>> >> >> >> -- >> Todd Lipcon >> Software Engineer, Cloudera >> > > -- Todd Lipcon Software Engineer, Cloudera
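For reference, the Impala bulk-load path being described is just a CTAS (or INSERT ... SELECT) from the HDFS-resident table; a sketch with made-up names:

-- Recent Impala versions shuffle and partially sort the rows by Kudu
-- partition before inserting, which reduces compaction work in Kudu.
CREATE TABLE kudu_db.orders
PRIMARY KEY (order_id)
PARTITION BY HASH (order_id) PARTITIONS 32
STORED AS KUDU
AS SELECT * FROM staging_db.orders_avro;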
Re: Bulk / Initial load of large tables into Kudu using Spark
On Mon, Jan 29, 2018 at 11:18 AM, Patrick Angeles wrote: > Hi Boris. > > 1) I would like to bypass Impala as data for my bulk load coming from >> sqoop and avro files are stored on HDFS. >> > What's the objection to Impala? In the example below, Impala reads from an > HDFS-resident table, and writes to the Kudu table. > > >> 2) we do not want to deal with MapReduce. >> > > You can still use Spark... the MR reference is in regards to the > Input/OutputFormat classes, which are defined in Hadoop MR. Spark can use > these. See, for example: > > https://dzone.com/articles/implementing-hadoops-input- > format-and-output-forma > While that's possible I'd recommend using the dataframes API instead. eg see https://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark That should work as well (or better) than the MR outputformat. -Todd > However, you'll have to write (simple) Spark code, whereas with method #1 > you do effectively the same thing under the covers using SQL statements via > Impala. > > >> >> Thanks! >> What’s the most efficient way to bulk load data into Kudu? >> <https://kudu.apache.org/faq.html#whats-the-most-efficient-way-to-bulk-load-data-into-kudu> >> >> The easiest way to load data into Kudu is if the data is already managed >> by Impala. In this case, a simple INSERT INTO TABLE some_kudu_table >> SELECT * FROM some_csv_tabledoes the trick. >> >> You can also use Kudu’s MapReduce OutputFormat to load data from HDFS, >> HBase, or any other data store that has an InputFormat. >> >> No tool is provided to load data directly into Kudu’s on-disk data >> format. We have found that for many workloads, the insert performance of >> Kudu is comparable to bulk load performance of other systems. >> > > -- Todd Lipcon Software Engineer, Cloudera
Re: new Kudu benchmarks
Thanks for making the updates. I tweeted it from my account and from @ApacheKudu. feel free to retweet! -Todd On Sat, Jan 6, 2018 at 1:10 PM, Boris Tyukin wrote: > thanks Todd, updated my post with that info and also changes title a bit. > thanks again for your feedback! look forward to new releases coming up! > > Boris > > On Fri, Jan 5, 2018 at 9:08 PM, Todd Lipcon wrote: > >> On Fri, Jan 5, 2018 at 5:50 PM, Boris Tyukin >> wrote: >> >>> Hi Todd, >>> >>> thanks for your feedback! sure will be happy to update my post with your >>> suggestions. I am not sure Apache Parquet will be clear though as some >>> might understand it as using parquet files with Hive or Spark. What do you >>> think about "Impala on Kudu vs Impala on Parquet"? Realistically, for BI >>> users, Impala is the only option now with Kudu. Not many typical users will >>> use Kudu API clients or even Spark and Hive serde for Kudu does not exist. >>> >> >> I think "Impala on Kudu vs Parquet" or "Impala Storage Comparison: Kudu >> vs Parquet" or something would be a reasonable title. >> >> >>> >>> As for decimals, this is exciting news. Where can I found info about >>> timestamp support? I saw this JIRA >>> https://issues.apache.org/jira/browse/IMPALA-5137 >>> >>> but I was a bit confused by the actual change. It looked like a >>> workaround to do a conversion on the fly for impala but not actually store >>> proper timestamps in Kudu. Maybe I misread that. I thought the idea was to >>> add a proper support in Kudu so timestamp can be used as a type with other >>> clients not only Impala. If you can clarify that, it would be great >>> >> >> What we implemented is "proper" timestamp support in Kudu, but you're >> right that there is some conversion going on under the hood. The reasoning >> is that Impala internally uses a 96-bit timestamp representation which >> supports a very large range of dates at nanosecond precision. This is more >> than is required by the SQL standard and doesn't match the timestamp >> representation used by other ecosystem components. As far as I know, Impala >> is planning on moving to a 64-bit timestamp representation with microsecond >> precision, so that's what Kudu implemented internally. With 64 bits there >> is still enough range to store dates for 584,554 years at microsecond >> precision. >> >> I think https://impala.apache.org/docs/build/html/topics/impal >> a_timestamp.html has some info about Kudu compatibility and limitations. >> >> -Todd >> -- >> Todd Lipcon >> Software Engineer, Cloudera >> > > -- Todd Lipcon Software Engineer, Cloudera
Re: new Kudu benchmarks
On Fri, Jan 5, 2018 at 5:50 PM, Boris Tyukin wrote: > Hi Todd, > > thanks for your feedback! sure will be happy to update my post with your > suggestions. I am not sure Apache Parquet will be clear though as some > might understand it as using parquet files with Hive or Spark. What do you > think about "Impala on Kudu vs Impala on Parquet"? Realistically, for BI > users, Impala is the only option now with Kudu. Not many typical users will > use Kudu API clients or even Spark and Hive serde for Kudu does not exist. > I think "Impala on Kudu vs Parquet" or "Impala Storage Comparison: Kudu vs Parquet" or something would be a reasonable title. > > As for decimals, this is exciting news. Where can I found info about > timestamp support? I saw this JIRA > https://issues.apache.org/jira/browse/IMPALA-5137 > > but I was a bit confused by the actual change. It looked like a workaround > to do a conversion on the fly for impala but not actually store proper > timestamps in Kudu. Maybe I misread that. I thought the idea was to add a > proper support in Kudu so timestamp can be used as a type with other > clients not only Impala. If you can clarify that, it would be great > What we implemented is "proper" timestamp support in Kudu, but you're right that there is some conversion going on under the hood. The reasoning is that Impala internally uses a 96-bit timestamp representation which supports a very large range of dates at nanosecond precision. This is more than is required by the SQL standard and doesn't match the timestamp representation used by other ecosystem components. As far as I know, Impala is planning on moving to a 64-bit timestamp representation with microsecond precision, so that's what Kudu implemented internally. With 64 bits there is still enough range to store dates for 584,554 years at microsecond precision. I think https://impala.apache.org/docs/build/html/topics/impala_timestamp.html has some info about Kudu compatibility and limitations. -Todd -- Todd Lipcon Software Engineer, Cloudera
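As a small illustration of the microsecond representation (a sketch only, with made-up names): if timestamps were stored in a BIGINT column of microseconds since the epoch, they can be converted back at query time, since Impala interprets an integer cast to TIMESTAMP as seconds since the epoch:

SELECT event_id,
       CAST(ts_micros DIV 1000000 AS TIMESTAMP) AS event_ts
FROM events_micros
LIMIT 10;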
Re: new Kudu benchmarks
Hey Mauricio, Answers inline below On Fri, Jan 5, 2018 at 2:50 PM, Mauricio Aristizabal < mauri...@impactradius.com> wrote: > Todd, since you bring it up in this thread... what CDH version do you > expect DECIMAL support to make it into? I recently asked Icaro Vazquez > about it but still no news. We're hoping it makes it into 5.14 otherwise > according to the roadmap there might not be another minor release and we'd > be waiting till Summer for CDH 6. > As this is an open source project mailing list, it would be inappropriate for me to comment on a vendor's release schedule. Please note that Kudu is a product of the Apache Software Foundation and the ASF doesn't have any influence on or knowledge of Cloudera's release plans. Of course it happens that I and many other contributors are also employees of Cloudera, but we participate in the ASF as individuals and not representatives of our employer, and so generally won't comment on questions like this in this forum. Please refer to Cloudera's forums for questions about CDH release plans, etc. > > And just in case we're forced to make do without DECIMAL initially, is the > recommendation really to store as string and convert? I was thinking of > storing as int/long and dividing by 10 or 1000 as needed in an impala view > over the kudu table. Wouldn't a division be way more performant than a > conversion from string, especially when aggregating over thousands of > records in a report query? > You're right -- using an integer type and division by a power of 10 is going to be much faster than casting from a string. Division by a constant would be JITted by Impala into a pretty minimal sequence of assembly instructions (two bitshifts, an integer multiplication, and a subtraction) which likely take about 6 cycles total. In contrast, a cast from string to decimal probably takes many thousands of cycles. The only downside is that if you have end users using the data they might be confused by the integer representation whereas a string representation would be a little clearer. Thanks -Todd > > On Fri, Jan 5, 2018 at 11:13 AM, Todd Lipcon wrote: > >> Oh, one other piece of feedback: maybe worth editing the title to say "vs >> Apache Parquet" instead of "vs Apache Impala" since in all cases you are >> using Impala as the query engine? >> >> -Todd >> >> On Fri, Jan 5, 2018 at 11:06 AM, Todd Lipcon wrote: >> >>> Hey Boris, >>> >>> Thanks for publishing this. It's a great look at how an end user >>> evaluates Kudu. I appreciate that you cover both the pros and cons of the >>> technology, and glad to see that your conclusion leaves you excited about >>> Kudu :) >>> >>> One quick note is that I think you'll be even more pleased when you >>> upgrade to a later version (eg Kudu 1.5). We've improved performance in >>> several areas and also improved scalability compared to the version you're >>> testing. TIMESTAMP is also supported now, with DECIMAL soon to follow. It >>> might be worth noting this as an addendum to the blog post if you feel like >>> it. >>> >>> -Todd >>> >>> On Fri, Jan 5, 2018 at 10:51 AM, Boris Tyukin >>> wrote: >>> >>>> Hi guys, >>>> >>>> we just finished testing Kudu, mostly comparing Kudu to Impala on >>>> HDFS/parquet. I wanted to share my blog post and results. We used typical >>>> (and real) healthcare data for the test, not a synthetic data which I think >>>> makes it is a bit more interesting. >>>> >>>> I welcome any feedback! 
>>>> >>>> http://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/ >>>> >>>> We are really impressed with Kudu and I wanted to take an opportunity >>>> to thank Kudu developers for such an amazing and much-needed product. >>>> >>>> Boris >>>> >>>> >>>> >>> >>> >>> -- >>> Todd Lipcon >>> Software Engineer, Cloudera >>> >> >> >> >> -- >> Todd Lipcon >> Software Engineer, Cloudera >> > > > > -- > *MAURICIO ARISTIZABAL* > Architect - Business Intelligence + Data Science > mauri...@impactradius.com(m)+1 323 309 4260 <(323)%20309-4260> > 223 E. De La Guerra St. | Santa Barbara, CA 93101 > <https://maps.google.com/?q=223+E.+De+La+Guerra+St.+%7C+Santa+Barbara,+CA+93101&entry=gmail&source=g> > > Overview <http://www.impactradius.com/?src=slsap> | Twitter > <https://twitter.com/impactradius> | Facebook > <https://www.facebook.com/pages/Impact-Radius/153376411365183> | LinkedIn > <https://www.linkedin.com/company/impact-radius-inc-> > -- Todd Lipcon Software Engineer, Cloudera
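For the scaled-integer workaround Todd describes earlier in this thread (store the value multiplied by a power of ten in a BIGINT column and divide it back out, e.g. in an Impala view over the Kudu table), a minimal sketch of the client-side conversion is below. The column name, scale, and values are illustrative assumptions, not details from the thread:

    // Sketch: representing a decimal-like value as a scaled 64-bit integer.
    public class ScaledDecimal {
        static final int SCALE = 2;        // two digits after the decimal point
        static final long FACTOR = 100L;   // 10^SCALE

        // Writer side: convert 1234.56 into the long 123456 before inserting.
        // (A real pipeline would more likely scale with BigDecimal to avoid
        // floating-point error.)
        static long toScaled(double value) {
            return Math.round(value * FACTOR);
        }

        // Reader side: the same division a view would perform,
        // e.g. SELECT amount_scaled / 100.0 AS amount FROM ...
        static double fromScaled(long stored) {
            return stored / (double) FACTOR;
        }

        public static void main(String[] args) {
            long stored = toScaled(1234.56);
            System.out.println(stored + " -> " + fromScaled(stored));
        }
    }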
Re: new Kudu benchmarks
Oh, one other piece of feedback: maybe worth editing the title to say "vs Apache Parquet" instead of "vs Apache Impala" since in all cases you are using Impala as the query engine? -Todd On Fri, Jan 5, 2018 at 11:06 AM, Todd Lipcon wrote: > Hey Boris, > > Thanks for publishing this. It's a great look at how an end user evaluates > Kudu. I appreciate that you cover both the pros and cons of the technology, > and glad to see that your conclusion leaves you excited about Kudu :) > > One quick note is that I think you'll be even more pleased when you > upgrade to a later version (eg Kudu 1.5). We've improved performance in > several areas and also improved scalability compared to the version you're > testing. TIMESTAMP is also supported now, with DECIMAL soon to follow. It > might be worth noting this as an addendum to the blog post if you feel like > it. > > -Todd > > On Fri, Jan 5, 2018 at 10:51 AM, Boris Tyukin > wrote: > >> Hi guys, >> >> we just finished testing Kudu, mostly comparing Kudu to Impala on >> HDFS/parquet. I wanted to share my blog post and results. We used typical >> (and real) healthcare data for the test, not a synthetic data which I think >> makes it is a bit more interesting. >> >> I welcome any feedback! >> >> http://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/ >> >> We are really impressed with Kudu and I wanted to take an opportunity to >> thank Kudu developers for such an amazing and much-needed product. >> >> Boris >> >> >> > > > -- > Todd Lipcon > Software Engineer, Cloudera > -- Todd Lipcon Software Engineer, Cloudera
Re: new Kudu benchmarks
Hey Boris, Thanks for publishing this. It's a great look at how an end user evaluates Kudu. I appreciate that you cover both the pros and cons of the technology, and glad to see that your conclusion leaves you excited about Kudu :) One quick note is that I think you'll be even more pleased when you upgrade to a later version (eg Kudu 1.5). We've improved performance in several areas and also improved scalability compared to the version you're testing. TIMESTAMP is also supported now, with DECIMAL soon to follow. It might be worth noting this as an addendum to the blog post if you feel like it. -Todd On Fri, Jan 5, 2018 at 10:51 AM, Boris Tyukin wrote: > Hi guys, > > we just finished testing Kudu, mostly comparing Kudu to Impala on > HDFS/parquet. I wanted to share my blog post and results. We used typical > (and real) healthcare data for the test, not a synthetic data which I think > makes it is a bit more interesting. > > I welcome any feedback! > > http://boristyukin.com/benchmarking-apache-kudu-vs-apache-impala/ > > We are really impressed with Kudu and I wanted to take an opportunity to > thank Kudu developers for such an amazing and much-needed product. > > Boris > > > -- Todd Lipcon Software Engineer, Cloudera
Re: Data inconsistency after restart
>>> >>>>>> >>>>>> In general, you can use the `ksck` tool to check the health of your >>>>>> cluster. See https://kudu.apache.org/docs/command_line_tools_referenc >>>>>> e.html#cluster-ksck for more details. For restarting a cluster, I >>>>>> would recommend taking down all tablet servers at once, otherwise >>>>>> tablet >>>>>> replicas may try to replicate data from the server that was taken >>>>>> down. >>>>>> >>>>>> Hope this helped, >>>>>> Andrew >>>>>> >>>>>> On Tue, Dec 5, 2017 at 10:42 AM, Petter von Dolwitz (Hem) < >>>>>> petter.von.dolw...@gmail.com> wrote: >>>>>> >>>>>> Hi Kudu users, >>>>>>> >>>>>>> We just started to use Kudu (1.4.0+cdh5.12.1). To make a baseline for >>>>>>> evaluation we ingested 3 month worth of data. During ingestion we >>>>>>> were >>>>>>> facing messages from the maintenance threads that a soft memory >>>>>>> limit were >>>>>>> reached. It seems like the background maintenance threads stopped >>>>>>> performing their tasks at this point in time. It also seems like >>>>>>> the >>>>>>> memory was never recovered even after stopping ingestion so I guess >>>>>>> there >>>>>>> was a large backlog being built up. I guess the root cause here is >>>>>>> that we >>>>>>> were a bit too conservative when giving Kudu memory. After a >>>>>>> restart a >>>>>>> lot of maintenance tasks were started (i.e. compaction). >>>>>>> >>>>>>> When we verified that all data was inserted we found that some data >>>>>>> was missing. We added this missing data and on some chunks we got the >>>>>>> information that all rows were already present, i.e. Impala says >>>>>>> something >>>>>>> like Modified: 0 rows, nnn errors. Doing the verification again >>>>>>> now >>>>>>> shows that the Kudu table is complete. So, even though we did not >>>>>>> insert >>>>>>> any data on some chunks, a count(*) operation over these chunks now >>>>>>> returns >>>>>>> a different value. >>>>>>> >>>>>>> Now to my question. Will data be inconsistent if we recycle Kudu >>>>>>> after >>>>>>> seeing soft memory limit warnings? >>>>>>> >>>>>>> Is there a way to tell when it is safe to restart Kudu to avoid these >>>>>>> issues? Should we use any special procedure when restarting (e.g. >>>>>>> only >>>>>>> restart the tablet servers, only restart one tablet server at a time >>>>>>> or >>>>>>> something like that)? >>>>>>> >>>>>>> The table design uses 50 tablets per day (times 90 days). It is 8 TB >>>>>>> of data after 3xreplication over 5 tablet servers. >>>>>>> >>>>>>> Thanks, >>>>>>> Petter >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> Andrew Wong >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Andrew Wong >>>>> >>>>> >>>> >>>> >>> > > -- > David Alves > -- Todd Lipcon Software Engineer, Cloudera
[ANNOUNCE] New committers over past several months
Hi Kudu community, I'm pleased to announce that the Kudu PMC has voted to add Andrew Wong, Grant Henke, and Hao Hao as Kudu committers and PMC members. This announcement is a bit delayed, but I figured it's better late than never! Andrew has contributed to Kudu in a bunch of areas. Most notably, he authored a bunch of optimizations for predicate evaluation on the read path, and recently has led the effort to introduce better tolerance of disk failures within the tablet server. In addition to code, Andrew has been a big help with questions on the user mailing list, Slack, and elsewhere. Grant's contributions have spanned several areas. Notably, he made a bunch of improvements to our Java and Scala builds -- an area where others might be shy. He also implemented checksum verification for data blocks and has begun working on a design for DECIMAL, one of the most highly-requested features. Hao has also been contributing to Kudu for quite some time. Her notable contributions include improved fault tolerance for the Java client, fixes and optimizations on the Spark integration, and some important refactorings and performance optimizations in the block layer. Hao has also represented the community by giving talks about Kudu at a conference in China. Please join me in congratulating the new committers and PMC members! -Todd
Re: INT128 Column Support Interest
On Mon, Nov 20, 2017 at 1:12 PM, Grant Henke wrote: > Thank you for the feedback. Below are some responses. > > Do we have a compatible SQL type to map this to in Spark SQL, Impala, > > Presto, etc? What type would we map to in Java? > > > In Java we would Map to a BigInteger. Their isn't a perfectly natural > mapping for SQL that I know of. It has been mentioned in the past that we > could have server side flags to disable/enable the ability to create > columns of certain types to prevent users from creating tables that are not > readable by certain integrations. This problem exists today with the BINARY > column type. > I'm somewhat against such a configuration. This being a server-side configuration results in Kudu deployments in different environments having different sets of available types, which seems very difficult for downstream users to deal with. Even though "least common denominator" kind of sucks, it's also not a bad policy for software that aims to be part of a pretty diverse ecosystem. > > > Why not just _not_ expose it and only expose decimal. > > > Technically decimal only supports 28 9's where INT128 can support slightly > larger numbers. Their may also be more overhead dealing with a decimal > type. Though I am not positive about that. > I think without clear user demand for >28 digits it's just not worth the complexity. > > Encoders: like Dan mentioned, it seems like we might not be able to do a > > very efficient job of encoding these very large integers. Stuff like > > bitshuffle, SIMD bitpacking, etc, isn't really designed for such large > > values. So, I'm a little afraid that we'll end up only with PLAIN and > > people will be upset with the storage overhead and performance. > > > Aren't we going to need efficient encodings in order to make decimal work > > well, anyway? > > > We will need to ensure performant encoding exists for INT128 to make > decimals with a precisions >= 18 work well anyway. We should likely have > parity > with the other integer types to reduce any confusion about differing > precisions having different encoding considerations. Although Presto > documents that precision >= 18 are slower than the others. We could do > something similar and follow on with improvements. > > In the current int128 internal patch I know that the RLE doesn't work for > int128. I don't have a lot of background on Kudu's encoding details, so > investigating encodings further is one of my next steps. > That's a good point. However, I'm guessing that users are more likely to intuitively know that "9 digits is enough" more easily than they will know that "64 bits is enough". In my experience people underestimate the range of 64-bit integers and might choose INT128 if available even if they have no need for anywhere near that range. -Todd > > On Thu, Nov 16, 2017 at 5:30 PM, Dan Burkert > wrote: > > > Aren't we going to need efficient encodings in order to make decimal work > > well, anyway? > > > > - Dan > > > > On Thu, Nov 16, 2017 at 2:54 PM, Todd Lipcon wrote: > > > >> On Thu, Nov 16, 2017 at 2:28 PM, Dan Burkert > >> wrote: > >> > >> > I think it would be useful. As far as I've seen the main costs in > >> > carrying data types are in writing performant encoders, and updating > >> > integrations to work with them. I'm guessing with 128 bit integers > >> there > >> > would be some integrations that can't or won't support it, which might > >> be a > >> > cause for confusion. 
Overall, though, I think the upsides of > efficiency > >> > and decreased storage space are compelling. Do you have a sense yet > of > >> > what encodings are going to be supported down the road (will we get to > >> full > >> > parity with 32/64)? > >> > > >> > >> Yea, my concerns are: > >> > >> 1) Integrations: do we have a compatible SQL type to map this to in > Spark > >> SQL, Impala, Presto, etc? What type would we map to in Java? It seems > like > >> the most natural mapping would be DECIMAL(39) or somesuch in SQL. So, if > >> we're going to map it the same as decimal anyway, why not just _not_ > >> expose > >> it and only expose decimal? If someone wants to store a 128-bit hash as > a > >> DECIMAL(39) they are free to, of course. Postgres's built-in int types > >> only > >> go up to 64-bit (bigint) > >> > >> In addition to the choice of D
Re: Kudu start error : Failed to initialize sys tables async: on-disk and provided master lists are different
Hi Liou, It sounds like you might have first started the masters without specifying the --master_addresses setting, so they each initialized their own local storage separately with no knowledge of each other. That is to say, each master thinks it is its own cluster, and now is refusing to restart with a different configuration (it's not possible to "merge" clusters after initialization). To recover the situation, I suggest 'rm -rf /var/lib/kudu/master/*' on all three machines before restarting the service with the new configuration. Of course, please note that this will remove all data from your Kudu cluster, if you have any data loaded. If you are instead trying to migrate from a 1-master cluster to a 3-master cluster, you can find instructions to do so in the docs. -Todd On Fri, Nov 17, 2017 at 1:36 AM, Dylan wrote: > Hi: > > I am installing kudu in distributed mode. I wanna 3 > masters(10,11,12) and 3 tservers. > > I configured the master.gflagfile and only added > --master_addresses option. > > > > File content: > > > > > > # Do not modify these two lines. If you wish to change these variables, > > # modify them in /etc/default/kudu-master. > > --fromenv=rpc_bind_addresses > > --fromenv=log_dir > > > > --fs_wal_dir=/var/lib/kudu/master > > --fs_data_dirs=/var/lib/kudu/master > > > > --master_addresses=10.15.213.10:7051,10.15.213.11:7051,10.15.213.12:7051 > > > > > > When I started the kudu-master service, it reported an error. > > `Check failed: _s.ok() Bad status: Invalid argument: Unable to initialize > catalog manager: Failed to initialize sys tables async: on-disk and > provided master lists are different: 10.15.213.10:7051 10.15.213.11:7051 > 10.15.213.12:7051 :0` > > > > It was the same on the other 2 master machine. > > > > I have no idea what’s going on. Am I misunderstanding this configure > option? > > > > > > > > Best wishes. > > > > Liou Fongcyuan > -- Todd Lipcon Software Engineer, Cloudera
Re: INT128 Column Support Interest
On Thu, Nov 16, 2017 at 2:28 PM, Dan Burkert wrote: > I think it would be useful. As far as I've seen the main costs in > carrying data types are in writing performant encoders, and updating > integrations to work with them. I'm guessing with 128 bit integers there > would be some integrations that can't or won't support it, which might be a > cause for confusion. Overall, though, I think the upsides of efficiency > and decreased storage space are compelling. Do you have a sense yet of > what encodings are going to be supported down the road (will we get to full > parity with 32/64)? > Yea, my concerns are: 1) Integrations: do we have a compatible SQL type to map this to in Spark SQL, Impala, Presto, etc? What type would we map to in Java? It seems like the most natural mapping would be DECIMAL(39) or somesuch in SQL. So, if we're going to map it the same as decimal anyway, why not just _not_ expose it and only expose decimal? If someone wants to store a 128-bit hash as a DECIMAL(39) they are free to, of course. Postgres's built-in int types only go up to 64-bit (bigint) In addition to the choice of DECIMAL, for things like fixed-length binary maybe we are better off later adding a fixed-length BINARY type, like BINARY(16) which could be used for storing large hashes? There is precedent for fixed-length CHAR(n) in SQL, but no such precedent for int128. 2) Encoders: like Dan mentioned, it seems like we might not be able to do a very efficient job of encoding these very large integers. Stuff like bitshuffle, SIMD bitpacking, etc, isn't really designed for such large values. So, I'm a little afraid that we'll end up only with PLAIN and people will be upset with the storage overhead and performance. -Todd > > On Thu, Nov 16, 2017 at 2:19 PM, Grant Henke wrote: > >> Hi all, >> >> As a part of adding DECIMAL support to Kudu it was necessary to add >> internal support for 128 bit integers. Taking that one step further and >> supporting public columns and APIs for 128 bit integers would not be too >> much additional work. However, I wanted to gauge the interest from the >> community. >> >> My initial thoughts are that having an INT128 column type could be useful >> for things like UUIDs, IPv6 addresses, MD5 hashes and other similar types >> of data. >> >> Is there any interest or uses for a INT128 column type? Is anyone >> currently using a STRING or BINARY column for 128 bit data? >> >> Thank you, >> Grant >> -- >> Grant Henke >> Software Engineer | Cloudera >> gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke >> > > -- Todd Lipcon Software Engineer, Cloudera
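To make the "store a 128-bit hash as DECIMAL(39) or a fixed-length BINARY(16)" idea concrete, the sketch below packs a 128-bit value (a UUID, chosen only as an example) into a 16-byte array and views it as an unsigned java.math.BigInteger. It uses only the plain JDK; how such a value would map onto a Kudu column type is exactly the open question discussed above:

    import java.math.BigInteger;
    import java.nio.ByteBuffer;
    import java.util.UUID;

    // Sketch: round-tripping a 128-bit value through a 16-byte representation.
    public class Int128AsBytes {
        static byte[] toBytes(UUID id) {
            return ByteBuffer.allocate(16)
                    .putLong(id.getMostSignificantBits())
                    .putLong(id.getLeastSignificantBits())
                    .array();
        }

        public static void main(String[] args) {
            UUID id = UUID.randomUUID();
            byte[] packed = toBytes(id);                  // 16 bytes, big-endian
            BigInteger asInt = new BigInteger(1, packed); // unsigned 128-bit view
            System.out.println(id + " -> " + asInt);
        }
    }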
Re: The service queue is full; it has 400 items.. Retrying in the next heartbeat period.
0a5145f in kudu::consensus::RaftConsensus::NotifyTermChange(long) () #9 0x00a6ceac in kudu::consensus::PeerMessageQueue::NotifyObserversOfTermChangeTask(long) () which I also found in his stacks fairly often. To workaround this I suggested disabling the kernel_stack_watchdog using a hidden configuration. I'll look at a fix for this that we can backport to earlier versions as well. King -- please report back in a couple days whether the suggestions we talked about help resolve the problems on your cluster. Thanks -Todd On Thu, Nov 9, 2017 at 11:27 PM, Lee King wrote: > Hi, Todd > Our kudu cluster 's error/warning log just like the > https://issues.apache.org/jira/browse/KUDU-1078, and the issues's status > is reopen, I have upload log for analysis the issues, If you want to more > detail, just tell me 😄。 > log files: > https://drive.google.com/open?id=1_1l2xpT3-NmumgI_sIdxch-6BocXqTCt > https://drive.google.com/open?id=0B4-NyGFtYNboN3NYNW1pVWQwcFVLa083V > kRIUTZHRk85WHY4 > > > > > > > 2017-11-04 13:15 GMT+08:00 Todd Lipcon : > >> One thing you might try is to update the consensus rpc timeout to 30 >> seconds instead of 1. We changed the default in later versions. >> >> I'd also recommend updating up 1.4 or 1.5 for other related fixes to >> consensus stability. I think I recall you were on 1.3 still? >> >> Todd >> >> >> On Nov 3, 2017 7:47 PM, "Lee King" wrote: >> >> Hi, >> Our kudu cluster have ran well a long time, but write became slowly >> recently,client also come out rpc timeout. I check the warning and find >> vast error look this: >> W1104 10:25:16.833736 10271 consensus_peers.cc:365] T >> 149ffa58ac274c9ba8385ccfdc01ea14 P 59c768eb799243678ee7fa3f83801316 -> >> Peer 1c67a7e7ff8f4de494469766641fccd1 (cloud-sk-ds-08:7050): Couldn't >> send request to peer 1c67a7e7ff8f4de494469766641fccd1 for tablet >> 149ffa58ac274c9ba8385ccfdc01ea14. Status: Timed out: UpdateConsensus RPC >> to 10.6.60.9:7050 timed out after 1.000s (SENT). Retrying in the next >> heartbeat period. Already tried 5 times. >> I change the configure rpc_service_queue_le >> ngth=400,rpc_num_service_threads=40, but it takes no effect. >> Our cluster include 5 master , 10 ts. 3800G data, 800 tablet per ts. >> I check one of the ts machine's memory, 14G left(128 In all), thread >> 4739(max 32000), openfile 28000(max 65536), cpu disk utilization ratio >> about 30%(32 core), disk util less than 30%. >> Any suggestion for this? Thanks! >> >> >> > -- Todd Lipcon Software Engineer, Cloudera
Re: The service queue is full; it has 400 items.. Retrying in the next heartbeat period.
One thing you might try is to update the consensus rpc timeout to 30 seconds instead of 1. We changed the default in later versions. I'd also recommend updating to 1.4 or 1.5 for other related fixes to consensus stability. I think I recall you were on 1.3 still? Todd On Nov 3, 2017 7:47 PM, "Lee King" wrote: Hi, Our kudu cluster have ran well a long time, but write became slowly recently,client also come out rpc timeout. I check the warning and find vast error look this: W1104 10:25:16.833736 10271 consensus_peers.cc:365] T 149ffa58ac274c9ba8385ccfdc01ea14 P 59c768eb799243678ee7fa3f83801316 -> Peer 1c67a7e7ff8f4de494469766641fccd1 (cloud-sk-ds-08:7050): Couldn't send request to peer 1c67a7e7ff8f4de494469766641fccd1 for tablet 149ffa58ac274c9ba8385ccfdc01ea14. Status: Timed out: UpdateConsensus RPC to 10.6.60.9:7050 timed out after 1.000s (SENT). Retrying in the next heartbeat period. Already tried 5 times. I change the configure rpc_service_queue_length=400,rpc_num_service_threads=40, but it takes no effect. Our cluster include 5 master , 10 ts. 3800G data, 800 tablet per ts. I check one of the ts machine's memory, 14G left(128 In all), thread 4739(max 32000), openfile 28000(max 65536), cpu disk utilization ratio about 30%(32 core), disk util less than 30%. Any suggestion for this? Thanks!
Re: Error message: 'Tried to update clock beyond the max. error.'
Thanks Franco. I filed https://issues.apache.org/jira/browse/KUDU-2209 and put up a patch. I'm also going to work on a change to try to allow Kudu to ride over brief interruptions in ntp synchronization status. Hopefully this will help folks who have some issues with occasional ntp instability. -Todd On Wed, Nov 1, 2017 at 6:31 PM, Franco Venturi wrote: > > From 'tablet_bootstrap.cc': > > 1030 14:29:37.324306 60682 tablet_bootstrap.cc:884] Check failed: > _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max. > error. > > Franco > > > -- > *From: *"Todd Lipcon" > *To: *user@kudu.apache.org > *Sent: *Wednesday, November 1, 2017 8:00:09 PM > *Subject: *Re: Error message: 'Tried to update clock beyond the max. > error.' > > > What's the full log line where you're seeing this crash? Is it coming from > tablet_bootstrap.cc, raft_consensus.cc, or elsewhere? > > -Todd > > 2017-11-01 15:45 GMT-07:00 Franco Venturi : > >> Our version is kudu 1.5.0-cdh5.13.0. >> >> Franco >> >> >> >> >> > > > -- > Todd Lipcon > Software Engineer, Cloudera > > -- Todd Lipcon Software Engineer, Cloudera
Re: Error message: 'Tried to update clock beyond the max. error.'
Actually I think I understand the root cause of this. I think at some point NTP can switch the clock from a microseconds-based mode to a nanoseconds-based mode, at which point Kudu starts interpreting the results of the ntp_gettime system call incorrectly, resulting in incorrect error estimates and even time values up to 1000 seconds in the future (we read 1 billion nanoseconds as 1 billion microseconds (=1000 seconds)). I'll work on reproducing this and a patch, to backport to previous versions. -Todd On Wed, Nov 1, 2017 at 5:00 PM, Todd Lipcon wrote: > What's the full log line where you're seeing this crash? Is it coming from > tablet_bootstrap.cc, raft_consensus.cc, or elsewhere? > > -Todd > > 2017-11-01 15:45 GMT-07:00 Franco Venturi : > >> Our version is kudu 1.5.0-cdh5.13.0. >> >> Franco >> >> >> >> >> > > > -- > Todd Lipcon > Software Engineer, Cloudera > -- Todd Lipcon Software Engineer, Cloudera
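The "1000 seconds in the future" figure follows directly from the unit mix-up: a nanosecond reading treated as microseconds is inflated by a factor of 1000. A trivial illustration in plain Java (not the actual Kudu clock code):

    // Illustration of the unit mix-up described above (not Kudu code).
    public class ClockUnitsMixup {
        public static void main(String[] args) {
            long reading = 1_000_000_000L;           // one second, in nanoseconds
            double correctSeconds = reading / 1e9;   // 1 s when read as nanoseconds
            double misreadSeconds = reading / 1e6;   // 1000 s when read as microseconds
            System.out.println(correctSeconds + " s vs " + misreadSeconds + " s");
        }
    }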
Re: Error message: 'Tried to update clock beyond the max. error.'
What's the full log line where you're seeing this crash? Is it coming from tablet_bootstrap.cc, raft_consensus.cc, or elsewhere? -Todd 2017-11-01 15:45 GMT-07:00 Franco Venturi : > Our version is kudu 1.5.0-cdh5.13.0. > > Franco > > > > > -- Todd Lipcon Software Engineer, Cloudera
Re: Low ingestion rate from Kafka
On Wed, Nov 1, 2017 at 2:10 PM, Chao Sun wrote: > > Great. Keep in mind that, since you have a UUID component at the front > of your key, you are doing something like a random-write workload. So, as > your data grows, if your PK column (and its bloom filters) ends up being > larger than the available RAM for caching, each write may generate a disk > seek which will make throughput plummet. This is unlike some other storage > options like HBase which does "blind puts". > > > Just something to be aware of, for performance planning. > > Thanks for letting me know. I'll keep a note. > > > I think in 1.3 it was called "kudu test loadgen" and may have fewer > options available. > > Cool. I just run it on one of the TS node ('kudu test loadgen > --num-threads=8 --num-rows-per-thread=100 --table-num-buckets=32'), and > got the following: > > Generator report > time total : 5434.15 ms > time per row: 0.000679268 ms > > ~1.5M / sec? looks good. > yep, sounds about right. My machines I was running on are relatively old spec CPUs and also somewhat overloaded (it's a torture-test cluster of sorts that is always way out of balance, re-replicating stuff, etc) -Todd > > > > On Wed, Nov 1, 2017 at 1:40 PM, Todd Lipcon wrote: > >> On Wed, Nov 1, 2017 at 1:23 PM, Chao Sun wrote: >> >>> Thanks Todd! I improved my code to use multi Kudu clients for processing >>> the Kafka messages and >>> was able to improve the number to 250K - 300K per sec. Pretty happy with >>> this now. >>> >> >> Great. Keep in mind that, since you have a UUID component at the front of >> your key, you are doing something like a random-write workload. So, as your >> data grows, if your PK column (and its bloom filters) ends up being larger >> than the available RAM for caching, each write may generate a disk seek >> which will make throughput plummet. This is unlike some other storage >> options like HBase which does "blind puts". >> >> Just something to be aware of, for performance planning. >> >> >>> >>> Will take a look at the perf tool - looks very nice. It seems it is not >>> available on Kudu 1.3 though. >>> >>> >> I think in 1.3 it was called "kudu test loadgen" and may have fewer >> options available. >> >> -Todd >> >> On Wed, Nov 1, 2017 at 12:23 AM, Todd Lipcon wrote: >>> >>>> On Wed, Nov 1, 2017 at 12:20 AM, Todd Lipcon wrote: >>>> >>>>> Sounds good. >>>>> >>>>> BTW, you can try a quick load test using the 'kudu perf loadgen' >>>>> tool. For example something like: >>>>> >>>>> kudu perf loadgen my-kudu-master.example.com --num-threads=8 >>>>> --num-rows-per-thread=100 --table-num-buckets=32 >>>>> >>>>> There are also a bunch of options to tune buffer sizes, flush options, >>>>> etc. But with the default settings above on an 8-node cluster I have, I >>>>> was >>>>> able to insert 8M rows in 44 seconds (180k/sec). >>>>> >>>>> Adding --buffer-size-bytes=1000 almost doubled the above >>>>> throughput (330k rows/sec) >>>>> >>>> >>>> One more quick datapoint: I ran the above command simultaneously (in >>>> parallel) four times. Despite running 4x as many clients, they all >>>> finished in the same time as a single client did (ie aggregate throughput >>>> ~1.2M rows/sec). >>>> >>>> Again this isn't a scientific benchmark, and it's such a short burst of >>>> activity that it doesn't represent a real workload, but 15k rows/sec is >>>> definitely at least an order of magnitude lower than the peak throughput I >>>> would expect. 
>>>> >>>> -Todd >>>> >>>> >>>>> >>>>> -Todd >>>>> >>>>> >>>>> >>>>>> On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon >>>>>> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun wrote: >>>>>>> >>>>>>>> Thanks Zhen and Todd. >>>>>>>> >>>>>>>> Yes increasing the # of consumers will definitely help, but we also >>>>>>>> want to test the b
Re: Kudu background tasks
Hi Janne, It's not clear whether the issue was that it was taking a long time to restart (i.e. replaying WALs) or if somehow you also ended up having to re-replicate a bunch of tablets from host to host in the cluster. There were some bugs in earlier versions of Kudu (eg KUDU-2125, KUDU-2020) which could make this process rather slow to stabilize. If this issue happens again, running 'kudu cluster ksck' during the unstable period can often yield more information to help understand what is happening. What version are you running? Todd On Wed, Nov 1, 2017 at 1:16 AM, Janne Keskitalo wrote: > Hi > > Our Kudu test environment got unresponsive yesterday for unknown reason. > It has three tablet servers and one master. It's running in AWS on quite > small host machines, so maybe some node ran out of memory or something. It > has happened before with this setup. Anyway, after we restarted kudu > service, we couldn't do any selects. From the tablet server UI I could see > it was initializing and bootstrapping tablets. It took many hours until all > tablets were in RUNNING-state. > > My question is where can I find information about these background > operations? I want to understand what happens in situations when some node > is offline and then comes back up after a while. What is tablet > initialization and bootstrapping, etc. > > -- > Br. > Janne Keskitalo, > Database Architect, PAF.COM > For support: dbdsupp...@paf.com > > -- Todd Lipcon Software Engineer, Cloudera
Re: Low ingestion rate from Kafka
On Wed, Nov 1, 2017 at 1:23 PM, Chao Sun wrote: > Thanks Todd! I improved my code to use multi Kudu clients for processing > the Kafka messages and > was able to improve the number to 250K - 300K per sec. Pretty happy with > this now. > Great. Keep in mind that, since you have a UUID component at the front of your key, you are doing something like a random-write workload. So, as your data grows, if your PK column (and its bloom filters) ends up being larger than the available RAM for caching, each write may generate a disk seek which will make throughput plummet. This is unlike some other storage options like HBase which does "blind puts". Just something to be aware of, for performance planning. > > Will take a look at the perf tool - looks very nice. It seems it is not > available on Kudu 1.3 though. > > I think in 1.3 it was called "kudu test loadgen" and may have fewer options available. -Todd On Wed, Nov 1, 2017 at 12:23 AM, Todd Lipcon wrote: > >> On Wed, Nov 1, 2017 at 12:20 AM, Todd Lipcon wrote: >> >>> Sounds good. >>> >>> BTW, you can try a quick load test using the 'kudu perf loadgen' tool. >>> For example something like: >>> >>> kudu perf loadgen my-kudu-master.example.com --num-threads=8 >>> --num-rows-per-thread=100 --table-num-buckets=32 >>> >>> There are also a bunch of options to tune buffer sizes, flush options, >>> etc. But with the default settings above on an 8-node cluster I have, I was >>> able to insert 8M rows in 44 seconds (180k/sec). >>> >>> Adding --buffer-size-bytes=1000 almost doubled the above throughput >>> (330k rows/sec) >>> >> >> One more quick datapoint: I ran the above command simultaneously (in >> parallel) four times. Despite running 4x as many clients, they all >> finished in the same time as a single client did (ie aggregate throughput >> ~1.2M rows/sec). >> >> Again this isn't a scientific benchmark, and it's such a short burst of >> activity that it doesn't represent a real workload, but 15k rows/sec is >> definitely at least an order of magnitude lower than the peak throughput I >> would expect. >> >> -Todd >> >> >>> >>> -Todd >>> >>> >>> >>>> On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon >>>> wrote: >>>> >>>>> >>>>> >>>>> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun wrote: >>>>> >>>>>> Thanks Zhen and Todd. >>>>>> >>>>>> Yes increasing the # of consumers will definitely help, but we also >>>>>> want to test the best throughput we can get from Kudu. >>>>>> >>>>> >>>>> Sure, but increasing the number of consumers can increase the >>>>> throughput (without increasing the number of Kudu tablet servers). >>>>> >>>>> Currently, if you run 'top' on the TS nodes, do you see them using a >>>>> high amount of CPU? Similar question for 'iostat -dxm 1' - high IO >>>>> utilization? My guess is that at 15k/sec you are hardly utilizing the >>>>> nodes, and you're mostly bound by round trip latencies, etc. >>>>> >>>>> >>>>>> >>>>>> I think the default batch size is 1000 rows? >>>>>> >>>>> >>>>> In manual flush mode, it's up to you to determine how big your batches >>>>> are. It will buffer until you call 'Flush()'. So you could wait until >>>>> you've accumulated way more than 1000 to flush. >>>>> >>>>> >>>>>> I tested with a few different options between 1000 and 20, but >>>>>> always got some number between 15K to 20K per sec. Also tried flush >>>>>> background mode and 32 hash partitions but results are similar. >>>>>> >>>>> >>>>> In your AUTO_FLUSH test, were you still calling Flush()? 
>>>>> >>>>> >>>>>> The primary key is UUID + some string column though - they always >>>>>> come in batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2, >>>>>> etc. >>>>>> >>>>> >>>>> Given this, are you hash-partitioning on just the UUID portion of the >>>>> PK? ie if your PK is (uuid, timestamp), you could hash-partitition on the >>>>&
Re: Low ingestion rate from Kafka
On Wed, Nov 1, 2017 at 12:20 AM, Todd Lipcon wrote: > Sounds good. > > BTW, you can try a quick load test using the 'kudu perf loadgen' tool. > For example something like: > > kudu perf loadgen my-kudu-master.example.com --num-threads=8 > --num-rows-per-thread=100 --table-num-buckets=32 > > There are also a bunch of options to tune buffer sizes, flush options, > etc. But with the default settings above on an 8-node cluster I have, I was > able to insert 8M rows in 44 seconds (180k/sec). > > Adding --buffer-size-bytes=1000 almost doubled the above throughput > (330k rows/sec) > One more quick datapoint: I ran the above command simultaneously (in parallel) four times. Despite running 4x as many clients, they all finished in the same time as a single client did (ie aggregate throughput ~1.2M rows/sec). Again this isn't a scientific benchmark, and it's such a short burst of activity that it doesn't represent a real workload, but 15k rows/sec is definitely at least an order of magnitude lower than the peak throughput I would expect. -Todd > > -Todd > > > >> On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon wrote: >> >>> >>> >>> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun wrote: >>> >>>> Thanks Zhen and Todd. >>>> >>>> Yes increasing the # of consumers will definitely help, but we also >>>> want to test the best throughput we can get from Kudu. >>>> >>> >>> Sure, but increasing the number of consumers can increase the throughput >>> (without increasing the number of Kudu tablet servers). >>> >>> Currently, if you run 'top' on the TS nodes, do you see them using a >>> high amount of CPU? Similar question for 'iostat -dxm 1' - high IO >>> utilization? My guess is that at 15k/sec you are hardly utilizing the >>> nodes, and you're mostly bound by round trip latencies, etc. >>> >>> >>>> >>>> I think the default batch size is 1000 rows? >>>> >>> >>> In manual flush mode, it's up to you to determine how big your batches >>> are. It will buffer until you call 'Flush()'. So you could wait until >>> you've accumulated way more than 1000 to flush. >>> >>> >>>> I tested with a few different options between 1000 and 20, but >>>> always got some number between 15K to 20K per sec. Also tried flush >>>> background mode and 32 hash partitions but results are similar. >>>> >>> >>> In your AUTO_FLUSH test, were you still calling Flush()? >>> >>> >>>> The primary key is UUID + some string column though - they always come >>>> in batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2, etc. >>>> >>> >>> Given this, are you hash-partitioning on just the UUID portion of the >>> PK? ie if your PK is (uuid, timestamp), you could hash-partitition on the >>> UUID. This should ensure that you get pretty good batching of the writes. >>> >>> Todd >>> >>> >>>> On Tue, Oct 31, 2017 at 6:25 PM, Todd Lipcon wrote: >>>> >>>>> In addition to what Zhen suggests, I'm also curious how you are sizing >>>>> your batches in manual-flush mode? With 128 hash partitions, each batch is >>>>> generating 128 RPCs, so if for example you are only batching 1000 rows at >>>>> a >>>>> time, you'll end up with a lot of fixed overhead in each RPC to insert >>>>> just >>>>> 1000/128 = ~8 rows. >>>>> >>>>> Generally I would expect an 8 node cluster (even with HDDs) to be able >>>>> to sustain several hundred thousand rows/second insert rate. Of course, it >>>>> depends on the size of the rows and also the primary key you've chosen. 
If >>>>> your primary key is generally increasing (such as the kafka sequence >>>>> number) then you should have very little compaction and good performance. >>>>> >>>>> -Todd >>>>> >>>>> On Tue, Oct 31, 2017 at 6:20 PM, Zhen Zhang wrote: >>>>> >>>>>> Maybe you can add your consumer number? In my opinion, more threads >>>>>> to insert can give a better throughput. >>>>>> >>>>>> 2017-10-31 15:07 GMT+08:00 Chao Sun : >>>>>> >>>>>>> OK. Thanks! I changed to manual flush mode and it
Re: Low ingestion rate from Kafka
On Tue, Oct 31, 2017 at 11:56 PM, Chao Sun wrote: > > Sure, but increasing the number of consumers can increase the throughput > (without increasing the number of Kudu tablet servers). > > I see. Make sense. I'll test that later. > > > Currently, if you run 'top' on the TS nodes, do you see them using a > high amount of CPU? Similar question for 'iostat -dxm 1' - high IO > utilization? My guess is that at 15k/sec you are hardly utilizing the > nodes, and you're mostly bound by round trip latencies, etc. > > From the top and iostat commands, the TS nodes seem pretty under-utilized. > CPU usage is less than 10%. > > > In manual flush mode, it's up to you to determine how big your batches > are. It will buffer until you call 'Flush()'. So you could wait until > you've accumulated way more than 1000 to flush. > > Got it. I meant the default buffer size is 1000 - found out that I need to > bump this up in order to bypass "buffer is too big" error. > > > In your AUTO_FLUSH test, were you still calling Flush()? > > Yes. > OK, in that case, the "Flush()" call is still a synchronous flush. So you may want to only call Flush() infrequently. > > > Given this, are you hash-partitioning on just the UUID portion of the > PK? ie if your PK is (uuid, timestamp), you could hash-partitition on the > UUID. This should ensure that you get pretty good batching of the writes. > > Yes, I only hash-partitioned on the UUID portion. > Sounds good. BTW, you can try a quick load test using the 'kudu perf loadgen' tool. For example something like: kudu perf loadgen my-kudu-master.example.com --num-threads=8 --num-rows-per-thread=100 --table-num-buckets=32 There are also a bunch of options to tune buffer sizes, flush options, etc. But with the default settings above on an 8-node cluster I have, I was able to insert 8M rows in 44 seconds (180k/sec). Adding --buffer-size-bytes=1000 almost doubled the above throughput (330k rows/sec) -Todd > On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon wrote: > >> >> >> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun wrote: >> >>> Thanks Zhen and Todd. >>> >>> Yes increasing the # of consumers will definitely help, but we also want >>> to test the best throughput we can get from Kudu. >>> >> >> Sure, but increasing the number of consumers can increase the throughput >> (without increasing the number of Kudu tablet servers). >> >> Currently, if you run 'top' on the TS nodes, do you see them using a high >> amount of CPU? Similar question for 'iostat -dxm 1' - high IO utilization? >> My guess is that at 15k/sec you are hardly utilizing the nodes, and you're >> mostly bound by round trip latencies, etc. >> >> >>> >>> I think the default batch size is 1000 rows? >>> >> >> In manual flush mode, it's up to you to determine how big your batches >> are. It will buffer until you call 'Flush()'. So you could wait until >> you've accumulated way more than 1000 to flush. >> >> >>> I tested with a few different options between 1000 and 20, but >>> always got some number between 15K to 20K per sec. Also tried flush >>> background mode and 32 hash partitions but results are similar. >>> >> >> In your AUTO_FLUSH test, were you still calling Flush()? >> >> >>> The primary key is UUID + some string column though - they always come >>> in batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2, etc. >>> >> >> Given this, are you hash-partitioning on just the UUID portion of the PK? >> ie if your PK is (uuid, timestamp), you could hash-partitition on the UUID. 
>> This should ensure that you get pretty good batching of the writes. >> >> Todd >> >> >>> On Tue, Oct 31, 2017 at 6:25 PM, Todd Lipcon wrote: >>> >>>> In addition to what Zhen suggests, I'm also curious how you are sizing >>>> your batches in manual-flush mode? With 128 hash partitions, each batch is >>>> generating 128 RPCs, so if for example you are only batching 1000 rows at a >>>> time, you'll end up with a lot of fixed overhead in each RPC to insert just >>>> 1000/128 = ~8 rows. >>>> >>>> Generally I would expect an 8 node cluster (even with HDDs) to be able >>>> to sustain several hundred thousand rows/second insert rate. Of course, it >>>> depends on the size of the rows and also the primary key you've chosen. If >>>>
Re: Low ingestion rate from Kafka
On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun wrote: > Thanks Zhen and Todd. > > Yes increasing the # of consumers will definitely help, but we also want > to test the best throughput we can get from Kudu. > Sure, but increasing the number of consumers can increase the throughput (without increasing the number of Kudu tablet servers). Currently, if you run 'top' on the TS nodes, do you see them using a high amount of CPU? Similar question for 'iostat -dxm 1' - high IO utilization? My guess is that at 15k/sec you are hardly utilizing the nodes, and you're mostly bound by round trip latencies, etc. > > I think the default batch size is 1000 rows? > In manual flush mode, it's up to you to determine how big your batches are. It will buffer until you call 'Flush()'. So you could wait until you've accumulated way more than 1000 to flush. > I tested with a few different options between 1000 and 20, but always > got some number between 15K to 20K per sec. Also tried flush background > mode and 32 hash partitions but results are similar. > In your AUTO_FLUSH test, were you still calling Flush()? > The primary key is UUID + some string column though - they always come in > batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2, etc. > Given this, are you hash-partitioning on just the UUID portion of the PK? ie if your PK is (uuid, timestamp), you could hash-partitition on the UUID. This should ensure that you get pretty good batching of the writes. Todd > On Tue, Oct 31, 2017 at 6:25 PM, Todd Lipcon wrote: > >> In addition to what Zhen suggests, I'm also curious how you are sizing >> your batches in manual-flush mode? With 128 hash partitions, each batch is >> generating 128 RPCs, so if for example you are only batching 1000 rows at a >> time, you'll end up with a lot of fixed overhead in each RPC to insert just >> 1000/128 = ~8 rows. >> >> Generally I would expect an 8 node cluster (even with HDDs) to be able to >> sustain several hundred thousand rows/second insert rate. Of course, it >> depends on the size of the rows and also the primary key you've chosen. If >> your primary key is generally increasing (such as the kafka sequence >> number) then you should have very little compaction and good performance. >> >> -Todd >> >> On Tue, Oct 31, 2017 at 6:20 PM, Zhen Zhang wrote: >> >>> Maybe you can add your consumer number? In my opinion, more threads to >>> insert can give a better throughput. >>> >>> 2017-10-31 15:07 GMT+08:00 Chao Sun : >>> >>>> OK. Thanks! I changed to manual flush mode and it increased to ~15K / >>>> sec. :) >>>> >>>> Is there any other tuning I can do to further improve this? and also, >>>> how much would >>>> SSD help in this case (only upsert)? >>>> >>>> Thanks again, >>>> Chao >>>> >>>> On Mon, Oct 30, 2017 at 11:42 PM, Todd Lipcon >>>> wrote: >>>> >>>>> If you want to manage batching yourself you can use the manual flush >>>>> mode. Easiest would be the auto flush background mode. >>>>> >>>>> Todd >>>>> >>>>> On Oct 30, 2017 11:10 PM, "Chao Sun" wrote: >>>>> >>>>>> Hi Todd, >>>>>> >>>>>> Thanks for the reply! I used a single Kafka consumer to pull the data. >>>>>> For Kudu, I was doing something very simple that basically just >>>>>> follow the example here >>>>>> <https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java> >>>>>> . 
>>>>>> In specific: >>>>>> >>>>>> loop { >>>>>> Insert insert = kuduTable.newInsert(); >>>>>> PartialRow row = insert.getRow(); >>>>>> // fill the columns >>>>>> kuduSession.apply(insert) >>>>>> } >>>>>> >>>>>> I didn't specify the flushing mode, so it will pick up the >>>>>> AUTO_FLUSH_SYNC as default? >>>>>> should I use MANUAL_FLUSH? >>>>>> >>>>>> Thanks, >>>>>> Chao >>>>>> >>>>>> On Mon, Oct 30, 2017 at 10:39 PM, Todd Lipcon >>>>>> wrote: >>>>>> >>>>>>> Hey Chao, >>>>>>> >>>>>>> Nice to hear you are checking
Re: Low ingestion rate from Kafka
In addition to what Zhen suggests, I'm also curious how you are sizing your batches in manual-flush mode? With 128 hash partitions, each batch is generating 128 RPCs, so if for example you are only batching 1000 rows at a time, you'll end up with a lot of fixed overhead in each RPC to insert just 1000/128 = ~8 rows. Generally I would expect an 8 node cluster (even with HDDs) to be able to sustain several hundred thousand rows/second insert rate. Of course, it depends on the size of the rows and also the primary key you've chosen. If your primary key is generally increasing (such as the kafka sequence number) then you should have very little compaction and good performance. -Todd On Tue, Oct 31, 2017 at 6:20 PM, Zhen Zhang wrote: > Maybe you can add your consumer number? In my opinion, more threads to > insert can give a better throughput. > > 2017-10-31 15:07 GMT+08:00 Chao Sun : > >> OK. Thanks! I changed to manual flush mode and it increased to ~15K / >> sec. :) >> >> Is there any other tuning I can do to further improve this? and also, how >> much would >> SSD help in this case (only upsert)? >> >> Thanks again, >> Chao >> >> On Mon, Oct 30, 2017 at 11:42 PM, Todd Lipcon wrote: >> >>> If you want to manage batching yourself you can use the manual flush >>> mode. Easiest would be the auto flush background mode. >>> >>> Todd >>> >>> On Oct 30, 2017 11:10 PM, "Chao Sun" wrote: >>> >>>> Hi Todd, >>>> >>>> Thanks for the reply! I used a single Kafka consumer to pull the data. >>>> For Kudu, I was doing something very simple that basically just follow >>>> the example here >>>> <https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java> >>>> . >>>> In specific: >>>> >>>> loop { >>>> Insert insert = kuduTable.newInsert(); >>>> PartialRow row = insert.getRow(); >>>> // fill the columns >>>> kuduSession.apply(insert) >>>> } >>>> >>>> I didn't specify the flushing mode, so it will pick up the >>>> AUTO_FLUSH_SYNC as default? >>>> should I use MANUAL_FLUSH? >>>> >>>> Thanks, >>>> Chao >>>> >>>> On Mon, Oct 30, 2017 at 10:39 PM, Todd Lipcon >>>> wrote: >>>> >>>>> Hey Chao, >>>>> >>>>> Nice to hear you are checking out Kudu. >>>>> >>>>> What are you using to consume from Kafka and write to Kudu? Is it >>>>> possible that it is Java code and you are using the SYNC flush mode? That >>>>> would result in a separate round trip for each record and thus very low >>>>> throughput. >>>>> >>>>> Todd >>>>> >>>>> On Oct 30, 2017 10:23 PM, "Chao Sun" wrote: >>>>> >>>>> Hi, >>>>> >>>>> We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision >>>>> af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on a 8-node cluster. >>>>> The data are coming from Kafka at a rate of around 30K / sec, and hash >>>>> partitioned into 128 buckets. However, with default settings, Kudu can >>>>> only >>>>> consume the topics at a rate of around 1.5K / second. This is a direct >>>>> ingest with no transformation on the data. >>>>> >>>>> Could this because I was using the default configurations? also we are >>>>> using Kudu on HDD - could that also be related? >>>>> >>>>> Any help would be appreciated. Thanks. >>>>> >>>>> Best, >>>>> Chao >>>>> >>>>> >>>>> >>>> >> > -- Todd Lipcon Software Engineer, Cloudera
Re: Low ingestion rate from Kafka
If you want to manage batching yourself you can use the manual flush mode. Easiest would be the auto flush background mode. Todd On Oct 30, 2017 11:10 PM, "Chao Sun" wrote: > Hi Todd, > > Thanks for the reply! I used a single Kafka consumer to pull the data. > For Kudu, I was doing something very simple that basically just follow the > example here > <https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java> > . > In specific: > > loop { > Insert insert = kuduTable.newInsert(); > PartialRow row = insert.getRow(); > // fill the columns > kuduSession.apply(insert) > } > > I didn't specify the flushing mode, so it will pick up the AUTO_FLUSH_SYNC > as default? > should I use MANUAL_FLUSH? > > Thanks, > Chao > > On Mon, Oct 30, 2017 at 10:39 PM, Todd Lipcon wrote: > >> Hey Chao, >> >> Nice to hear you are checking out Kudu. >> >> What are you using to consume from Kafka and write to Kudu? Is it >> possible that it is Java code and you are using the SYNC flush mode? That >> would result in a separate round trip for each record and thus very low >> throughput. >> >> Todd >> >> On Oct 30, 2017 10:23 PM, "Chao Sun" wrote: >> >> Hi, >> >> We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision >> af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on a 8-node cluster. >> The data are coming from Kafka at a rate of around 30K / sec, and hash >> partitioned into 128 buckets. However, with default settings, Kudu can only >> consume the topics at a rate of around 1.5K / second. This is a direct >> ingest with no transformation on the data. >> >> Could this because I was using the default configurations? also we are >> using Kudu on HDD - could that also be related? >> >> Any help would be appreciated. Thanks. >> >> Best, >> Chao >> >> >> >
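For readers wondering what "manage batching yourself" looks like in practice, below is a minimal sketch of a consumer loop using the Java client's MANUAL_FLUSH mode, flushing every few thousand operations. The master address, table name, column names, buffer size, and batch size are all made-up placeholders, and real code would also handle retries and per-row errors more carefully:

    import org.apache.kudu.client.*;

    // Sketch: batching inserts with MANUAL_FLUSH in the Java client.
    public class ManualFlushSketch {
        public static void main(String[] args) throws Exception {
            KuduClient client =
                new KuduClient.KuduClientBuilder("kudu-master:7051").build();
            KuduTable table = client.openTable("events");   // hypothetical table
            KuduSession session = client.newSession();
            session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
            session.setMutationBufferSpace(10000);           // ops the buffer may hold

            int batchSize = 5000;                            // made-up batch size
            for (int i = 0; i < 100000; i++) {
                Insert insert = table.newInsert();
                PartialRow row = insert.getRow();
                row.addLong("id", i);                        // hypothetical columns
                row.addString("payload", "value-" + i);
                session.apply(insert);                       // buffered, no RPC yet
                if ((i + 1) % batchSize == 0) {
                    // One round of RPCs per tablet touched since the last flush.
                    for (OperationResponse resp : session.flush()) {
                        if (resp.hasRowError()) {
                            System.err.println(resp.getRowError());
                        }
                    }
                }
            }
            session.flush();                                 // flush the tail
            session.close();
            client.shutdown();
        }
    }

The trade-off is the one discussed in this thread: larger batches mean fewer RPCs per row, at the cost of more memory buffered in the client and higher latency for each flush.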
Re: Low ingestion rate from Kafka
Hey Chao, Nice to hear you are checking out Kudu. What are you using to consume from Kafka and write to Kudu? Is it possible that it is Java code and you are using the SYNC flush mode? That would result in a separate round trip for each record and thus very low throughput. Todd On Oct 30, 2017 10:23 PM, "Chao Sun" wrote: Hi, We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on a 8-node cluster. The data are coming from Kafka at a rate of around 30K / sec, and hash partitioned into 128 buckets. However, with default settings, Kudu can only consume the topics at a rate of around 1.5K / second. This is a direct ingest with no transformation on the data. Could this because I was using the default configurations? also we are using Kudu on HDD - could that also be related? Any help would be appreciated. Thanks. Best, Chao
Re: 答复: 答复: How kudu synchronize real-time records?
What Helifu said is correct that writes are funneled through the leader. Reads can either be through the leader (which can perform immediately with full consistency) or at a follower. On a follower, the client can choose between the following: a) low consistency: read whatever the follower happens to have. Currently this mode is called READ_LATEST in the source but it should probably be called READ_ANYTHING or READ_INCONSISTENT. It reads "the latest thing that this replica has". b) snapshot consistency at current time: this may cause the follower to wait until it has heard from the leader and knows that it is up-to-date as of the time that the scan started. This gives the same guarantee as reading from the leader but can add some latency c) snapshot consistency in the past: given a timestamp, the follower can know whether it is up-to-date as of that timestamp. If so, it can do a consistent read immediately. Otherwise, it will have to wait, as above. You can learn more about this in the recent blog post authored by David Alves at: https://kudu.apache.org/2017/09/18/kudu-consistency-pt1.html Also please check out the docs at: https://kudu.apache.org/docs/transaction_semantics.html Hope that helps -Todd On Thu, Oct 26, 2017 at 3:18 AM, helifu wrote: > Sorry for my mistake. > The copy replica could be read by clients with below API in client.h: > > Status SetSelection(KuduClient::ReplicaSelection selection) > WARN_UNUSED_RESULT; > > enum ReplicaSelection { > LEADER_ONLY, ///< Select the LEADER replica. > > CLOSEST_REPLICA, ///< Select the closest replica to the > client, > ///< or a random one if all replicas are > equidistant. > > FIRST_REPLICA ///< Select the first replica in the list. > }; > > > 何李夫 > 2017-04-10 16:06:24 > > -邮件原件- > 发件人: user-return-1102-hzhelifu=corp.netease@kudu.apache.org [mailto: > user-return-1102-hzhelifu=corp.netease@kudu.apache.org] 代表 ?? > 发送时间: 2017年10月26日 13:50 > 收件人: user@kudu.apache.org > 主题: Re: 答复: How kudu synchronize real-time records? > > Thanks for replying me. > > It helps a lot. > > 2017-10-26 12:29 GMT+09:00 helifu : > > Hi, > > > > Now the read/write operations are limited to the master replica(record1 > on node1), and the copy replica(record1 on node2/node3) can't be read/write > by clients directly. > > > > > > 何李夫 > > 2017-04-10 11:24:24 > > > > -邮件原件- > > 发件人: user-return-1100-hzhelifu=corp.netease@kudu.apache.org [mailto: > user-return-1100-hzhelifu=corp.netease@kudu.apache.org] 代表 ?? > > 发送时间: 2017年10月26日 10:43 > > 收件人: user@kudu.apache.org > > 主题: How kudu synchronize real-time records? > > > > Hi! > > > > I read from documents saying 'once kudu receives records from client it > write those records into WAL (also does replica)' > > > > And i wonder it can be different time when load those records from WAL > in each node. > > So let's say node1 load record1 from WAL at t1, node2 t2, node3 t3 (t1 < > t2 < t3) then reading client attached node1 can see record but other > reading clients attached not node1(node2, node3) have possibilities missing > record1. > > > > I think that does not happens in kudu, and i wonder how kudu synchronize > real time data. > > > > Thanks! > > > > -- Todd Lipcon Software Engineer, Cloudera
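As a concrete illustration of option (b) above, the sketch below runs a snapshot-consistent scan with the Java client (the C++ SetSelection() snippet quoted above is the corresponding knob for choosing which replica serves the read). The master address, table name, and use of READ_AT_SNAPSHOT at the current time are illustrative assumptions:

    import org.apache.kudu.client.*;

    // Sketch: a READ_AT_SNAPSHOT scan, giving the consistency described in (b).
    public class SnapshotScanSketch {
        public static void main(String[] args) throws Exception {
            KuduClient client =
                new KuduClient.KuduClientBuilder("kudu-master:7051").build();
            KuduTable table = client.openTable("metrics");   // hypothetical table

            KuduScanner scanner = client.newScannerBuilder(table)
                .readMode(AsyncKuduScanner.ReadMode.READ_AT_SNAPSHOT)
                .build();

            while (scanner.hasMoreRows()) {
                RowResultIterator rows = scanner.nextRows();
                while (rows.hasNext()) {
                    RowResult row = rows.next();
                    System.out.println(row.rowToString());
                }
            }
            scanner.close();
            client.shutdown();
        }
    }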
Re: kudu 1.4 kerberos
On Tue, Oct 24, 2017 at 12:41 PM, Todd Lipcon wrote: > I've filed https://issues.apache.org/jira/browse/KUDU-2198 to provide a > workaround for systems like this. I should have a patch up shortly since > it's relatively simple. > > ... and here's the patch, if you want to try it out, Matteo: https://gerrit.cloudera.org/c/8373/ -Todd > -Todd > > On Tue, Oct 17, 2017 at 7:00 PM, Brock Noland wrote: > >> Just one clarification below... >> >> > On Mon, Oct 16, 2017 at 2:29 PM, Matteo Durighetto < >> m.durighe...@miriade.it> wrote: >> > the "abcdefgh1234" it's an example of the the string created by the >> cloudera manager during the enable kerberos. >> >> ... >> >> On Mon, Oct 16, 2017 at 11:57 PM, Todd Lipcon wrote: >> > Interesting. What is the sAMAccountName in this case? Wouldn't all of >> the 'kudu' have the same account name? >> >> CM generates some random names for cn and sAMAccountName. Below is an >> example created by CM. >> >> dn: CN=uQAtUOSwrA,OU=valhalla-kerberos,OU=Hadoop,DC=phdata,DC=io >> cn: uQAtUOSwrA >> sAMAccountName: uQAtUOSwrA >> userPrincipalName: kudu/worker5.valhalla.phdata...@phdata.io >> servicePrincipalName: kudu/worker5.valhalla.phdata.io >> > > > > -- > Todd Lipcon > Software Engineer, Cloudera > -- Todd Lipcon Software Engineer, Cloudera
Re: kudu 1.4 kerberos
I've filed https://issues.apache.org/jira/browse/KUDU-2198 to provide a workaround for systems like this. I should have a patch up shortly since it's relatively simple. -Todd On Tue, Oct 17, 2017 at 7:00 PM, Brock Noland wrote: > Just one clarification below... > > > On Mon, Oct 16, 2017 at 2:29 PM, Matteo Durighetto < > m.durighe...@miriade.it> wrote: > > the "abcdefgh1234" it's an example of the the string created by the > cloudera manager during the enable kerberos. > > ... > > On Mon, Oct 16, 2017 at 11:57 PM, Todd Lipcon wrote: > > Interesting. What is the sAMAccountName in this case? Wouldn't all of > the 'kudu' have the same account name? > > CM generates some random names for cn and sAMAccountName. Below is an > example created by CM. > > dn: CN=uQAtUOSwrA,OU=valhalla-kerberos,OU=Hadoop,DC=phdata,DC=io > cn: uQAtUOSwrA > sAMAccountName: uQAtUOSwrA > userPrincipalName: kudu/worker5.valhalla.phdata...@phdata.io > servicePrincipalName: kudu/worker5.valhalla.phdata.io > -- Todd Lipcon Software Engineer, Cloudera
Re: [DISCUSS] Move Slack discussions to ASF official slack?
On Mon, Oct 23, 2017 at 4:12 PM, Misty Stanley-Jones wrote: > 1. I have no idea, but you could enable the @all at-mention in the > eisting #kudu-general and let people know that way. Also see my next answer. > > Fair enough. > 2. It looks like if you have an apache.org email address you don't need > an invite, but otherwise an existing member needs to invite you. If you can > somehow get all the member email addresses, you can invite them all at once > as a comma-separated list. > I'm not sure if that's doable but potentially. I'm concerned though if we don't have auto-invite for arbitrary community members who just come by a link from our website. A good portion of our traffic is users, rather than developers, and by-and-large they don't have apache.org addresses. If we closed the Slack off to them I think we'd lose a lot of the benefit. > > 3. I can't tell what access there is to integrations. I can try to find > out who administers that on ASF infra and get back with you. I would not be > surprised if integrations with the ASF JIRA were already enabled. > > I pre-emptively grabbed #kudu on the ASF slack in case we decide to go > forward with this. If we don't decide to go forward with it, it's a good > idea to hold onto the channel and pin a message in there about how to get > to the "official" Kudu slack. > > On Mon, Oct 23, 2017 at 3:00 PM, Todd Lipcon wrote: > >> A couple questions about this: >> >> - is there any way we can email out to our existing Slack user base to >> invite them to move over? We have 866 members on our current slack and >> would be a shame if people got confused as to where to go for questions. >> >> - does the ASF slack now have a functioning self-serve "auto-invite" >> service? >> >> - will we still be able to set up integrations like JIRA/github? >> >> -Todd >> >> On Mon, Oct 23, 2017 at 2:53 PM, Misty Stanley-Jones >> wrote: >> >>> When we first started using Slack, I don't think the ASF Slack instance >>> existed. Using our own Slack instance means that we have limited access >>> to >>> message archives (unless we pay) and that people who work on multiple ASF >>> projects need to add the Kudu slack in addition to any other Slack >>> instances they may be on. I propose that we instead create one or more >>> Kudu-related channels on the official ASF slack ( >>> http://the-asf.slack.com/) >>> and migrate our discussions there. What does everyone think? >>> >> >> >> >> -- >> Todd Lipcon >> Software Engineer, Cloudera >> > > -- Todd Lipcon Software Engineer, Cloudera
Re: Kudu - Session.Configuration.FlushMode
Hi Shawn, Answers inline below On Tue, Oct 17, 2017 at 12:59 PM, Shawn Terry wrote: > We ran into a problem today that looks like it might be related to this: > > https://issues.apache.org/jira/browse/KUDU-1891 > > We had a client app crash with this same kind of error: “not enough > mutation buffer space remaining for operation”. Currently the client app > was queuing up a number of writes and doing manual flushing at the end of > the set of transactions. > This means that the configured mutation buffer size for the KuduSession object was not large enough to handle all of the operations that you wrote before flushing. The default is 7MB, but it could be configured safely to be a bit larger at the expense of memory. > We’re using the kudu-python api and would like to better understand the > behavior of the different flushing modes… (assuming > SessionConfiguration.FlushMode is the thing we should be looking at). > Since the Python API wraps the C++ API it's best to look at the C++ client docs here. See https://kudu.apache.org/cpp-client-api/classkudu_1_1client_1_1KuduSession.html#aaec3956e642610d703f3b83b78e24e19 for docs on the various flush modes. > Are there any global settings to tweak to allow a larger buffer? What > would be the pro’s and con’s of this? > At a certain size you will hit errors that the maximum RPC size has been crossed, and then your writes will fail. Additionally, flushing a larger buffer at a time implies higher latency for that flush (since it's doing more work). > Would explicitly using KuduSession.setFlushMode(AUTO_FLUSH_SYNC) make any > difference? > > AUTO_FLUSH_SYNC means that each operation that you Apply (eg an insert or update) makes its own separate round trip to the appropriate server before responding. This will be very slow if your goal is to stream a high volume of writes into Kudu. It is most appropriate for an online application where you mght want to do only a few inserts in response to some web request, etc. AUTO_FLUSH_BACKGROUND is typically the best choice for a streaming ingest or bulk load scenario since it aims to manage buffer sizes for you automatically for best performance. We'll continue to invest on making AUTO_FLUSH_BACKGROUND work as well as possible for these scenarios. -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: [DISCUSS] Move Slack discussions to ASF official slack?
A couple questions about this: - is there any way we can email out to our existing Slack user base to invite them to move over? We have 866 members on our current slack and would be a shame if people got confused as to where to go for questions. - does the ASF slack now have a functioning self-serve "auto-invite" service? - will we still be able to set up integrations like JIRA/github? -Todd On Mon, Oct 23, 2017 at 2:53 PM, Misty Stanley-Jones wrote: > When we first started using Slack, I don't think the ASF Slack instance > existed. Using our own Slack instance means that we have limited access to > message archives (unless we pay) and that people who work on multiple ASF > projects need to add the Kudu slack in addition to any other Slack > instances they may be on. I propose that we instead create one or more > Kudu-related channels on the official ASF slack (http://the-asf.slack.com/ > ) > and migrate our discussions there. What does everyone think? > -- Todd Lipcon Software Engineer, Cloudera
Re: question about connect kudu with python on windows
yep, that's correct. The C++ client isn't able to be built on Windows and would require quite a bit of effort to port. For example, it makes heavy use of 'libev' which has poor support for Windows async IO. -Todd On Mon, Oct 23, 2017 at 6:53 AM, Jordan Birdsell wrote: > Hi Chen, > > As far as I'm aware this is not possible. The python client builds on top > of the kudu client libraries and kudu can't be built on Windows. > > Maybe someone else will have a creative workaround, but I can't think of > one. > > Thanks, > Jordan > > > On Mon, Oct 23, 2017, 4:02 AM Chen Rani wrote: > >> Hi, >> >> >> I am a tester and new to use kudu. I want to connect kudu with python on >> windows, so that I can select the data. >> >> >> Is there any guide about this? >> >> >> I install kudu-python, but meet a problem. What should I do next? >> >> >> >pip install kudu-python >> Collecting kudu-python >> Using cached kudu-python-1.2.0.tar.gz >> Complete output from command python setup.py egg_info: >> Cannot find installed kudu client. >> >> >> Command "python setup.py egg_info" failed with error code 1 in >> c:\users\rani\app >> data\local\temp\pip-build-7ildct\kudu-python >> >> -- Todd Lipcon Software Engineer, Cloudera
Re: kudu 1.4 kerberos
On Mon, Oct 16, 2017 at 2:29 PM, Matteo Durighetto wrote: > Hello Todd, >thank you very much for the answer. I think I have > found something interesting. > > Kudu is doing the ACL list with the sAMAccountName or CN as it writes in > the logs: > > "Logged in from keytab as kudu/@REALM (short username > )" > > I begin to think that the problem is between sssd with plugin > sssd_krb5_localauth_plugin.so > so for every principal kudu/@REALM kudu maps to , > so seems impossible to have the all kudu/@REALM mapped to the same > "kudu" > as suggested "So, basically, it's critical that the username that the > master determines for itself (from this function) > matches the username that it has determined for the tablet servers when > they authenticate > (what you pasted as 'abcdefgh1234' above)." > Interesting. What is the sAMAccountName in this case? Wouldn't all of the 'kudu' have the same account name? > > The strange thing is that with hadoop I have the correct mapping ( > probably because I have no rule, so > it switch to default rule ) > > hadoop org.apache.hadoop.security.HadoopKerberosName kudu/@REALM > ==> kudu > > This makes sense since Hadoop uses its own auth_to_local configuration defined in its core-site.xml rather than relying on the system krb5 library. Kudu, being implemented in C++, uses the system krb5 API to do mapping, and hence picks upthe sssd localauth plugin as you found. As a temporary workaround, one thing you could try is to duplicate your krb5.conf and make a new one like krb5.kudu.conf. In this file remove the localauth plugin and any auth_to_local mappings that are causing trouble. You can then start the kudu servers with KRB5_CONFIG=/path/to/krb5.kudu.conf . I believe this will ensure that the local-auth mapping performs as you'd like. Given your setup, do you think we should provide an advanced option to override the system-configured localauth mapping and instead just use the "simple" mapping of using the short principal name? Generally we'd prefer to have as simple a configuration as possible but if your configuration is relatively commonplace it seems we might want an easier workaround than duplicating krb5.conf. -Todd > > > 2017-10-13 1:32 GMT+02:00 Todd Lipcon : > >> Hey Matteo, >> >> Looks like you did quite a bit of digging in the code! Responses inline >> below. >> >> On Wed, Oct 11, 2017 at 1:24 PM, Matteo Durighetto < >> m.durighe...@miriade.it> wrote: >> >>> Hello, >>>I have a strange behaviour with Kudu 1.4 and kerberos. >>> I enabled kerberos on kudu, I have the principal correctly in the OU of >>> an AD, but >>> at startup I got a lot of errors on method TSHeartbeat between tablet >>> server and >>> master server as unauthorized. There's no firewall between nodes. >>> >> >> right, "unauthorized" indicates that the connection was made fine, but >> the individual RPC call was determined to not be allowed for the identity >> presented on the other side of the connection. >> >> >>> >>> W1011server_base.cc:316] Unauthorized access attempt >>> to method kudu.master.MasterService.TSHeartbeat >>> from {username='abcdefgh1234', principal='kudu/hostn...@domain.xyz'} >>> at :37360 >>> >>> the "abcdefgh1234" it's an example of the the string created by the >>> cloudera manager during the enable kerberos. >>> >> >> This output indicates that it successfully authenticated via Kerberos as >> the principal listed above. 
That's good news and means you don't need to >> worry about rdns, etc (if you had issues with that it would have had >> trouble finding a service ticket or authenticating the connection). This >> means you got past the "authentication" step and having problems at the >> "authorization" step. >> >> >>> >>> The other services (hdfs and so on ) are under kerberos without problem >>> and there is the rdns at true in the /etc/krb5.conf ( KUDU-2032 ). >>> As I understand the problem is something about the 3° level of >>> authorization between master servers and tablet servers. >>> >> >> Right. >> >> >>> ... >>> So I think the problem, as I say before, could be >>> in ContainsKey(users_, username); : >>> >>> bool SimpleAcl::UserAllowed(const string& username) { >>> return ContainsKey(users_, "*") || ContainsKey(users_, u
Re: Service unavailable: Transaction failed, tablet 2758e5c68e974b92a3060db8575f3621 transaction memory consumption (67031036) has exceeded its limit (67108864) or the limit of an ancestral tracker
after 545 > attempt(s): Failed to write to server: (no server available): Write(tablet: > 2758e5c68e974b92a3060db8575f3621, num_ops: 76249, num_attempts: 545) > passed its deadline: Illegal state: Replica c4ed5cb73f5644a8804d3abc976d02f8 > is not leader of this config. Role: FOLLOWER. Consensus state: > current_term: 10 leader_uuid: "" committed_config { opid_index: 13049 > OBSOLETE_local: false peers { permanent_uuid: " > ad1ea284caff4b07a705c9156b0811cd" member_type: VOTER last_known_addr { > host: "cloud-ocean-kudu-01" port: 7050 } } peers { permanent_uuid: " > c4ed5cb73f5644a8804d3abc976d02f8" member_type: VOTER last_known_addr { > host: "cloud-ocean-kudu-02" port: 7050 } } peers { permanent_uuid: " > 067e1e7245154f0fb2720dec6c77feec" member_type: VOTER last_known_addr { > host: "cloud-ocean-kudu-04" port: 7050 } } } pending_config { opid_index: > 13692 OBSOLETE_local: false peers { permanent_uuid: " > c4ed5cb73f5644a8804d3abc976d02f8" member_type: VOTER last_known_addr { > host: "cloud-ocean-kudu-02" port: 7050 } } peers { permanent_uuid: " > 067e1e7245154f0fb2720dec6c77feec" member_type: VOTER last_known_addr { > host: "cloud-ocean-kudu-04" port: 7050 } } } > Error in Kudu table 'hwx_log': Timed out: Failed to write batch of 76249 > ops to tablet 2758e5c68e974b92a3060db8575f3621 after 545 attempt(s): > Failed to write to server: (no server available): Write(tablet: > 2758e5c68e974b92a3060db8575f3621, num_ops: 76249, num_attempts: 545) > passed its deadline: Illegal state: Replica c4ed5cb73f5644a8804d3abc976d02f8 > is not leader of this config. Role: FOLLOWER. Consensus state: > current_term: 10 leader_uuid: "" committed_config { opid_index: 13049 > OBSOLETE_local: false peers { permanent_uuid: " > ad1ea284caff4b07a705c9156b0811cd" member_type: VOTER last_known_addr { > host: "cloud-ocean-kudu-01" port: 7050 } } peers { permanent_uuid: " > c4ed5cb73f5644a8804d3abc976d02f8" member_type: VOTER last_known_addr { > host: "cloud-ocean-kudu-02" port: 7050 } } peers { permanent_uuid: " > 067e1e7245154f0fb2720dec6c77feec" member_type: VOTER last_known_addr { > host: "cloud-ocean-kudu-04" port: 7050 } } } pending_config { opid_index: > 13692 OBSOLETE_local: false peers { permanent_uuid: " > c4ed5cb73f5644a8804d3abc976d02f8" member_type: VOTER last_known_addr { > host: "cloud-ocean-kudu-02" port: 7050 } } peers { permanent_uuid: " > 067e1e7245154f0fb2720dec6c77feec" member_type: VOTER last_known_addr { > host: "cloud-ocean-kudu-04" port: 7050 } } } (1 of 76249 similar) > > 2017-09-06 14:04 GMT+08:00 Lee King : > >> We got an error about :Service unavailable: Transaction failed, tablet >> 2758e5c68e974b92a3060db8575f3621 transaction memory consumption >> (67031036) has exceeded its limit (67108864) or the limit of an ancestral >> tracker.It looks like https://issues.apache.org/jira/browse/KUDU-1912. >> and the bug will be fix at 1.5,but out version is 1.4,Is there any affect >> for kudu stablity or data consistency? >> > > -- Todd Lipcon Software Engineer, Cloudera
Re: kudu 1.4 kerberos
Hey Matteo, Looks like you did quite a bit of digging in the code! Responses inline below. On Wed, Oct 11, 2017 at 1:24 PM, Matteo Durighetto wrote: > Hello, >I have a strange behaviour with Kudu 1.4 and kerberos. > I enabled kerberos on kudu, I have the principal correctly in the OU of an > AD, but > at startup I got a lot of errors on method TSHeartbeat between tablet > server and > master server as unauthorized. There's no firewall between nodes. > right, "unauthorized" indicates that the connection was made fine, but the individual RPC call was determined to not be allowed for the identity presented on the other side of the connection. > > W1011server_base.cc:316] Unauthorized access attempt > to method kudu.master.MasterService.TSHeartbeat > from {username='abcdefgh1234', principal='kudu/hostn...@domain.xyz'} > at :37360 > > the "abcdefgh1234" it's an example of the the string created by the > cloudera manager during the enable kerberos. > This output indicates that it successfully authenticated via Kerberos as the principal listed above. That's good news and means you don't need to worry about rdns, etc (if you had issues with that it would have had trouble finding a service ticket or authenticating the connection). This means you got past the "authentication" step and having problems at the "authorization" step. > > The other services (hdfs and so on ) are under kerberos without problem > and there is the rdns at true in the /etc/krb5.conf ( KUDU-2032 ). > As I understand the problem is something about the 3° level of > authorization between master servers and tablet servers. > Right. > ... > So I think the problem, as I say before, could be in ContainsKey(users_, > username); : > > bool SimpleAcl::UserAllowed(const string& username) { > return ContainsKey(users_, "*") || ContainsKey(users_, username); > } > > > At this point It's not clear for me how Kudu build the array/key list > users for daemon service ( it's not as super users or user ACL an external > parameter). > Exactly. The users here for the 'service' ACL are set in ServerBase::InitAcls(): boost::optional keytab_user = security::GetLoggedInUsernameFromKeytab(); if (keytab_user) { // If we're logged in from a keytab, then everyone should be, and we expect them // to use the same mapped username. service_user = *keytab_user; } else { // If we aren't logged in from a keytab, then just assume that the services // will be running as the same Unix user as we are. RETURN_NOT_OK_PREPEND(GetLoggedInUser(&service_user), "could not deterine local username"); } Since you're using Kerberos, the top branch here would apply -- it's calling GetLoggedInUsernameFromKeytab() from init.cc. You can see what username the server is getting by looking for a log message at startup like "Logged in from keytab as kudu/@REALM (short username )". Here 'XYZ' is the username that ends up in the service ACL. So, basically, it's critical that the username that the master determines for itself (from this function) matches the username that it has determined for the tablet servers when they authenticate (what you pasted as 'abcdefgh1234' above). That brings us to the next question: how do we convert from a principal like kudu/@REALM to a short "username"? The answer there is the function 'MapPrincipalToLocalName' again from security/init.cc. This function delegates the mapping to the krb5 library itself using the krb5_aname_to_localname() API. 
The results of this API can vary depending on the kerberos configuration, but in typical configurations it's determined by the 'auth_to_local' configuration in your krb5.conf. See the corresponding section in the docs here: https://web.mit.edu/kerberos/krb5-1.12/doc/admin/conf_files/krb5_conf.html My guess is that your host has been configured such that when the master maps its own principal, it's getting a different result than when it maps the principal being used by the tservers. Hope that gets you on the right track. Thanks -Todd -- Todd Lipcon Software Engineer, Cloudera
Re: Change Data Capture (CDC) with Kudu
ce >>>> in the case the primary key is changed), we are trying to devise a process >>>> that will keep the secondary instance in sync with the primary one. The two >>>> instances do not have to be identical in real-time (i.e. we are not looking >>>> for synchronous writes to Kudu), but we would like to have some pretty good >>>> confidence that the secondary instance contains all the changes that the >>>> primary has up to say an hour before (or something like that). >>>> >>>> >>>> So far we considered a couple of options: >>>> - refreshing the seconday instance with a full copy of the primary one >>>> every so often, but that would mean having to transfer say 50TB of data >>>> between the two locations every time, and our network bandwidth constraints >>>> would prevent to do that even on a daily basis >>>> - having a column that contains the most recent time a row was updated, >>>> however this column couldn't be part of the primary key (because the >>>> primary key in Kudu is immutable), and therefore finding which rows have >>>> been changed every time would require a full scan of the table to be >>>> sync'd. It would also rely on the "last update timestamp" column to be >>>> always updated by the application (an assumption that we would like to >>>> avoid), and would need some other process to take into accounts the rows >>>> that are deleted. >>>> >>>> >>>> Since many of today's RDBMS (Oracle, MySQL, etc) allow for some sort of >>>> 'Change Data Capture' mechanism where only the 'deltas' are captured and >>>> applied to the secondary instance, we were wondering if there's any way in >>>> Kudu to achieve something like that (possibly mining the WALs, since my >>>> understanding is that each change gets applied to the WALs first). >>>> >>>> >>>> Thanks, >>>> Franco Venturi >>>> >>> >> > > -- Todd Lipcon Software Engineer, Cloudera
Re: Retrieving multiple records by composite primary key via java api
Hi Mauricio, Currently we don't offer any kind of "multi-scan" or "multi-get" API. So the best bet is probably to use async scanners, separately, one for each range that you need to fetch. Assuming that the scans are short ranges or single keys, each such scanner should result in a single round-trip to the target server. i.e the scanner "open" call will fetch the results and automatically "close" itself assuming that the results fit in a single RPC. So, you may find that performance is pretty good even if the API appears clunky at first. -Todd On Mon, Sep 25, 2017 at 5:01 PM, Mauricio Aristizabal < mauri...@impactradius.com> wrote: > Hi Kudu team > > KuduPredicate.newInListPredicate is very handy when retrieving multiple > records in one scan, but it's not useful to do so by key if it's a > composite primary key. > > What's the proper way to retrieve 50 arbitrary records by composite key? > do we need to do 50 separate (perhaps async) scans? > > I certainly don't see a way in the api docs to join multiple predicates > with an OR operator. > > thanks, > > -m > > -- > *MAURICIO ARISTIZABAL* > Architect - Business Intelligence + Data Science > mauri...@impactradius.com(m)+1 323 309 4260 <(323)%20309-4260> > 223 E. De La Guerra St. | Santa Barbara, CA 93101 > <https://maps.google.com/?q=223+E.+De+La+Guerra+St.+%7C+Santa+Barbara,+CA+93101&entry=gmail&source=g> > > Overview <http://www.impactradius.com/?src=slsap> | Twitter > <https://twitter.com/impactradius> | Facebook > <https://www.facebook.com/pages/Impact-Radius/153376411365183> | LinkedIn > <https://www.linkedin.com/company/impact-radius-inc-> > -- Todd Lipcon Software Engineer, Cloudera
Re: Please tell me about License regarding kudu logo usage
Oops, adding the original poster in case he or she is not subscribed to the list. On Sep 19, 2017 10:46 PM, "Todd Lipcon" wrote: > Hi Yuya, > > There should be no problem to use the Apache Kudu logo in your conference > slides, assuming you are just using as intended to describe or refer to the > project itself. This is considered "nominative use" under trademark laws. > > You can read more about nominative use at: > https://www.apache.org/foundation/marks/#principles > > Thanks for including Kudu in your upcoming talk! I hope you will share the > slides with the community. > > Todd > > > > On Sep 19, 2017 10:43 PM, "野口 裕也" wrote: > > Dear Team Kudu > > I want to use Apache Kudu Logo in my session of JAPAN PHP CONFERENCE 2017 > > Can I use it in my session? > > Please tell me about License regarding kudu logo usage. > > I look forward to hearing from you. > > Best regards, > yuya > > -- > > >