Re: [ANNOUNCE] Welcoming Yingchun Lai as a Kudu committer and PMC member

2019-06-05 Thread Mike Percy
Congrats Yingchun and welcome aboard!

Regards,
Mike

On Wed, Jun 5, 2019 at 11:25 AM Todd Lipcon  wrote:

> Hi Kudu community,
>
> I'm happy to announce that the Kudu PMC has voted to add Yingchun Lai as a
> new committer and PMC member.
>
> Yingchun has been contributing to Kudu for the last 6-7 months and
> contributed a number of bug fixes, improvements, and features, including:
> - new CLI tools (eg 'kudu table scan', 'kudu table copy')
> - fixes for compilation warnings, code cleanup, and usability improvements
> on the web UI
> - support for prioritization of tables for maintenance manager tasks
> - CLI support for config files to make it easier to connect to multi-master
> clusters
>
> Yingchun has also been contributing by helping new users on Slack, and
> helps operate 6 production clusters at Xiaomi, one of our larger
> installations in China.
>
> Please join me in congratulating Yingchun!
>
> -Todd
>


Re: close Kudu client on timeout

2019-01-17 Thread Mike Percy
I have a couple more questions:

 - Did you get a jstack of the process? If so I assume you saw lots of
Netty threads like "New I/O boss", "New I/O worker", etc. because of having
many KuduClient instances. Is that right?
 - Just curious: are your edge node clients in the same data center as Kudu
or are you going across the WAN with your client API writes? This should
not affect client threads but has application architecture implications
(i.e. are you buffering or dropping events at the edge node?) when the WAN
link or the Kudu service is unavailable for some reason.

In general, we recommend sharing Kudu client instances to avoid too many
threads. A single Kudu client and Netty setup should be able to handle all
the threads in the process. An example of this is the static Kudu client
cache we use for the Spark integration at
https://github.com/apache/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala#L445
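
For illustration, here is a minimal sketch of that pattern in plain Java. The
holder class and its shutdown handling are my own invention (not a Kudu API);
only KuduClient and KuduClientBuilder come from the client library:

import org.apache.kudu.client.KuduClient;

// Hypothetical helper: one KuduClient (and therefore one Netty thread pool) per JVM.
public final class SharedKuduClient {
  private static volatile KuduClient client;

  private SharedKuduClient() {}

  public static KuduClient get(String masterAddresses) {
    if (client == null) {
      synchronized (SharedKuduClient.class) {
        if (client == null) {
          client = new KuduClient.KuduClientBuilder(masterAddresses).build();
          // Close the shared client once, when the process exits.
          Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
              client.close();
            } catch (Exception e) {
              // Nothing useful to do at shutdown.
            }
          }));
        }
      }
    }
    return client;
  }
}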

Hope that helps,
Mike


On Thu, Jan 17, 2019 at 11:52 AM Alexey Serbin  wrote:

> Hi Boris,
>
> Kudu servers have a setting for the connection inactivity period: idle
> connections to the servers will be automatically closed after the specified
> time (--rpc_default_keepalive_time_ms is the flag).  So, from that
> perspective, idle clients are not a big concern for the Kudu server side.
>
> As for your question, right now Kudu doesn't have a way to initiate a
> shutdown of an idle client from the server side.
>
> BTW, I'm curious what it was in the case you reported: were there too
> many idle Kudu client objects around, created by the same application?  Or
> was that something else, like a single idle Kudu Java client that created
> so many threads?
>
>
> Thanks,
>
> Alexey
>
> On Wed, Jan 16, 2019 at 1:31 PM Boris Tyukin 
> wrote:
>
>> sorry it is Java
>>
>> On Wed, Jan 16, 2019 at 3:32 PM Mike Percy  wrote:
>>
>>> Java or C++ / Python client?
>>>
>>> Mike
>>>
>>> Sent from my iPhone
>>>
>>> > On Jan 16, 2019, at 12:27 PM, Boris Tyukin 
>>> wrote:
>>> >
>>> > Hi guys,
>>> >
>>> > is there a setting on Kudu server to close/clean-up inactive Kudu
>>> clients?
>>> >
>>> > we just found some rogue code that did not close the client on
>>> completion, and we are wondering if we can prevent this in the future at the
>>> Kudu server level rather than relying on good developers.
>>> >
>>> > That code caused 22,000 threads to be opened on our edge node over the last
>>> few days.
>>> >
>>> > Boris
>>>
>>>


Re: close Kudu client on timeout

2019-01-16 Thread Mike Percy
Java or C++ / Python client?

Mike

Sent from my iPhone

> On Jan 16, 2019, at 12:27 PM, Boris Tyukin  wrote:
> 
> Hi guys,
> 
> is there a setting on Kudu server to close/clean-up inactive Kudu clients? 
> 
> we just found some rogue code that did not close the client on completion, 
> and we are wondering if we can prevent this in the future at the Kudu server 
> level rather than relying on good developers.
> 
> That code caused 22,000 threads to be opened on our edge node over the last 
> few days.
> 
> Boris



Re: kudu-client dependencies

2019-01-02 Thread Mike Percy
Hi Boris,
kudu-client is a client API library designed to be embedded in a client
application, and it specifies its dependencies via a Maven pom. Typically
one would only want one version of a given dep on the classpath at runtime
and so shipping a fat jar usually isn't done for client libraries.

We shade all dependencies that are not exposed via the public API except
slf4j and related bindings since those are typically provided by the
application (e.g. slf4j-log4j). Since async appears in the public Kudu
Client API we can't shade it.

kudu-client-tools is not a library but a set of command-line tools, so it
has to carry all of its dependencies in the jar.

I'm not sure how most people handle dependency management in the Groovy
world, but a quick Google search turned up Grape, so
maybe that's worth looking into.
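
For example, an untested sketch of a Groovy script that lets Grape pull the
client and a logging binding from Maven Central (the versions and master
address below are placeholders; match them to your cluster):

@Grab('org.apache.kudu:kudu-client:1.8.0')
@Grab('org.slf4j:slf4j-simple:1.7.25')
import org.apache.kudu.client.KuduClient

// List the tables as a quick smoke test of the dependencies.
def client = new KuduClient.KuduClientBuilder('kudu-master-1:7051').build()
println client.getTablesList().getTablesList()
client.close()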

Regards,
Mike


On Wed, Jan 2, 2019 at 12:37 PM Boris Tyukin  wrote:

> ok, we just figured out that we need another jar: kudu-client-tools.jar.
> That one is bundled with a proper version of the async lib and slf4j-api.
>
> slf4j-simple.jar has to be added separately but you do not have to do it
> if it is okay to suppress kudu client logs.
>
> kudu-client.jar and kudu-client-tools.jar are symlinked to a proper
> version of jars for CDH parcel.
>
> /opt/cloudera/parcels/CDH/lib/kudu/kudu-client.jar
> /opt/cloudera/parcels/CDH/lib/kudu/kudu-client-tools.jar
> /opt/cloudera/parcels/CDH/jars/slf4j-simple-1.7.5.jar
>
>
>
> On Wed, Jan 2, 2019 at 2:44 PM Boris Tyukin  wrote:
>
>> Hi guys,
>>
>> sorry for a dumb question, but why does kudu-client.jar not include the async,
>> slf4j-api, and slf4j-simple libs? I need to call the Kudu API from a simple
>> Groovy script and had to add 3 other jars explicitly.
>>
>> I see these libs were excluded on purpose:
>> https://github.com/apache/kudu/blob/master/java/kudu-client/build.gradle
>>
>> Kafka client, for example, is a single jar.
>>
>> My challenge now is which version of these libs to pick, how to support them
>> so they won't break in the future, etc.
>>
>> Even on CDH cluster, while kudu client is shipped with parcel, one has to
>> know exact versions of other 3 jars for client to work.
>>
>> Maybe I am missing something here and there is an easy way, especially on
>> CDH since it ships already with Kudu.
>>
>> Boris
>>
>


Re: 答复: [KUDU] Rebalancing

2018-11-27 Thread Mike Percy
It is not automatic balancing, but there is a tool that will rebalance
tables across the cluster, and on the latest version of Kudu it can be run
and killed at any time with no negative effects. The tool also works on
older versions of Kudu, with some caveats.

See
https://kudu.apache.org/releases/1.8.0/docs/administration.html#rebalancer_tool
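
For reference, on a 1.8.0+ cluster the basic invocation is a single CLI call
(master addresses below are placeholders):

sudo -u kudu kudu cluster rebalance master-1:7051,master-2:7051,master-3:7051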

Mike

On Tue, Nov 27, 2018 at 4:23 AM Дмитрий Павлов  wrote:

> Thanks
>
>
> Tuesday, 27 November 2018, 14:11 +03:00, from helifu:
>
>
> https://kudu.apache.org/releases/1.8.0/docs/release_notes.html#rn_1.8.0_new_features
>
>
>
>
>
> 何李夫
>
> 2018-11-27 19:11:25
>
>
>
> *From:* user-return-1542-hzhelifu=corp.netease@kudu.apache.org <
> user-return-1542-hzhelifu=corp.netease@kudu.apache.org> *on behalf of* Дмитрий
> Павлов
> *Sent:* November 27, 2018 18:57
> *To:* user 
> *Subject:* [KUDU] Rebalancing
>
>
>
> Hi
>
> Does the latest version of Kudu support auto rebalancing between nodes?
>
> Best Regards, Dmitry Pavlov
>
>
>
> --
> Дмитрий Павлов
>


Community chat on Slack on Tue Nov 13 @ 10am PDT

2018-10-24 Thread Mike Percy
Hi Kudu dev community,

I'm posting this to dev@ and BCC'ing user@ -- let's follow up on the Kudu
dev@ list.

Following up on some previous email threads on the topic of growing the
Kudu community, I would like to know if Kudu developers / interested
community members would be interested in having a real-time chat meeting
(online) to discuss progress and continue those discussions.

*What*: The agenda would be to evaluate progress on and discuss action
items in service of the following goals:

   1. Increase adoption of Kudu in general (and remove barriers to adoption)
   2. Increase the number of contributors to Kudu, especially committers

In addition to reviewing and updating the list of action items, I'd also
like to get volunteers for things that need help to get completed (or
started).

*When / Where*: Let's meet in the #kudu-general chat room on the getkudu
 Slack instance for one hour starting
at 10am PDT on Tuesday, November 13.

For those who can't attend in real-time, the chat history will be available
and I'll send notes to the mailing list afterward, so we can also discuss
the same topics over email after the meeting.

Please let me know if this sounds like something you'd like to take part in
or if you have a suggestion for a better way to coordinate this effort,
want to propose an alternative time, etc.

Please find below the current list of action items compiled by Grant and me.

Thanks,
Mike

--

*Being worked on:*

   - KUDU-2411 : Binary
   artifacts (Linux / macOS) on Maven to enable a Kudu MiniCluster usable by
   external projects - Grant / Mike
   - KUDU-2402 : Gerrit
   Sign In UI bug: we upgraded Gerrit to 2.4.15 but unfortunately it didn't
   fix the issue (we thought this was in the list of fixed issues for 2.4.6).
   We are going to try updating some RewriteRules next - Mike working with
   Cloudera IT, who hosts this infrastructure

*Not being worked on:*

*Increase number of contributors*

   - Support GitHub pull requests (forward to Gerrit?)
   - Create more contributor-focused FAQs and docs (wiki?)
   - Code overview and C++ guidelines article targeted at Java developers
   - Quarterly email to the dev/user lists with links to beginner / newbie
   jiras
   - Video walkthrough of Kudu code base, including how to set up a dev env
   with 
   - Simplify CONTRIBUTING.adoc

*Increase adoption*

*Non-product*

   - Binary artifacts as part of the Apache Kudu release process
  - DEB / RPM packages
  - Tarball releases
  - Ports / Homebrew integration for macOS
   - Full fledged demos / application examples
   - Easy ingest tools for demos, e.g. CLI tools for CSV -> Kudu or similar
   - Schedule regular meetups / hold more talks
   - Improve client APIs to make integration easier / more powerful (need
   specific ideas)
   - More blog posts, including invited blog posts
   - More documentation / blog posts about existing integrations that
   people may not know how to use

*Product improvements*

For now, let's leave big-ticket features off this list -- most are pretty
obvious and they'll take up all the oxygen in the room. Let's reserve this
section for relatively low-effort and high-reward quality-of-life
improvements to the product.

   - TBD


Re: Locks are acquired to cost much time in transactions

2018-09-18 Thread Mike Percy
Why do you think you are spending a lot of time contending on row locks?

Have you tried configuring your clients to send smaller batches? This may
decrease throughput on a per-client basis but will likely improve latency
and reduce the likelihood of row lock contention.
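
For example, with the Java client a sketch of capping batch sizes looks
roughly like this (the buffer size and flush interval values are made up;
tune them for your workload):

import org.apache.kudu.client.*;

public class SmallBatchExample {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("master-1:7051").build();
    KuduSession session = client.newSession();
    session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
    // Smaller buffers mean smaller batches per tablet server request, which
    // shortens the time row locks are held on the server side.
    session.setMutationBufferSpace(500); // max pending operations per session
    session.setFlushInterval(100);       // milliseconds between background flushes
    // ... apply upserts here ...
    session.close();
    client.close();
  }
}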

If you are really spending most of your time contending on row locks then
you will likely run into more fundamental performance issues trying to
scale your writes, since Kudu's MVCC implementation effectively stores a
linked list of updates to a given cell until compaction occurs. See
https://github.com/apache/kudu/blob/master/docs/design-docs/tablet.md#historical-mvcc-in-diskrowsets
for more information about the on-disk design.

If you accumulate too many uncompacted mutations against a given row,
reading the latest value for that row at scan time will be slow because it
has to do a lot of work at read time.

Mike

On Tue, Sep 18, 2018 at 8:48 AM Xiaokai Wang  wrote:

> Moved here from JIRA.
>
> Hi guys, I met a problem with key locks that impacts the
> service's normal writing.
>
>
> As we all know, a transaction which gets all row_key locks will go on to the
> next step in Kudu. Everything looks good if keys are not concurrently updated.
> But when keys are updated by more than one client at the same time, acquiring
> the locks involves long waits. These cases happen often in my production
> environment. Has anybody met this problem? Does anyone have a good idea for this?
>
>
> My idea is to try to abandon the key locks and instead use the
> _pool_token_ 'SERIAL' mode, which keeps the transactions for a key serial
> and ordered. Does this work?
>
>
> Hope to get your advice. Thanks.
>
>
> -
> Regards,
> Xiaokai
>


Re: poor performance on insert into range partitions and scaling

2018-07-31 Thread Mike Percy
Can you post a query profile from Impala for one of the slow insert jobs?

Mike

On Tue, Jul 31, 2018 at 12:56 PM Tomas Farkas  wrote:

> Hi,
> I wanted to share with you the preliminary results of my Kudu testing on AWS.
> I created a set of performance tests for evaluating different instance
> types in AWS and different configurations (Kudu separated from Impala, Kudu
> and Impala on the same nodes) and different drive (st1 and gp2) settings, and
> here are my results:
>
> I was quite disappointed by the inserts in Step 3; see the attached sqls.
>
> Any hints, ideas, why this does not scale?
> Thanks
>
>
>


Re: Growing the Kudu community

2018-07-23 Thread Mike Percy
On Mon, Jul 23, 2018 at 10:46 AM Sailesh Mukil 
wrote:

> On Tue, Jul 17, 2018 at 7:37 PM, Mike Percy  wrote:
> > On Tue, Jul 17, 2018 at 2:59 PM Sailesh Mukil 
> wrote:
> >
> > > A suggestion to add on to the easily downloadable pre-built packages,
> is to
> > > have easily accessible/downloadable example test-data that's fairly
> > > representative of real world datasets (but it doesn't have to be too
> > > large). Additionally, we can write tutorials in kudu/examples/ that use
> > > this test data, to give new users a better feel for the system.
> >
> > That sounds useful. Any ideas on where we could find such a data set?
>
> Starting with a small scale factor of TPC-H and TPC-DS might not be a bad
> idea.
>

Once backup and restore has stabilized we could push some example data sets
to S3 and allow people to restore locally from the bucket. That could make
a nice basis for a quickstart tutorial.

Mike


Re: Growing the Kudu community

2018-07-18 Thread Mike Percy
On Wed, Jul 18, 2018 at 8:52 AM Tim Robertson 
wrote:

> Perhaps we should continue this on the dev@ list discussion I started a
> few weeks back [2]?



[2]
> https://lists.apache.org/thread.html/ee697a022b72bbca2761b1af0581773d8fb708f701fc969bc259fc2d@%3Cdev.kudu.apache.org%3E
>


Sure, let's continue the conversation on that thread.

Mike


Re: Growing the Kudu community

2018-07-17 Thread Mike Percy
On Tue, Jul 17, 2018 at 12:22 PM Grant Henke 
wrote:

> I have started a document for blog post ideas/topics here:
>
> https://docs.google.com/document/d/12QFRIhNDMoOI1kOQBgch64xYJ9t6UbyVt1D3NaTl7lI/edit?usp=sharing
>

Nice list, Grant. Actually I think that quarterly email would probably make
for a better blog post instead and I've added it as a suggestion on that
doc.

On Tue, Jul 17, 2018 at 12:04 PM Mauricio Aristizabal 
wrote:

> I was disappointed that Strata SJ 2018 didn't have a single session on
> Kudu, there were no committers in attendance that I could tell, and it
> wasn't being highlighted at all in the Cloudera booth.  Between Strata and
> ScalaDays I must have enthusiastically mentioned the product to 15 people
> and none had heard of it.
>

Hmm, that is disappointing, and a bit surprising. Perhaps everybody thought
everybody else was going to submit... actually I had intended to submit a
talk proposal to Strata this year but got busy and missed the deadline. :(

I wonder if folks using Kudu would like to present on their use case? I'm
sure conference-goers would like to hear from more people using Kudu "in
anger" (hopefully not angrily).

On Tue, Jul 17, 2018 at 2:59 PM Sailesh Mukil 
wrote:

> A suggestion to add on to the easily downloadable pre-built packages, is to
> have easily accessible/downloadable example test-data that's fairly
> representative of real world datasets (but it doesn't have to be too
> large). Additionally, we can write tutorials in kudu/examples/ that use
> this test data, to give new users a better feel for the system.


That sounds useful. Any ideas on where we could find such a data set?

On Tue, Jul 17, 2018 at 11:59 AM Tim Robertson 
wrote:

> ++1 on the mini cluster
> Perhaps include a docker image build at the same time which presumably
> wouldn't be much effort?
>

I'm not really sure there is a lot of overlap between creating a Docker
image and the kind of relocatable artifacts I'm trying to build, aside from
the actual compiling part. But I think it would be valuable for Docker
users to be able to easily pull down a Kudu image.


> l'll be happy to contribute on the Java / maven related parts to that. I
> will use this for the testing framework for the Apache Beam KuduIO and will
> certainly help test / write a blog.
>

I don't really know how to handle the Maven part where we unpack the
tarball and set it up somewhere so we can invoke it from the
KuduMiniCluster. Maybe that would require writing a custom Maven plugin?

I'd love to see a blog post about how to use Kudu with Beam!

Mike


Growing the Kudu community

2018-07-17 Thread Mike Percy
Hi Apache Kudu community,

Apologies for cross-posting, we just wanted to reach a broad audience for
this topic.

Grant and I have been brainstorming about what we can do to grow the
community of Kudu developers and users. We think Kudu has a lot going for
it, but not everybody knows what it is and what it’s capable of. Focusing
and combining our collective efforts to increase awareness (marketing) and
to reduce barriers to contribution and adoption could be a good way to
achieve organic growth.

We’d like to hear your ideas about what barriers and pain points exist and
any ideas you may have to fix some of those things -- especially ideas
requiring minimal effort and maximum impact.

To kick this off, here are some ideas Grant and I have come up with so far,
in sort of a rough priority order:

Ideas for general improvements

   1. Java MiniCluster support out of the box (KUDU-2411)
   1. This will enable integration with other projects in a way that allows
  them to test against a running Kudu cluster and ensure quality without
  having to build it themselves.
  2. Create a dedicated Maven-consumable java module for a Kudu
  MiniCluster
  3. Pre-built binary artifacts (for testing use only) downloadable
  with MiniCluster (Linux / MacOS)
  4. Ship all dependencies (even security deps, which will not be fixed
  if CVEs found)
  5. Make the binaries Linux distro-independent by building on an old
  distro (EL6)
   2. Upgrade Gerrit to fix the “New UI” GitHub Login Bug (KUDU-2402)
  1. Remove barrier to submitting a patch
  2. Latest version of Gerrit has a fix for the bad GitHub login
  redirect
   3. Upstream pre-built packages for production use (Start rhel7, maybe
   ubuntu)
   1. This is potentially a pretty large effort, depending on the number of
  platforms we want to support
  2. Tarballs -- per-OS / per-distro
  3. Yum install, apt get: per-OS / per-distro
  4. Homebrew?
   4. CLI based tools with zero dependencies for quick experiments/demos
   1. Create, describe, alter tables
  2. Cat data out, pipe data in.
  3. Or simple Python examples to do similar
   5. Create developer oriented docs and faqs (wiki style?)
   6. CONTRIBUTING.adoc in repo
   1. Simplified
  2. Quick “assume nothing tutorial”
  3. Video Guide?

Ongoing marketing and engagement

   1. Quarterly email to the dev / users list
   1. Recognize new contributors
  2. Call out beginner jiras
  3. Summarize ongoing projects
   2. Consistently use the beginner / newbie tag in JIRA
   1. Doc how to find beginner jiras in the contributing docs
   3. Regular blog posts
   1. Developer and community contributors
  2. Invite people from other projects that integrate w/ Kudu to post
  on our Blog
  3. Document how to contribute a blog post
  4. Topics: Compile and maintain a list of blog post ideas in case
  people want inspiration -- Grant has been gathering ideas for this
   4. Archive Slack to a mailing list to be indexed by search engines
   (SlackArchive.io has shut down)

Please offer your suggestions for where we can get a good bang for our
collective buck, and if there is anything you would like to work on by all
means please either speak up or feel free to reach out directly.

Thanks,

Grant and Mike


Re: WAL directory is full

2018-05-14 Thread Mike Percy
Hi Saeid,
What version of Kudu are you running? Do you see any errors when you run
"sudo -u kudu kudu cluster ksck" on the cluster?

Mike

On Fri, May 11, 2018 at 5:12 AM, Saeid Sattari 
wrote:

> Hi all,
>
> I assigned a 100GB SSD disk to the WAL on each node in my cluster. Recently,
> I realized that some nodes replicated their cfiles to other nodes due to
> insufficient space. I found the flag --log_max_segments_to_retain,
> which controls the number of past log segments to retain, but it is currently
> marked unsupported. Do you have any idea or experience solving this problem?
> Thank you in advance.
>
> Regards,
> Saeid
>


Re: Spark Streaming + Kudu

2018-03-06 Thread Mike Percy
Hmm, could you try in Spark local mode? e.g.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-local.html
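
For example, with spark-submit you can force local mode from the command line
(the class and jar names below are placeholders):

spark-submit --master "local[*]" --class com.example.KuduStreamingApp your-app.jar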

Mike

On Tue, Mar 6, 2018 at 7:14 PM, Ravi Kanth  wrote:

> Mike,
>
> Can you clarify a bit on grabbing the jstack for the process? I launched
> my Spark application and tried to get the pid, with which I thought I could
> grab a jstack trace during the hang. Unfortunately, I am not able to figure out
> how to grab the pid for the Spark application.
>
> Thanks,
> Ravi
>
> On 6 March 2018 at 18:36, Mike Percy  wrote:
>
>> Thanks Ravi. Would you mind attaching the output of jstack on the process
>> during this hang? That would show what the Kudu client threads are doing,
>> as what we are seeing here is just the netty boss thread.
>>
>> Mike
>>
>> On Tue, Mar 6, 2018 at 8:52 AM, Ravi Kanth 
>> wrote:
>>
>>>
>>> Yes, I have debugged to find the root cause. Every logger before "table
>>> = client.openTable(tableName);" is executing fine and exactly at the
>>> point of opening the table, it is throwing the below exception and nothing
>>> is being executed after that. Still the Spark batches are being processed
>>> and at opening the table is failing. I tried catching it with no luck.
>>> Please find below the exception.
>>>
>>> 8/02/23 00:16:30 ERROR client.TabletClient: [Peer
>>> bd91f34d456a4eccaae50003c90f0fb2] Unexpected exception from downstream
>>> on [id: 0x6e13b01f]
>>> java.net.ConnectException: Connection refused:
>>> kudu102.dev.sac.int.threatmetrix.com/10.112.3.12:7050
>>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl
>>> .java:717)
>>> at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket
>>> .nio.NioClientBoss.connect(NioClientBoss.java:152)
>>> at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket
>>> .nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
>>> at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket
>>> .nio.NioClientBoss.process(NioClientBoss.java:79)
>>> at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket
>>> .nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>>> at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket
>>> .nio.NioClientBoss.run(NioClientBoss.java:42)
>>> at org.apache.kudu.client.shaded.org.jboss.netty.util.ThreadRen
>>> amingRunnable.run(ThreadRenamingRunnable.java:108)
>>> at org.apache.kudu.client.shaded.org.jboss.netty.util.internal.
>>> DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool
>>> Executor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo
>>> lExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> Thanks,
>>> Ravi
>>>
>>> On 5 March 2018 at 23:52, Mike Percy  wrote:
>>>
>>>> Have you considered checking your session error count or pending errors
>>>> in your while loop every so often? Can you identify where your code is
>>>> hanging when the connection is lost (what line)?
>>>>
>>>> Mike
>>>>
>>>> On Mon, Mar 5, 2018 at 9:08 PM, Ravi Kanth 
>>>> wrote:
>>>>
>>>>> In addition to my previous comment, I raised a support ticket for this
>>>>> issue with Cloudera and one of the support person mentioned below,
>>>>>
>>>>> *"Thank you for clarifying, The exceptions are logged but not
>>>>> re-thrown to an upper layer, so that explains why the Spark application is
>>>>> not aware of the underlying error."*
>>>>>
>>>>> On 5 March 2018 at 21:02, Ravi Kanth  wrote:
>>>>>
>>>>>> Mike,
>>>>>>
>>>>>> Thanks for the information. But, once the connection to any of the
>>>>>> Kudu servers is lost then there is no way I can have a control on the
>>>>>> KuduSession object and so with getPendingErrors(). The KuduClient in this
>>>>>> case is becoming a zombie and never returned back till the connection is
>>>>>> properly established. I tried doing all that you have suggested with no
>>>>>> luck. Attaching my KuduClient code.
>>>>>>
>>>>>> package org.

Re: Spark Streaming + Kudu

2018-03-06 Thread Mike Percy
Thanks Ravi. Would you mind attaching the output of jstack on the process
during this hang? That would show what the Kudu client threads are doing,
as what we are seeing here is just the netty boss thread.

Mike

On Tue, Mar 6, 2018 at 8:52 AM, Ravi Kanth  wrote:

>
> Yes, I have debugged to find the root cause. Every logger before "table =
> client.openTable(tableName);" is executing fine and exactly at the point
> of opening the table, it is throwing the below exception and nothing is
> being executed after that. Still the Spark batches are being processed and
> at opening the table is failing. I tried catching it with no luck. Please
> find below the exception.
>
> 8/02/23 00:16:30 ERROR client.TabletClient: [Peer
> bd91f34d456a4eccaae50003c90f0fb2] Unexpected exception from downstream on
> [id: 0x6e13b01f]
> java.net.ConnectException: Connection refused: kudu102.dev.sac.int.
> threatmetrix.com/10.112.3.12:7050
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(
> SocketChannelImpl.java:717)
> at org.apache.kudu.client.shaded.org.jboss.netty.channel.
> socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
> at org.apache.kudu.client.shaded.org.jboss.netty.channel.
> socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
> at org.apache.kudu.client.shaded.org.jboss.netty.channel.
> socket.nio.NioClientBoss.process(NioClientBoss.java:79)
> at org.apache.kudu.client.shaded.org.jboss.netty.channel.socket.nio.
> AbstractNioSelector.run(AbstractNioSelector.java:337)
> at org.apache.kudu.client.shaded.org.jboss.netty.channel.
> socket.nio.NioClientBoss.run(NioClientBoss.java:42)
> at org.apache.kudu.client.shaded.org.jboss.netty.util.
> ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> at org.apache.kudu.client.shaded.org.jboss.netty.util.internal.
> DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Thanks,
> Ravi
>
> On 5 March 2018 at 23:52, Mike Percy  wrote:
>
>> Have you considered checking your session error count or pending errors
>> in your while loop every so often? Can you identify where your code is
>> hanging when the connection is lost (what line)?
>>
>> Mike
>>
>> On Mon, Mar 5, 2018 at 9:08 PM, Ravi Kanth 
>> wrote:
>>
>>> In addition to my previous comment, I raised a support ticket for this
>>> issue with Cloudera and one of the support person mentioned below,
>>>
>>> *"Thank you for clarifying, The exceptions are logged but not re-thrown
>>> to an upper layer, so that explains why the Spark application is not aware
>>> of the underlying error."*
>>>
>>> On 5 March 2018 at 21:02, Ravi Kanth  wrote:
>>>
>>>> Mike,
>>>>
>>>> Thanks for the information. But, once the connection to any of the Kudu
>>>> servers is lost then there is no way I can have a control on the
>>>> KuduSession object and so with getPendingErrors(). The KuduClient in this
>>>> case is becoming a zombie and never returned back till the connection is
>>>> properly established. I tried doing all that you have suggested with no
>>>> luck. Attaching my KuduClient code.
>>>>
>>>> package org.dwh.streaming.kudu.sparkkudustreaming;
>>>>
>>>> import java.util.HashMap;
>>>> import java.util.Iterator;
>>>> import java.util.Map;
>>>> import org.apache.hadoop.util.ShutdownHookManager;
>>>> import org.apache.kudu.client.*;
>>>> import org.apache.spark.api.java.JavaRDD;
>>>> import org.slf4j.Logger;
>>>> import org.slf4j.LoggerFactory;
>>>> import org.dwh.streaming.kudu.sparkkudustreaming.constants.SpecialN
>>>> ullConstants;
>>>>
>>>> public class KuduProcess {
>>>> private static Logger logger = LoggerFactory.getLogger(KuduPr
>>>> ocess.class);
>>>> private KuduTable table;
>>>> private KuduSession session;
>>>>
>>>>>> public static void upsertKudu(JavaRDD<Map<String, Object>> rdd, String
>>>>>> host, String tableName) {
>>>> rdd.foreachPartition(iterator -> {
>>>> RowErrorsAndOverflowStatus errors = upsertOpIterator(iterator,
>>>> tableName, host);
>>>> int errorCount = errors.getRowErrors().length;

Re: Spark Streaming + Kudu

2018-03-05 Thread Mike Percy
Have you considered checking your session error count or pending errors in
your while loop every so often? Can you identify where your code is hanging
when the connection is lost (what line)?
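
Concretely, something along these lines inside the write loop might work (a
rough sketch reusing the names from your earlier snippet, not tested):

while (iter.hasNext()) {
  upsertOp(iter.next());
  // Periodically surface buffered failures instead of discovering them at close().
  if (session.countPendingErrors() > 0) {
    for (RowError error : session.getPendingErrors().getRowErrors()) {
      logger.error("Kudu row error: {}", error);
    }
    throw new RuntimeException("Kudu writes are failing; aborting this partition");
  }
}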

Mike

On Mon, Mar 5, 2018 at 9:08 PM, Ravi Kanth  wrote:

> In addition to my previous comment, I raised a support ticket for this
> issue with Cloudera and one of the support person mentioned below,
>
> *"Thank you for clarifying, The exceptions are logged but not re-thrown to
> an upper layer, so that explains why the Spark application is not aware of
> the underlying error."*
>
> On 5 March 2018 at 21:02, Ravi Kanth  wrote:
>
>> Mike,
>>
>> Thanks for the information. But, once the connection to any of the Kudu
>> servers is lost then there is no way I can have a control on the
>> KuduSession object and so with getPendingErrors(). The KuduClient in this
>> case is becoming a zombie and never returned back till the connection is
>> properly established. I tried doing all that you have suggested with no
>> luck. Attaching my KuduClient code.
>>
>> package org.dwh.streaming.kudu.sparkkudustreaming;
>>
>> import java.util.HashMap;
>> import java.util.Iterator;
>> import java.util.Map;
>> import org.apache.hadoop.util.ShutdownHookManager;
>> import org.apache.kudu.client.*;
>> import org.apache.spark.api.java.JavaRDD;
>> import org.slf4j.Logger;
>> import org.slf4j.LoggerFactory;
>> import org.dwh.streaming.kudu.sparkkudustreaming.constants.SpecialN
>> ullConstants;
>>
>> public class KuduProcess {
>> private static Logger logger = LoggerFactory.getLogger(KuduPr
>> ocess.class);
>> private KuduTable table;
>> private KuduSession session;
>>
>> public static void upsertKudu(JavaRDD<Map<String, Object>> rdd, String
>> host, String tableName) {
>> rdd.foreachPartition(iterator -> {
>> RowErrorsAndOverflowStatus errors = upsertOpIterator(iterator, tableName,
>> host);
>> int errorCount = errors.getRowErrors().length;
>> if(errorCount > 0){
>> throw new RuntimeException("Failed to write " + errorCount + " messages
>> into Kudu");
>> }
>> });
>> }
>> private static RowErrorsAndOverflowStatus
>> upsertOpIterator(Iterator<Map<String, Object>> iter, String tableName, String host) {
>> try {
>> AsyncKuduClient asyncClient = KuduConnection.getAsyncClient(host);
>> KuduClient client = asyncClient.syncClient();
>> table = client.openTable(tableName);
>> session = client.newSession();
>> session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLU
>> SH_BACKGROUND);
>> while (iter.hasNext()) {
>> upsertOp(iter.next());
>> }
>> } catch (KuduException e) {
>> logger.error("Exception in upsertOpIterator method", e);
>> }
>> finally{
>> try {
>> session.close();
>> } catch (KuduException e) {
>> logger.error("Exception in Connection close", e);
>> }
>> }
>> return session.getPendingErrors();  // <-- Once the connection is lost,
>> this part of the code never gets called and the
>> Spark job will keep on running and processing the records while the
>> KuduClient is trying to connect to Kudu. Meanwhile, we are losing all the
>> records.
>> }
>> public static void upsertOp(Map<String, Object> formattedMap) {
>> if (formattedMap.size() != 0) {
>> try {
>> Upsert upsert = table.newUpsert();
>> PartialRow row = upsert.getRow();
>> for (Map.Entry<String, Object> entry : formattedMap.entrySet()) {
>> if (entry.getValue().getClass().equals(String.class)) {
>> if (entry.getValue().equals(SpecialNullConstants.specialStringNull))
>> row.setNull(entry.getKey());
>> else
>> row.addString(entry.getKey(), (String) entry.getValue());
>> } else if (entry.getValue().getClass().equals(Long.class)) {
>> if (entry.getValue().equals(SpecialNullConstants.specialLongNull))
>> row.setNull(entry.getKey());
>> else
>> row.addLong(entry.getKey(), (Long) entry.getValue());
>> } else if (entry.getValue().getClass().equals(Integer.class)) {
>> if (entry.getValue().equals(SpecialNullConstants.specialIntNull))
>> row.setNull(entry.getKey());
>> else
>> row.addInt(entry.getKey(), (Integer) entry.getValue());
>> }
>> }
>>
>> session.apply(upsert);
>> } catch (Exception e) {
>> logger.error("Exception during upsert:", e);
>> }
>> }
>> }
>> }
>> class KuduConnection {
>> private static Logger logger = LoggerFactory.getLogger(KuduCo
>> nnection.class);
>> private static Map asyncCache = new Has

Re: Spark Streaming + Kudu

2018-03-05 Thread Mike Percy
Hi Ravi, it would be helpful if you could attach what you are getting back
from getPendingErrors() -- perhaps from dumping RowError.toString() from
items in the returned array -- and indicate what you were hoping to get
back. Note that a RowError can also return to you the Operation
<https://kudu.apache.org/releases/1.6.0/apidocs/org/apache/kudu/client/RowError.html#getOperation-->
that you used to generate the write. From the Operation, you can get the
original PartialRow
<https://kudu.apache.org/releases/1.6.0/apidocs/org/apache/kudu/client/PartialRow.html>
object, which should be able to identify the affected row that the write
failed for. Does that help?
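
For instance, a rough sketch of dumping that information (the log format here
is arbitrary, and "session" and "logger" refer to the fields in your snippet):

RowErrorsAndOverflowStatus pending = session.getPendingErrors();
if (pending.isOverflowed()) {
  logger.warn("Error buffer overflowed; some row errors were discarded");
}
for (RowError error : pending.getRowErrors()) {
  // getOperation().getRow() identifies the exact row the failed write was for.
  logger.error("status={} row={}", error.getErrorStatus(), error.getOperation().getRow());
}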

Since you are using the Kudu client directly, Spark is not involved from
the Kudu perspective, so you will need to deal with Spark on your own in
that case.

Mike

On Mon, Mar 5, 2018 at 1:59 PM, Ravi Kanth  wrote:

> Hi Mike,
>
> Thanks for the reply. Yes, I am using AUTO_FLUSH_BACKGROUND.
>
> So, I am trying to use Kudu Client API to perform UPSERT into Kudu and I
> integrated this with Spark. I am trying to test a case where in if any of
> Kudu server fails. So, in this case, if there is any problem in writing,
> getPendingErrors() should give me a way to handle these errors so that I
> can successfully terminate my Spark Job. This is what I am trying to do.
>
> But, I am not able to get a hold of the exceptions being thrown from with
> in the KuduClient when retrying to connect to Tablet Server. My
> getPendingErrors is not getting ahold of these exceptions.
>
> Let me know if you need more clarification. I can post some Snippets.
>
> Thanks,
> Ravi
>
> On 5 March 2018 at 13:18, Mike Percy  wrote:
>
>> Hi Ravi, are you using AUTO_FLUSH_BACKGROUND
>> <https://kudu.apache.org/releases/1.6.0/apidocs/org/apache/kudu/client/SessionConfiguration.FlushMode.html>?
>> You mention that you are trying to use getPendingErrors()
>> <https://kudu.apache.org/releases/1.6.0/apidocs/org/apache/kudu/client/KuduSession.html#getPendingErrors-->
>>  but
>> it sounds like it's not working for you -- can you be more specific about
>> what you expect and what you are observing?
>>
>> Thanks,
>> Mike
>>
>>
>>
>> On Mon, Feb 26, 2018 at 8:04 PM, Ravi Kanth 
>> wrote:
>>
>>> Thank Clifford. We are running Kudu 1.4 version. Till date we didn't see
>>> any issues in production and we are not losing tablet servers. But, as part
>>> of testing I have to generate few unforeseen cases to analyse the
>>> application performance. One among that is bringing down the tablet server
>>> or master server intentionally during which I observed the loss of records.
>>> Just wanted to test cases out of the happy path here. Once again thanks for
>>> taking time to respond to me.
>>>
>>> - Ravi
>>>
>>> On 26 February 2018 at 19:58, Clifford Resnick 
>>> wrote:
>>>
>>>> I'll have to get back to you on the code bits, but I'm pretty sure
>>>> we're doing simple sync batching. We're not in production yet, but after
>>>> some months of development I haven't seen any failures, even when pushing
>>>> load doing multiple years' backfill. I think the real question is why are
>>>> you losing tablet servers? The only instability we ever had with Kudu was
>>>> when it had that weird ntp sync issue that was fixed I think for 1.6. What
>>>> version are you running?
>>>>
>>>> Anyway I would think that infinite loop should be catchable somewhere.
>>>> Our pipeline is set to fail/retry with Flink snapshots. I imagine there is
>>>> similar with Spark. Sorry I cant be of more help!
>>>>
>>>>
>>>>
>>>> On Feb 26, 2018 9:10 PM, Ravi Kanth  wrote:
>>>>
>>>> Cliff,
>>>>
>>>> Thanks for the response. Well, I do agree that it's simple and seamless.
>>>> In my case, I am able to upsert ~25000 events/sec into Kudu. But, I am
>>>> facing the problem when any of the Kudu Tablet or master server is down. I
>>>> am not able to get a hold of the exception from client. The client is going
>>>> into an infinite loop trying to connect to Kudu. Meanwhile, I am losing my
>>>> records. I tried handling the errors through getPendingErrors() but still
>>>> it is helpless. I am using AsyncKuduClient to establish the connection and
>>>> retrieving the syncClient from the Async to open the session and table. Any
>>>> help?
>>>>
>>>> Thanks,
>>>> Ravi
>>>>
>>>> On 26 February 2018 at 18:00, Cliff Resnick  wrote:
>>>>
>>>> While I can't speak for Spark, we do use the client API from Flink
>>>> streaming and it's simple and seamless. It's especially nice if you require
>>>> an Upsert semantic.
>>>>
>>>> On Feb 26, 2018 7:51 PM, "Ravi Kanth"  wrote:
>>>>
>>>> Hi,
>>>>
>>>> Anyone using Spark Streaming to ingest data into Kudu and using Kudu
>>>> Client API to do so rather than the traditional KuduContext API? I am stuck
>>>> at a point and couldn't find a solution.
>>>>
>>>> Thanks,
>>>> Ravi
>>>>
>>>>
>>>>
>>>>
>>>
>>
>


Re: Spark Streaming + Kudu

2018-03-05 Thread Mike Percy
Hi Ravi, are you using AUTO_FLUSH_BACKGROUND
?
You mention that you are trying to use getPendingErrors()

but
it sounds like it's not working for you -- can you be more specific about
what you expect and what you are observing?

Thanks,
Mike



On Mon, Feb 26, 2018 at 8:04 PM, Ravi Kanth  wrote:

> Thank Clifford. We are running Kudu 1.4 version. Till date we didn't see
> any issues in production and we are not losing tablet servers. But, as part
> of testing I have to generate few unforeseen cases to analyse the
> application performance. One among that is bringing down the tablet server
> or master server intentionally during which I observed the loss of records.
> Just wanted to test cases out of the happy path here. Once again thanks for
> taking time to respond to me.
>
> - Ravi
>
> On 26 February 2018 at 19:58, Clifford Resnick 
> wrote:
>
>> I'll have to get back to you on the code bits, but I'm pretty sure we're
>> doing simple sync batching. We're not in production yet, but after some
>> months of development I haven't seen any failures, even when pushing load
>> doing multiple years' backfill. I think the real question is why are you
>> losing tablet servers? The only instability we ever had with Kudu was when
>> it had that weird ntp sync issue that was fixed I think for 1.6. What
>> version are you running?
>>
>> Anyway I would think that infinite loop should be catchable somewhere.
>> Our pipeline is set to fail/retry with Flink snapshots. I imagine there is
>> similar with Spark. Sorry I cant be of more help!
>>
>>
>>
>> On Feb 26, 2018 9:10 PM, Ravi Kanth  wrote:
>>
>> Cliff,
>>
>> Thanks for the response. Well, I do agree that it's simple and seamless.
>> In my case, I am able to upsert ~25000 events/sec into Kudu. But, I am
>> facing the problem when any of the Kudu Tablet or master server is down. I
>> am not able to get a hold of the exception from client. The client is going
>> into an infinite loop trying to connect to Kudu. Meanwhile, I am losing my
>> records. I tried handling the errors through getPendingErrors() but still
>> it is helpless. I am using AsyncKuduClient to establish the connection and
>> retrieving the syncClient from the Async to open the session and table. Any
>> help?
>>
>> Thanks,
>> Ravi
>>
>> On 26 February 2018 at 18:00, Cliff Resnick  wrote:
>>
>> While I can't speak for Spark, we do use the client API from Flink
>> streaming and it's simple and seamless. It's especially nice if you require
>> an Upsert semantic.
>>
>> On Feb 26, 2018 7:51 PM, "Ravi Kanth"  wrote:
>>
>> Hi,
>>
>> Anyone using Spark Streaming to ingest data into Kudu and using Kudu
>> Client API to do so rather than the traditional KuduContext API? I am stuck
>> at a point and couldn't find a solution.
>>
>> Thanks,
>> Ravi
>>
>>
>>
>>
>


Re: swap data in Kudu table

2018-02-23 Thread Mike Percy
Hi Boris, those are good ideas. Currently Kudu does not have atomic bulk
load capabilities or staging abilities. Theoretically renaming a partition
atomically shouldn't be that hard to implement, since it's just a master
metadata operation which can be done atomically, but it's not yet
implemented.

There is a JIRA to track a generic bulk load API here:
https://issues.apache.org/jira/browse/KUDU-1370

Since I couldn't find anything to track the specific features you
mentioned, I just filed the following improvement JIRAs so we can track them:

   - KUDU-2326: Support atomic bulk load operation
   
   - KUDU-2327: Support atomic swap of tables or partitions
   
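
In the meantime, the rename-based workaround you mention below can be expressed
in Impala along these lines (table names are hypothetical, and as you note,
statistics and permissions on the old table do not carry over):

ALTER TABLE my_table RENAME TO my_table_old;
ALTER TABLE my_table_staging RENAME TO my_table;
DROP TABLE my_table_old;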

Mike

On Thu, Feb 22, 2018 at 6:39 AM, Boris Tyukin  wrote:

> Hello,
>
> I am trying to figure out the best and safest way to swap data in a
> production Kudu table with data from a staging table.
>
> Basically, once in a while we need to perform a full reload of some tables
> (once in a few months). These tables are pretty large with billions of rows
> and we want to minimize the risk and downtime for users if something bad
> happens in the middle of that process.
>
> With Hive and Impala on HDFS, we can use a very cool handy command LOAD
> DATA INPATH. We can prepare data for reload in a staging table upfront and
> this process might take many hours. Once staging table is ready, we can
> issue LOAD DATA INPATH command which will move underlying HDFS files to a
> production table - this operation is almost instant and the very last step
> in our pipeline.
>
> Alternatively, we can swap partitions using ALTER TABLE EXCHANGE PARTITION
> command.
>
> Now with Kudu, I cannot seem to find a good strategy. The only thing came
> to my mind is to drop the production table and rename a staging table to
> production table as the last step of the job, but in this case we are going
> to lose statistics and security permissions.
>
> Any other ideas?
>
> Thanks!
> Boris
>


Re: Kudu-help needed

2018-02-23 Thread Mike Percy
Hi Pranab,
Sorry for missing your post earlier. To use Kerberos authentication, you
will need to kinit to the KDC in the shell before starting the client, or
use something like the Java Krb5LoginModule class.
The Kudu Java client will take the credentials from the environment or the
security context.

The Kudu java API docs can be found here:
https://kudu.apache.org/releases/1.6.0/apidocs/
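
For example, a minimal smoke test might look like this (the principal and
master address are placeholders):

// Run `kinit etl_user@EXAMPLE.COM` in the shell first; the Java client then
// picks the credentials up from the default Kerberos ticket cache.
import org.apache.kudu.client.KuduClient;

public class KerberosSmokeTest {
  public static void main(String[] args) throws Exception {
    KuduClient client =
        new KuduClient.KuduClientBuilder("master-1.example.com:7051").build();
    System.out.println(client.getTablesList().getTablesList());
    client.close();
  }
}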

Hope that helps,
Mike

On Wed, Feb 21, 2018 at 12:24 PM, Pranab Batsa  wrote:

> Hi, I am very new to Kudu and to distributed databases. I am
> trying to make a Java application using the kudu-client 1.4 API. Can you please
> help me out: how can I implement authentication using the API, like
> Kerberos?
>


Re: [ANNOUNCE] New committers over past several months

2017-12-18 Thread Mike Percy
Well deserved for all! Congratulations belated and otherwise to Andrew, Grant, 
and Hao!

Mike

> On Dec 18, 2017, at 9:00 PM, Todd Lipcon  wrote:
> 
> Hi Kudu community,
> 
> I'm pleased to announce that the Kudu PMC has voted to add Andrew Wong,
> Grant Henke, and Hao Hao as Kudu committers and PMC members. This
> announcement is a bit delayed, but I figured it's better late than never!
> 
> Andrew has contributed to Kudu in a bunch of areas. Most notably, he
> authored a bunch of optimizations for predicate evaluation on the read
> path, and recently has led the effort to introduce better tolerance of disk
> failures within the tablet server. In addition to code, Andrew has been a
> big help with questions on the user mailing list, Slack, and elsewhere.
> 
> Grant's contributions have spanned several areas. Notably, he made a bunch
> of improvements to our Java and Scala builds -- an area where others might
> be shy. He also implemented checksum verification for data blocks and has
> begun working on a design for DECIMAL, one of the most highly-requested
> features.
> 
> Hao has also been contributing to Kudu for quite some time. Her notable
> contributions include improved fault tolerance for the Java client, fixes
> and optimizations on the Spark integration, and some important refactorings
> and performance optimizations in the block layer. Hao has also represented
> the community by giving talks about Kudu at a conference in China.
> 
> Please join me in congratulating the new committers and PMC members!
> 
> -Todd



Re: 关于Kudu Mapreduce 程序的问题 (Question about a Kudu MapReduce program)

2017-12-13 Thread Mike Percy
Can you please post your code and explain the problem you're seeing?

On Tue, Dec 12, 2017 at 11:00 PM, zha...@broadtech.com.cn <
zha...@broadtech.com.cn> wrote:

> Hi,
>  I have a MapReduce program that uses KuduTableInputFormat to read
> data from a Kudu table and write it to another Kudu
> table. When I read from some tables, it gets all the data, but for some
> tables it can only get a small part of the data.
> I found some patterns through testing:
> 1. The problematic tables were created a few months ago; the
> tables with no problem are new
> 2. For the problem tables, impala-shell can get all the data
> 3. When some of the data in a problem table is exported to a new table,
> all the data can be found
> 4. According to kudu cluster ksck, all the tables are normal
> I want to know what caused this problem. Can you help me?
> --
> zha...@broadtech.com.cn
>


[ANNOUNCE] Apache Kudu 1.6.0 released

2017-12-07 Thread Mike Percy
The Apache Kudu team is happy to announce the release of Kudu 1.6.0.

Kudu is an open source storage engine for structured data that supports
low-latency random access together with efficient analytical access
patterns.
It is designed within the context of the Apache Hadoop ecosystem and
supports
many integrations with other data analytics projects both inside and
outside of
the Apache Software Foundation.

Apache Kudu 1.6.0 is a minor release that offers several new features,
improvements, optimizations, and bug fixes. Please see the release notes for
details.

Download it here: https://kudu.apache.org/releases/1.6.0/
Full release notes:
https://kudu.apache.org/releases/1.6.0/docs/release_notes.html

Regards,
The Apache Kudu team


Re: Confused where to post user type questions

2017-11-29 Thread Mike Percy
Hi Boris,
Thanks again for asking about this and I'm happy that you enjoyed listening
to me blab about Kudu! Mark Rittman and the Roaring Elephant guys were very
kind and fun to talk to. I'll note that I think the more recent one (with
Roaring Elephant) had the better audio quality of the two recordings as a
result of the equipment I used.

Mike

On Wed, Nov 29, 2017 at 5:24 PM, Boris Tyukin  wrote:

> totally makes sense to me. thanks Mike and Andrew.
>
> Mike, on a side note, I was just listening to the Drill to Detail and
> Roaring Elephant podcasts featuring you :) you did a really great job
> explaining Kudu's role in Big Data ecosystem, I enjoyed both episodes and
> they were one year apart I think so it was interesting to see how Kudu had
> been evolving over the past year.
>
> Thanks,
> Boris
>
> On Wed, Nov 29, 2017 at 5:50 PM, Mike Percy  wrote:
>
>> Hi Boris,
>> Here's my 2 cents. To some extent, chat vs email is a matter of personal
>> preference and we try to support both.
>>
>> Personally I think Slack is nice for instant feedback when you can get
>> it, but email lists are better for questions. Chat channels are a kind of
>> stream-of-conversations and I often find that it's easy to miss someone's
>> comment or question while I'm in the middle of a discussion or when it's
>> been a busy day and there was a lot of activity while I was away.
>>
>> Email threads have subject lines that make them hard to miss, plus they
>> are indexed by Google, which is helpful for others who have the same
>> question in the future. My recommendation would be to use this email list
>> as much as you're comfortable with, and I hope we can encourage more people
>> to use it because of the previously-stated benefits as well as the ability
>> to communicate with people who are not in your local time zone.
>>
>> Regarding the Cloudera forums, it's not something I'd recommend in an
>> Apache context because we can't rely on it for Apache releases. Only
>> Cloudera's software releases are supported there. We need to provide an
>> avenue to support Apache software releases, so this email list (
>> user@kudu.apache.org) and Slack provide the basis for that.
>>
>> Hope that helps. Thank you for asking this question and please continue
>> to raise any concerns with us when you're unable to get the help you need.
>>
>> Mike
>>
>> On Wed, Nov 29, 2017 at 9:19 AM, Andrew Wong  wrote:
>>
>>> Hi Boris,
>>>
>>> Thanks for reaching out! Yeah, currently the most active place to ask
>>> questions is the Kudu slack #kudu-general channel. Sometimes we talk about
>>> dev stuff, but it is also a place for user questions. Given its activity,
>>> sometimes user questions fall through the cracks, although we try to avoid
>>> this as much as possible.
>>>
>>> You raise a good point though: for a new user, it might seem like the
>>> wrong place to ask questions if there are a bunch of dev conversations
>>> going on. There have been discussions in the past to migrate those
>>> discussions to a #kudu-dev or something similar. Would be interested in
>>> seeing whether others think it's time to bring this to fruition.
>>>
>>> I should also point out that the Cloudera Community forums are also a
>>> nice platform for Q&A. There's a board for Impala, where Kudu questions are
>>> often asked, so feel free to ask questions there too!
>>>
>>> On Wed, Nov 29, 2017 at 7:03 AM, Boris Tyukin 
>>> wrote:
>>>
>>>> Hi folks,
>>>>
>>>> as a new user to Kudu, it is confusing what is the best venue to post
>>>> user type questions about Kudu which is important for any thriving open
>>>> source project. I have posted some questions on slack and got a feeling
>>>> they were not welcome there as discussions on slack seem to be focused on
>>>> development. I can see that slack group is very active though.
>>>>
>>>> the user group is not that active though with like 10 email threads
>>>> this month.
>>>>
>>>> Can someone clarify this for us newcomers?
>>>>
>>>> We also have official channel for paying CDH customers but there are
>>>> benefits to use informal ones :)
>>>>
>>>> Thanks for such an amazing product and everything you do!
>>>>
>>>> Boris
>>>>
>>>
>>>
>>>
>>> --
>>> Andrew Wong
>>>
>>
>>
>


Re: Confused where to post user type questions

2017-11-29 Thread Mike Percy
Hi Boris,
Here's my 2 cents. To some extent, chat vs email is a matter of personal
preference and we try to support both.

Personally I think Slack is nice for instant feedback when you can get it,
but email lists are better for questions. Chat channels are a kind of
stream-of-conversations and I often find that it's easy to miss someone's
comment or question while I'm in the middle of a discussion or when it's
been a busy day and there was a lot of activity while I was away.

Email threads have subject lines that make them hard to miss, plus they are
indexed by Google, which is helpful for others who have the same question
in the future. My recommendation would be to use this email list as much as
you're comfortable with, and I hope we can encourage more people to use it
because of the previously-stated benefits as well as the ability to
communicate with people who are not in your local time zone.

Regarding the Cloudera forums, it's not something I'd recommend in an
Apache context because we can't rely on it for Apache releases. Only
Cloudera's software releases are supported there. We need to provide an
avenue to support Apache software releases, so this email list (
user@kudu.apache.org) and Slack provide the basis for that.

Hope that helps. Thank you for asking this question and please continue to
raise any concerns with us when you're unable to get the help you need.

Mike

On Wed, Nov 29, 2017 at 9:19 AM, Andrew Wong  wrote:

> Hi Boris,
>
> Thanks for reaching out! Yeah, currently the most active place to ask
> questions is the Kudu slack #kudu-general channel. Sometimes we talk about
> dev stuff, but it is also a place for user questions. Given its activity,
> sometimes user questions fall through the cracks, although we try to avoid
> this as much as possible.
>
> You raise a good point though: for a new user, it might seem like the
> wrong place to ask questions if there are a bunch of dev conversations
> going on. There have been discussions in the past to migrate those
> discussions to a #kudu-dev or something similar. Would be interested in
> seeing whether others think it's time to bring this to fruition.
>
> I should also point out that the Cloudera Community forums are also a nice
> platform for Q&A. There's a board for Impala, where Kudu questions are
> often asked, so feel free to ask questions there too!
>
> On Wed, Nov 29, 2017 at 7:03 AM, Boris Tyukin 
> wrote:
>
>> Hi folks,
>>
>> as a new user to Kudu, it is confusing what is the best venue to post
>> user type questions about Kudu which is important for any thriving open
>> source project. I have posted some questions on slack and got a feeling
>> they were not welcome there as discussions on slack seem to be focused on
>> development. I can see that slack group is very active though.
>>
>> the user group is not that active though with like 10 email threads this
>> month.
>>
>> Can someone clarify this for us newcomers?
>>
>> We also have official channel for paying CDH customers but there are
>> benefits to use informal ones :)
>>
>> Thanks for such an amazing product and everything you do!
>>
>> Boris
>>
>
>
>
> --
> Andrew Wong
>


Re: [DISCUSS] Move Slack discussions to ASF official slack?

2017-10-23 Thread Mike Percy
Users will likely be confused if they have to switch Slack instances. We 
switched over to ASF mailing lists over a year ago and we still get requests to 
join the old pre-ASF user mailing list sometimes.

Unfortunately the Slack-In inviter bot doesn’t allow you to invite people to a 
particular room without a paid account. It has to go to the default room for 
the whole instance. Maybe we could ask Slack if it’s possible to get an 
exception for the ASF.

That said, if it’s not strictly better than what we have then I don’t see a 
real benefit in switching.

Mike

> On Oct 24, 2017, at 8:22 AM, Todd Lipcon  wrote:
> 
>> On Mon, Oct 23, 2017 at 4:12 PM, Misty Stanley-Jones  
>> wrote:
>> 1.  I have no idea, but you could enable the @all at-mention in the existing 
>> #kudu-general and let people know that way. Also see my next answer.
>> 
> 
> Fair enough.
>  
>> 2.  It looks like if you have an apache.org email address you don't need an 
>> invite, but otherwise an existing member needs to invite you. If you can 
>> somehow get all the member email addresses, you can invite them all at once 
>> as a comma-separated list.
> 
> I'm not sure if that's doable but potentially.
> 
> I'm concerned though if we don't have auto-invite for arbitrary community 
> members who just come by a link from our website. A good portion of our 
> traffic is users, rather than developers, and by-and-large they don't have 
> apache.org addresses. If we closed the Slack off to them I think we'd lose a 
> lot of the benefit.
>  
>> 
>> 3.  I can't tell what access there is to integrations. I can try to find out 
>> who administers that on ASF infra and get back with you. I would not be 
>> surprised if integrations with the ASF JIRA were already enabled.
>> 
>> I pre-emptively grabbed #kudu on the ASF slack in case we decide to go 
>> forward with this. If we don't decide to go forward with it, it's a good 
>> idea to hold onto the channel and pin a message in there about how to get to 
>> the "official" Kudu slack.
>> 
>>> On Mon, Oct 23, 2017 at 3:00 PM, Todd Lipcon  wrote:
>>> A couple questions about this:
>>> 
>>> - is there any way we can email out to our existing Slack user base to 
>>> invite them to move over? We have 866 members on our current slack and 
>>> would be a shame if people got confused as to where to go for questions.
>>> 
>>> - does the ASF slack now have a functioning self-serve "auto-invite" 
>>> service?
>>> 
>>> - will we still be able to set up integrations like JIRA/github?
>>> 
>>> -Todd
>>> 
 On Mon, Oct 23, 2017 at 2:53 PM, Misty Stanley-Jones  
 wrote:
 When we first started using Slack, I don't think the ASF Slack instance
 existed. Using our own Slack instance means that we have limited access to
 message archives (unless we pay) and that people who work on multiple ASF
 projects need to add the Kudu slack in addition to any other Slack
 instances they may be on. I propose that we instead create one or more
 Kudu-related channels on the official ASF slack (http://the-asf.slack.com/)
 and migrate our discussions there. What does everyone think?
>>> 
>>> 
>>> 
>>> -- 
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>> 
> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera


Re: Change Data Capture (CDC) with Kudu

2017-09-22 Thread Mike Percy
Franco,
I just realized that I suggested something you mentioned in your initial
email. My mistake for not reading through to the end. It is probably the
least-worst approach right now and it's probably what I would do if I were
you.

Mike

On Fri, Sep 22, 2017 at 2:29 PM, Mike Percy  wrote:

> CDC is something that I would like to see in Kudu but we aren't there yet
> with the underlying support in the Raft Consensus implementation. Once we
> have higher availability re-replication support (KUDU-1097) we will be a
> bit closer for a solution involving traditional WAL streaming to an
> external consumer because we will have support for non-voting replicas. But
> there would still be plenty of work to do to support CDC after that, at
> least from an API perspective as well as a WAL management perspective (how
> long to keep old log files).
>
> That said, what you really are asking for is a streaming backup solution,
> which may or may not use the same mechanism (unfortunately it's not
> designed or implemented yet).
>
> As an alternative to Adar's suggestions, a reasonable option for you at
> this time may be an incremental backup. It takes a little schema design to
> do it, though. You could consider doing something like the following:
>
>1. Add a last_updated column to all your tables and update the column
>when you change the value. Ideally monotonic across the cluster but you
>could also go with local time and build in a "fudge factor" when reading in
>step 2
>2. Periodically scan the table for any changes newer than the previous
>scan in the last_updated column. This type of scan is more efficient to do
>in Kudu than in many other systems. With Impala you could run a query like:
>select * from table1 where last_updated > $prev_updated;
>3. Dump the results of this query to parquet
>4. Use distcp to copy the parquet files over to the other cluster
>periodically (maybe you can throttle this if needed to avoid saturating the
>pipe)
>5. Upsert the parquet data into Kudu on the remote end
>
> Hopefully some workaround like this would work for you until Kudu has a
> reliable streaming backup solution.
>
> Like Adar said, as an Apache project we are always open to contributions
> and it would be great to get some in this area. Please reach out if you're
> interested in collaborating on a design.
>
> Mike
>
> On Fri, Sep 22, 2017 at 10:43 AM, Adar Lieber-Dembo 
> wrote:
>
>> Franco,
>>
>> Thanks for the detailed description of your problem.
>>
>> I'm afraid there's no such mechanism in Kudu today. Mining the WALs seems
>> like a path fraught with land mines. Kudu GCs WAL segments aggressively so
>> I'd be worried about a listening mechanism missing out on some row
>> operations. Plus the WAL is Raft-specific as it includes both REPLICATE
>> messages (reflecting a Write RPC from a client) and COMMIT messages
>> (written out when a majority of replicas have written a REPLICATE); parsing
>> and making sense of this would be challenging. Perhaps you could build
>> something using Linux's inotify system for receiving file change
>> notifications, but again I'd be worried about missing certain updates.
>>
>> Another option is to replicate the data at the OS level. For example, you
>> could periodically rsync the entire cluster onto a standby cluster. There's
>> bound to be data loss in the event of a failover, but I don't think you'll
>> run into any corruption (though Kudu does take advantage of sparse files
>> and hole punching, so you should verify that any tool you use supports
>> that).
>>
>> Disaster Recovery is an oft-requested feature, but one that Kudu
> developers have been unable to prioritize yet. Would you or someone on
>> your team be interested in working on this?
>>
>> On Thu, Sep 21, 2017 at 7:12 PM Franco Venturi 
>> wrote:
>>
>>> We are planning for a 50-100TB Kudu installation (about 200 tables or
>>> so).
>>>
>>> One of the requirements that we are working on is to have a secondary
>>> copy of our data in a Disaster Recovery data center in a different location.
>>>
>>>
>>> Since we are going to have inserts, updates, and deletes (for instance
>>> in the case the primary key is changed), we are trying to devise a process
>>> that will keep the secondary instance in sync with the primary one. The two
>>> instances do not have to be identical in real-time (i.e. we are not looking
>>> for synchronous writes to Kudu), but we would like to have some pretty g

Re: Change Data Capture (CDC) with Kudu

2017-09-22 Thread Mike Percy
CDC is something that I would like to see in Kudu but we aren't there yet
with the underlying support in the Raft Consensus implementation. Once we
have higher availability re-replication support (KUDU-1097) we will be a
bit closer for a solution involving traditional WAL streaming to an
external consumer because we will have support for non-voting replicas. But
there would still be plenty of work to do to support CDC after that, at
least from an API perspective as well as a WAL management perspective (how
long to keep old log files).

That said, what you really are asking for is a streaming backup solution,
which may or may not use the same mechanism (unfortunately it's not
designed or implemented yet).

As an alternative to Adar's suggestions, a reasonable option for you at
this time may be an incremental backup. It takes a little schema design to
do it, though. You could consider doing something like the following:

   1. Add a last_updated column to all your tables and update the column
   when you change the value. Ideally monotonic across the cluster but you
   could also go with local time and build in a "fudge factor" when reading in
   step 2
   2. Periodically scan the table for any changes newer than the previous
   scan in the last_updated column. This type of scan is more efficient to do
   in Kudu than in many other systems. With Impala you could run a query like:
   select * from table1 where last_updated > $prev_updated;
   3. Dump the results of this query to parquet
   4. Use distcp to copy the parquet files over to the other cluster
   periodically (maybe you can throttle this if needed to avoid saturating the
   pipe)
   5. Upsert the parquet data into Kudu on the remote end
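
To make step 2 concrete, below is a rough sketch of the incremental scan using the
Kudu Java client instead of Impala. It assumes the last_updated column from step 1 is
an INT64 (e.g. microseconds since epoch); the table name, master address, and
prevUpdated watermark are placeholders for illustration only.

import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduPredicate;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResult;

public class IncrementalScanSketch {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      KuduTable table = client.openTable("table1");
      long prevUpdated = 0L; // watermark saved from the previous incremental run (placeholder)
      // Only fetch rows whose last_updated is newer than the previous run (step 2).
      KuduPredicate newerThan = KuduPredicate.newComparisonPredicate(
          table.getSchema().getColumn("last_updated"),
          KuduPredicate.ComparisonOp.GREATER, prevUpdated);
      KuduScanner scanner = client.newScannerBuilder(table)
          .addPredicate(newerThan)
          .build();
      while (scanner.hasMoreRows()) {
        for (RowResult row : scanner.nextRows()) {
          // Step 3: write the changed row out (e.g. to Parquet) for shipping via distcp.
        }
      }
      scanner.close();
    } finally {
      client.close();
    }
  }
}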

Hopefully some workaround like this would work for you until Kudu has a
reliable streaming backup solution.

Like Adar said, as an Apache project we are always open to contributions
and it would be great to get some in this area. Please reach out if you're
interested in collaborating on a design.

Mike

On Fri, Sep 22, 2017 at 10:43 AM, Adar Lieber-Dembo 
wrote:

> Franco,
>
> Thanks for the detailed description of your problem.
>
> I'm afraid there's no such mechanism in Kudu today. Mining the WALs seems
> like a path fraught with land mines. Kudu GCs WAL segments aggressively so
> I'd be worried about a listening mechanism missing out on some row
> operations. Plus the WAL is Raft-specific as it includes both REPLICATE
> messages (reflecting a Write RPC from a client) and COMMIT messages
> (written out when a majority of replicas have written a REPLICATE); parsing
> and making sense of this would be challenging. Perhaps you could build
> something using Linux's inotify system for receiving file change
> notifications, but again I'd be worried about missing certain updates.
>
> Another option is to replicate the data at the OS level. For example, you
> could periodically rsync the entire cluster onto a standby cluster. There's
> bound to be data loss in the event of a failover, but I don't think you'll
> run into any corruption (though Kudu does take advantage of sparse files
> and hole punching, so you should verify that any tool you use supports
> that).
>
> Disaster Recovery is an oft-requested feature, but one that Kudu
> developers have been unable to prioritize yet. Would you or someone on
> your team be interested in working on this?
>
> On Thu, Sep 21, 2017 at 7:12 PM Franco Venturi 
> wrote:
>
>> We are planning for a 50-100TB Kudu installation (about 200 tables or so).
>>
>> One of the requirements that we are working on is to have a secondary
>> copy of our data in a Disaster Recovery data center in a different location.
>>
>>
>> Since we are going to have inserts, updates, and deletes (for instance in
>> the case the primary key is changed), we are trying to devise a process
>> that will keep the secondary instance in sync with the primary one. The two
>> instances do not have to be identical in real-time (i.e. we are not looking
>> for synchronous writes to Kudu), but we would like to have some pretty good
>> confidence that the secondary instance contains all the changes that the
>> primary has up to say an hour before (or something like that).
>>
>>
>> So far we considered a couple of options:
>> - refreshing the seconday instance with a full copy of the primary one
>> every so often, but that would mean having to transfer say 50TB of data
>> between the two locations every time, and our network bandwidth constraints
>> would prevent to do that even on a daily basis
>> - having a column that contains the most recent time a row was updated,
>> however this column couldn't be part of the primary key (because the
>> primary key in Kudu is immutable), and therefore finding which rows have
>> been changed every time would require a full scan of the table to be
>> sync'd. It would also rely on the "last update timestamp" column to be
>> always updated by the application (an assumption that we would l

Re: kudu-tserver died suddenly

2017-06-05 Thread Mike Percy
It seems unrelated. What release did you cherry-pick on top of?

INFO log from around that timeframe would be useful.

Mike

On Mon, Jun 5, 2017 at 6:38 PM, Jason Heo  wrote:

> Hello.
>
> I'm using this patch https://gerrit.cloudera.org/#/c/6925/
>
> One of tservers died suddenly. Here is ERROR and FATAL log.
>
> E0605 15:04:33.376554 138642 tablet.cc:1219] T
> 3cca831acf744e1daee72582b8e16dc4 P 125dbd2ffb8a401bb7e4fd982995ccf8:
> Rowset selected for compaction but not available anymore: RowSet(150)
>
> E0605 15:04:33.376605 138642 tablet.cc:1219] T
> 3cca831acf744e1daee72582b8e16dc4 P 125dbd2ffb8a401bb7e4fd982995ccf8:
> Rowset selected for compaction but not available anymore: RowSet(59)
>
> E0605 15:04:33.376615 138642 tablet.cc:1219] T
> 3cca831acf744e1daee72582b8e16dc4 P 125dbd2ffb8a401bb7e4fd982995ccf8:
> Rowset selected for compaction but not available anymore: RowSet(60)
>
> F0605 15:04:33.377100 138642 tablet.cc:1222] T
> 3cca831acf744e1daee72582b8e16dc4 P 125dbd2ffb8a401bb7e4fd982995ccf8: Was
> unable to find all rowsets selected for compaction
>
>
> 
>
>
> Could I know what's the problem? Feel free to ask any information to
> resolve it.
>
>
> Thank,
>
>
> Jason
>


Re: Table size is not decreasing after large amount of rows deleted.

2017-04-24 Thread Mike Percy
Yep, that's right -- currently the only thing that reclaims space taken by
deleted rows is a RowSet merge compaction. We haven't added any logic to
trigger those based on the number of deleted rows in a RowSet; they are
currently only triggered by logic which tries to merge RowSets with
overlapping key ranges (see https://github.com/apache
/kudu/blob/master/docs/design-docs/compaction-policy.md#
intuition-behind-compaction-selection-policy and BudgetedCompactionPolicy::
PickRowSets()).

The follow-up work to add a background task to permanently remove deleted
rows is being tracked in https://issues.apache.org/jira/browse/KUDU-1979
(which I just filed).

Mike

On Mon, Apr 24, 2017 at 12:37 PM, Todd Lipcon  wrote:

> Mike can correct me if wrong, but I think the background task in 1.3 is
> only responsible for removing old deltas, and doesn't do anything to try to
> trigger compactions on rowsets with a high percentage of deleted _rows_.
>
> That's a separate bit of work that hasn't been started yet.
>
> -Todd
>
> On Sat, Apr 22, 2017 at 7:36 PM, Jason Heo 
> wrote:
>
>> Hi David.
>>
>> Thank you for your reply.
>>
>> I'll try to upgrade to 1.3 this week.
>>
>> Regards,
>>
>> Jason
>>
>> 2017-04-23 2:06 GMT+09:00 :
>>
>>> Hi Jason
>>>
>>>   In Kudu 1.2 if there are compactions happening, they will reclaim
>>> space. Unfortunately the conditions for this to happen don't always
>>> occur (if the portion of the keyspace where the deletions occurred
>>> stopped receiving writes and was already fully compacted cleanup is
>>> more unlikely)
>>>   In Kudu 1.3 we added a background task to clean up old data even in
>>> the absence of compactions. Could you upgrade?
>>>
>>> Best
>>> David
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>


Re: Number of data files and opened file descriptors are not decreasing after DROP TABLE.

2017-04-24 Thread Mike Percy
HI Jason,
I would strongly recommend upgrading to Kudu 1.3.1 as 1.3.0 has a serious
data-loss bug related to re-replication. Please see
https://kudu.apache.org/releases/1.3.1/docs/release_notes.html (if you are using the Cloudera
version of 1.3.0, no need to worry because it includes the fix for that
bug).

In 1.3.0 and 1.3.1 you should be able to use the "kudu fs check" tool to
see if you have orphaned blocks. If you do, you could use the --repair
argument to that tool to repair it if you bring your tablet server offline.

That said, Kudu uses hole punching to delete data and the same container
files may remain open even after removing data. After dropping tables, you
should see disk usage at the file system level drop.

I'm not sure that I've answered all your questions. If you have specific
concerns, please let us know what you are worried about.

Mike

On Sun, Apr 23, 2017 at 11:43 PM, Jason Heo  wrote:

> Hi.
>
> Before dropping, there were about 30 tables, 27,000 files in tablet_data
>  directory.
> I dropped most tables and there is ONLY one table which has 400 tablets in
> my test Kudu cluster.
> After dropping, there are still 27,000 files in tablet_data directory,
> and the output of /sbin/lsof is the same as before dropping. (kudu tserver opens
> almost 50M files)
>
> I'm curious whether this can be resolved using "kudu fs check", which is
> available in Kudu 1.4.
>
> I used Kudu 1.2 when executing `DROP TABLE` and currently using Kudu 1.3.0
>
> Regards,
>
> Jason
>
>


Re: Spark on Kudu Roadmap

2017-03-27 Thread Mike Percy
Hi Ben,
I don't really know so I'll let someone else more familiar with the Spark
integration chime in on that. However I searched the Kudu JIRA and I don't
see a tracking ticket filed on this (the closest thing I could find was
https://issues.apache.org/jira/browse/KUDU-1676 ) so you may want to file a
JIRA to help track this feature.

Mike


On Mon, Mar 27, 2017 at 11:55 AM, Benjamin Kim  wrote:

> Hi Mike,
>
> I believe what we are looking for is this below. It is an often request
> use case.
>
> Anyone know if the Spark package will ever allow for creating tables in
> Spark SQL?
>
> Such as:
>CREATE EXTERNAL TABLE 
>USING org.apache.kudu.spark.kudu
>OPTIONS (Map("kudu.master" -> “", "kudu.table" ->
> “table-name”));
>
> In this way, plain SQL can be used to do DDL, DML statements whether in
> Spark SQL code or using JDBC to interface with Spark SQL Thriftserver.
>
>
> Thanks,
> Ben
>
>
>
> On Mar 27, 2017, at 11:01 AM, Mike Percy  wrote:
>
> Hi Ben,
> Is there anything in particular you are looking for?
>
> Thanks,
> Mike
>
> On Mon, Mar 27, 2017 at 9:48 AM, Benjamin Kim  wrote:
>
>> Hi,
>>
>> Are there any plans for deeper integration with Spark especially Spark
>> SQL? Is there a roadmap to look at, so I can know what to expect in the
>> future?
>>
>> Cheers,
>> Ben
>
>
>
>


Re: Kudu on top of Alluxio

2017-03-27 Thread Mike Percy
+1 thanks for adding that Todd.

Mike


On Mon, Mar 27, 2017 at 9:55 AM, Todd Lipcon  wrote:

> On Sat, Mar 25, 2017 at 2:54 PM, Mike Percy  wrote:
>
>> Kudu currently relies on local storage on a POSIX file system. Right now
>> there is no support for S3, which would be interesting but is non-trivial
>> in certain ways (particularly if we wanted to rely on S3's replication and
>> disable Kudu's app-level replication).
>>
>> I would suggest using only either EXT4 or XFS file systems for production
>> deployments as of Kudu 1.3, in a JBOD configuration, with one SSD per
>> machine for the WAL and with the data disks on either SATA or SSD drives
>> depending on the workload. Anything else is untested AFAIK.
>>
>
> I would amend this and say that SSD for the WAL is nice to have, but not a
> requirement. We do lots of testing on non-SSD test clusters and I'm aware
> of many production clusters which also do not have SSD.
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>


Re: Spark on Kudu Roadmap

2017-03-27 Thread Mike Percy
Hi Ben,
Is there anything in particular you are looking for?

Thanks,
Mike

On Mon, Mar 27, 2017 at 9:48 AM, Benjamin Kim  wrote:

> Hi,
>
> Are there any plans for deeper integration with Spark especially Spark
> SQL? Is there a roadmap to look at, so I can know what to expect in the
> future?
>
> Cheers,
> Ben


Re: Kudu on top of Alluxio

2017-03-25 Thread Mike Percy
Yeah. I think the reason HBase can pretty easily use something like Alluxio or 
S3 and Kudu can't as easily do it is because HBase already relied on external 
storage (HDFS) for replication, so substituting another storage system with
similar properties doesn't really amount to an architectural change for them.

Mike

Sent from my iPhone

> On Mar 25, 2017, at 3:43 PM, Benjamin Kim  wrote:
> 
> Mike,
> 
> Thanks for the informative answer. I asked this question because I saw that 
> Alluxio can be used to handle storage for HBase. Plus, we could keep our 
> cluster size to a minimum and not need to add more nodes based on storage 
> capacity. We would only need to size our clusters based on load (cores, 
> memory, bandwidth) instead.
> 
> Cheers,
> Ben
> 
> 
>> On Mar 25, 2017, at 2:54 PM, Mike Percy  wrote:
>> 
>> Kudu currently relies on local storage on a POSIX file system. Right now 
>> there is no support for S3, which would be interesting but is non-trivial in 
>> certain ways (particularly if we wanted to rely on S3's replication and 
>> disable Kudu's app-level replication).
>> 
>> I would suggest using only either EXT4 or XFS file systems for production 
>> deployments as of Kudu 1.3, in a JBOD configuration, with one SSD per 
>> machine for the WAL and with the data disks on either SATA or SSD drives 
>> depending on the workload. Anything else is untested AFAIK.
>> 
>> As for Alluxio, I haven't heard of people using it for permanent storage and 
>> since Kudu has its own block cache I don't think it would really help with 
>> caching. Also I don't recall Tachyon providing POSIX semantics.
>> 
>> Mike
>> 
>> Sent from my iPhone
>> 
>>> On Mar 25, 2017, at 9:50 AM, Benjamin Kim  wrote:
>>> 
>>> Hi,
>>> 
>>> Does anyone know of a way to use AWS S3 or 
>> 
> 



Re: Kudu on top of Alluxio

2017-03-25 Thread Mike Percy
Kudu currently relies on local storage on a POSIX file system. Right now there 
is no support for S3, which would be interesting but is non-trivial in certain 
ways (particularly if we wanted to rely on S3's replication and disable Kudu's 
app-level replication).

I would suggest using only either EXT4 or XFS file systems for production 
deployments as of Kudu 1.3, in a JBOD configuration, with one SSD per machine 
for the WAL and with the data disks on either SATA or SSD drives depending on 
the workload. Anything else is untested AFAIK.

As for Alluxio, I haven't heard of people using it for permanent storage and 
since Kudu has its own block cache I don't think it would really help with 
caching. Also I don't recall Tachyon providing POSIX semantics.

Mike

Sent from my iPhone

> On Mar 25, 2017, at 9:50 AM, Benjamin Kim  wrote:
> 
> Hi,
> 
> Does anyone know of a way to use AWS S3 or 



Re: Unsubscribe

2017-02-16 Thread Mike Percy
(bcc: user@kudu.a.o)

Please go to this page to find the link to unsubscribe: 
http://kudu.apache.org/community.html

Thanks,
Mike

Sent from my iPhone

> On Feb 16, 2017, at 11:01 PM, Peter Litsegård  
> wrote:
> 
> 


Re: [ANNOUNCE] Jordan Birdsell joining the Kudu PMC

2016-11-08 Thread Mike Percy
Congrats Jordan! Very well deserved!

Mike

On Mon, Nov 7, 2016 at 11:50 PM, Todd Lipcon  wrote:

> Hi Kudu community,
>
> Today I'm very pleased to announce that the PMC has voted to add Jordan
> Birdsell as a committer and PMC member on the Apache Kudu project!
>
> Jordan began contributing to Kudu a few months ago and since then has
> amassed a great number of contributions. He has almost single-handedly
> brought the Python client up to feature parity with the Java and C++
> clients, and has also been actively helping other new users as they get
> started with Kudu.
>
> Please join me in congratulating Jordan!
>
> -Todd
>


Re: please subscribe..

2016-11-03 Thread Mike Percy
Nirav,
Please send email to user-subscr...@kudu.apache.org to subscribe. You can
find some more info at http://kudu.apache.org/community.html

Thanks,
Mike

On Thu, Nov 3, 2016 at 2:02 PM, Nirav Patel  wrote:

>
>


Re: [ANNOUNCE] Two new Kudu committer/PMC members

2016-09-12 Thread Mike Percy
Congrats Alexey and Will! Great work.

Best,
Mike

On Mon, Sep 12, 2016 at 3:55 PM, Todd Lipcon  wrote:

> Hi Kudu community,
>
> It's my great pleasure to announce that the PMC has voted to add both
> Alexey Serbin and Will Berkeley as committers and PMC members.
>
> Alexey has been contributing for a few months, including developing some
> pretty meaty (and tricky) additions. Two of note are the addition of
> doxygen for our client APIs, as well as the implementation of
> AUTO_FLUSH_BACKGROUND in C++. He has also been quite active in reviews
> recently, having reviewed 40+ patches in the last couple months. He also
> contributed by testing and voting on the recent 0.10 release.
>
> Will has been a great contributor as well, spending a lot of time in areas
> that don't get as much love from others. In particular, he's made several
> fixes to the web UIs, has greatly improved the Flume integration, and has
> been burning down a lot of long-standing bugs recently. Will also spends a
> lot of his time outside of Kudu working with users and always has good
> input on what our user community will think of a feature. Like Alexey, Will
> also participated in the 0.10 release process.
>
> Both of these community members have already been "acting the part"
> through their contributions detailed above, and the PMC is excited to
> continue working with them in their expanded roles.
>
> Please join me in congratulating them!
>
> -Todd
>


Re: Backup and restore of Kudu Metadata/Data

2016-08-23 Thread Mike Percy
Correction to my previous mail: it was pointed out to me that the create
table statement in the web UI does not include the partitioning
information. Actually, I filed that bug myself a while back and thought it
had been fixed but that's not the case:
https://issues.apache.org/jira/browse/KUDU-1253

For now, it seems you would need to manually keep track of your
partitioning scheme so that you can use the same one when recreating the
table. Actually, at recreation time, you could choose whatever partitioning
scheme you want before upserting the snapshot data.
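
As a rough illustration (not a substitute for fixing KUDU-1253), recreating a table
with an explicit partitioning scheme via the Java client (using the current
org.apache.kudu package names) could look something like the sketch below; the column
names, hash bucket count, replica count, and master address are all placeholders.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;

public class RecreateTableSketch {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      List<ColumnSchema> cols = new ArrayList<>();
      cols.add(new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64).key(true).build());
      cols.add(new ColumnSchema.ColumnSchemaBuilder("val", Type.STRING).build());
      Schema schema = new Schema(cols);
      // Re-apply the partitioning scheme you tracked manually (or pick a new one).
      CreateTableOptions opts = new CreateTableOptions()
          .addHashPartitions(Collections.singletonList("id"), 4)
          .setNumReplicas(3);
      client.createTable("restored_table", schema, opts);
    } finally {
      client.close();
    }
  }
}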

Mike

On Tue, Aug 23, 2016 at 1:15 PM, Mike Percy  wrote:

> Hi Amit,
> If you only want to restore a single table then the data part should be
> easy, since you can only snapshot scan the data of a single table at a
> given time.
>
> An alternative way to restore a table is to look at the web UI and check
> out the create table statement shown there. It should list partitions, etc.
> If you copy that DDL then you can use that to recreate the table and then
> reload the data from the snapshot scan. I am assuming you would use
> something like HDFS to store the results of the snapshot scan. However that
> would be a somewhat manual process, we need to improve this.
>
> A couple other things to note:
> 1. Kudu doesn't currently provide table-wide snapshot scan consistency.
> The snapshot scan will be consistent on a per-tablet basis, however.
> 2. Using a snapshot scan will not restore the historical MVCC data after
> you load the snapshot.
>
> Best,
> Mike
>
>
> On Tue, Aug 23, 2016 at 3:17 AM, Amit Adhau 
> wrote:
>
>> Thanks a lot Mike. yes, proper backup and restore mechanism will
>> certainly help.
>>
>> I have one more question: if a need arises to restore any specific
>> table [partial restore], how can I identify which metadata and data
>> files related to that table I should restore, or is it possible?
>>
>> Thanks,
>> Amit
>>
>> On Tue, Aug 23, 2016 at 4:44 AM, Mike Percy  wrote:
>>
>>> I would recommend a snapshot scan for data backup. You can easily do
>>> that with MapReduce.
>>>
>>> Metadata backup is tough. One thing you could do is backup the master
>>> data and wal directories. If your filesystem supports snapshots then taking
>>> a snapshot of those directories should give you a consistent backup.
>>> Otherwise you should shut down the master, copy the master data and wal
>>> dirs, then bring the master back up.
>>>
>>> For restoring a metadata backup, it's as simple as restoring the file
>>> system data for the master. For restoring a data backup, you could first
>>> drop the tables, recreate them, then run a MapReduce job that upserts all
>>> the data from the snapshot scan.
>>>
>>> All in all, backup and restore is something that is probably going to
>>> get worked on very soon, so thanks for reminding us. We know we need to
>>> document these procedures and make them easier and less rough around the
>>> edges.
>>>
>>> Although I know this has been discussed in the past, I couldn't find a
>>> JIRA so I filed https://issues.apache.org/jira/browse/KUDU-1575 to
>>> track this work.
>>>
>>> Best,
>>> Mike
>>>
>>>
>>> On Wed, Aug 17, 2016 at 7:05 PM, Mac Noland 
>>> wrote:
>>>
>>>> From an Impala perspective, is making a scheduled copy of the table
>>>> into HDFS an option for you?
>>>>
>>>> http://kudu.apache.org/faq.html
>>>>
>>>> How can I back up my Kudu data?
>>>> <http://kudu.apache.org/faq.html#how-can-i-back-up-my-kudu-data>
>>>>
>>>> Kudu doesn’t yet have a built-in backup mechanism. Similar to bulk
>>>> loading data, Impala can help if you have it available. You can use it to
>>>> copy your data into Parquet format using a statement like:
>>>>
>>>> INSERT INTO TABLE some_parquet_table SELECT * FROM kudu_table
>>>>
>>>> then use distcp <http://hadoop.apache.org/docs/r1.2.1/distcp2.html> to
>>>> copy the Parquet data to another cluster. While Kudu is in beta, we’re not
>>>> expecting people to deploy mission-critical workloads on it yet.
>>>>
>>>>
>>>>
>>>> On Wed, Aug 17, 2016 at 7:07 AM, Amit Adhau 
>>>> wrote:
>>>>
>>>>> Hi Kudu team,
>>>>>
>>>>> Can you please suggest what would be the best way/policy to backup and
>>>>> restore the Kudu me

Re: Backup and restore of Kudu Metadata/Data

2016-08-23 Thread Mike Percy
Hi Amit,
If you only want to restore a single table then the data part should be
easy, since you can only snapshot scan the data of a single table at a
given time.

An alternative way to restore a table is to look at the web UI and check
out the create table statement shown there. It should list partitions, etc.
If you copy that DDL then you can use that to recreate the table and then
reload the data from the snapshot scan. I am assuming you would use
something like HDFS to store the results of the snapshot scan. However that
would be a somewhat manual process, we need to improve this.

A couple other things to note:
1. Kudu doesn't currently provide table-wide snapshot scan consistency. The
snapshot scan will be consistent on a per-tablet basis, however.
2. Using a snapshot scan will not restore the historical MVCC data after
you load the snapshot.

Best,
Mike


On Tue, Aug 23, 2016 at 3:17 AM, Amit Adhau  wrote:

> Thanks a lot Mike. yes, proper backup and restore mechanism will certainly
> help.
>
> I have one more question: if a need arises to restore any specific
> table [partial restore], how can I identify which metadata and data
> files related to that table I should restore, or is it possible?
>
> Thanks,
> Amit
>
> On Tue, Aug 23, 2016 at 4:44 AM, Mike Percy  wrote:
>
>> I would recommend a snapshot scan for data backup. You can easily do that
>> with MapReduce.
>>
>> Metadata backup is tough. One thing you could do is backup the master
>> data and wal directories. If your filesystem supports snapshots then taking
>> a snapshot of those directories should give you a consistent backup.
>> Otherwise you should shut down the master, copy the master data and wal
>> dirs, then bring the master back up.
>>
>> For restoring a metadata backup, it's as simple as restoring the file
>> system data for the master. For restoring a data backup, you could first
>> drop the tables, recreate them, then run a MapReduce job that upserts all
>> the data from the snapshot scan.
>>
>> All in all, backup and restore is something that is probably going to get
>> worked on very soon, so thanks for reminding us. We know we need to
>> document these procedures and make them easier and less rough around the
>> edges.
>>
>> Although I know this has been discussed in the past, I couldn't find a
>> JIRA so I filed https://issues.apache.org/jira/browse/KUDU-1575 to track
>> this work.
>>
>> Best,
>> Mike
>>
>>
>> On Wed, Aug 17, 2016 at 7:05 PM, Mac Noland 
>> wrote:
>>
>>> From an Impala perspective, is making a scheduled copy of the table into
>>> HDFS an option for you?
>>>
>>> http://kudu.apache.org/faq.html
>>>
>>> How can I back up my Kudu data?
>>> <http://kudu.apache.org/faq.html#how-can-i-back-up-my-kudu-data>
>>>
>>> Kudu doesn’t yet have a built-in backup mechanism. Similar to bulk
>>> loading data, Impala can help if you have it available. You can use it to
>>> copy your data into Parquet format using a statement like:
>>>
>>> INSERT INTO TABLE some_parquet_table SELECT * FROM kudu_table
>>>
>>> then use distcp <http://hadoop.apache.org/docs/r1.2.1/distcp2.html> to
>>> copy the Parquet data to another cluster. While Kudu is in beta, we’re not
>>> expecting people to deploy mission-critical workloads on it yet.
>>>
>>>
>>>
>>> On Wed, Aug 17, 2016 at 7:07 AM, Amit Adhau 
>>> wrote:
>>>
>>>> Hi Kudu team,
>>>>
>>>> Can you please suggest what would be the best way/policy to backup and
>>>> restore the Kudu metadata/data on kudu side as well as on Impala side and
>>>> also, if that can be automated.
>>>>
>>>> --
>>>> Thanks & Regards,
>>>>
>>>> *Amit Adhau* | Data Architect
>>>>
>>>> *GLOBANT* | IND:+91 9821518132
>>>>

Re: Backup and restore of Kudu Metadata/Data

2016-08-22 Thread Mike Percy
I would recommend a snapshot scan for data backup. You can easily do that
with MapReduce.
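
For reference, here is a bare-bones sketch of a snapshot scan with the Kudu Java client
(current 1.x package names, outside of MapReduce); the master address and table name are
placeholders, and writing the rows out to your backup target is left as a comment.

import org.apache.kudu.client.AsyncKuduScanner;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResult;

public class SnapshotScanSketch {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      KuduTable table = client.openTable("my_table");
      // READ_AT_SNAPSHOT scans a consistent view of each tablet as of the scan timestamp.
      KuduScanner scanner = client.newScannerBuilder(table)
          .readMode(AsyncKuduScanner.ReadMode.READ_AT_SNAPSHOT)
          .build();
      while (scanner.hasMoreRows()) {
        for (RowResult row : scanner.nextRows()) {
          // Write the row out to the backup target (e.g. files in HDFS).
        }
      }
      scanner.close();
    } finally {
      client.close();
    }
  }
}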

Metadata backup is tough. One thing you could do is backup the master data
and wal directories. If your filesystem supports snapshots then taking a
snapshot of those directories should give you a consistent backup.
Otherwise you should shut down the master, copy the master data and wal
dirs, then bring the master back up.

For restoring a metadata backup, it's as simple as restoring the file
system data for the master. For restoring a data backup, you could first
drop the tables, recreate them, then run a MapReduce job that upserts all
the data from the snapshot scan.
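
And a similarly rough sketch of the reload step, upserting rows back into the recreated
table with the Java client (a MapReduce job would do the same per record); the column
names and values here are placeholders only.

import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.SessionConfiguration;
import org.apache.kudu.client.Upsert;

public class UpsertRestoreSketch {
  public static void main(String[] args) throws KuduException {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    KuduSession session = client.newSession();
    // Batch writes in the background instead of flushing each row synchronously.
    session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
    try {
      KuduTable table = client.openTable("my_table");
      // In a real restore this loop would iterate over the backed-up snapshot rows.
      for (long id = 0; id < 3; id++) {
        Upsert upsert = table.newUpsert();
        PartialRow row = upsert.getRow();
        row.addLong("id", id);
        row.addString("val", "restored-" + id);
        session.apply(upsert);
      }
      session.flush();
    } finally {
      session.close();
      client.close();
    }
  }
}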

All in all, backup and restore is something that is probably going to get
worked on very soon, so thanks for reminding us. We know we need to
document these procedures and make them easier and less rough around the
edges.

Although I know this has been discussed in the past, I couldn't find a JIRA
so I filed https://issues.apache.org/jira/browse/KUDU-1575 to track this
work.

Best,
Mike


On Wed, Aug 17, 2016 at 7:05 PM, Mac Noland 
wrote:

> From an Impala perspective, is making a scheduled copy of the table into
> HDFS an option for you?
>
> http://kudu.apache.org/faq.html
>
> How can I back up my Kudu data?
> 
>
> Kudu doesn’t yet have a built-in backup mechanism. Similar to bulk loading
> data, Impala can help if you have it available. You can use it to copy your
> data into Parquet format using a statement like:
>
> INSERT INTO TABLE some_parquet_table SELECT * FROM kudu_table
>
> then use distcp  to
> copy the Parquet data to another cluster. While Kudu is in beta, we’re not
> expecting people to deploy mission-critical workloads on it yet.
>
>
>
> On Wed, Aug 17, 2016 at 7:07 AM, Amit Adhau 
> wrote:
>
>> Hi Kudu team,
>>
>> Can you please suggest what would be the best way/policy to backup and
>> restore the Kudu metadata/data on kudu side as well as on Impala side and
>> also, if that can be automated.
>>
>> --
>> Thanks & Regards,
>>
>> *Amit Adhau* | Data Architect
>>
>> *GLOBANT* | IND:+91 9821518132
>>
>


Re: Where can we Use Apache Kudu?

2016-08-05 Thread Mike Percy
Hi Darshan,
You should be able to use Kudu as an additional store alongside HDFS and
Phoenix. Your data scientists should be able to do joins across HDFS,
HBase, and Kudu using Spark. You could also use Apache Impala (incubating)
to do those joins, however Impala does not support accessing Phoenix, as
far as I know.

You can also access Kudu from R if you go through rimpala:
http://blog.cloudera.com/blog/2013/12/how-to-do-statistical-analysis-with-impala-and-r/
... but I have never used R, myself.

Hope this helps!
Mike

On Wed, Aug 3, 2016 at 11:02 PM, Darshan Shah  wrote:

> Following is our current architecture...
>
>
>
> We have huge data residing in HDFS.. That we do not want to change.
>
>
>
> With Impala select queries, we are taking that data and loading it in
> HBase, using Phoenix. Which is then used by data scientists to do analysis
> using R and Spark.
>
>
>
> Each data set creates new schemas and tables in hbase, so its fast for
> data scientists to do analysis...
>
>
>
>
>
> We want to go for Kudu for obvious advantages in this space.
>
>
>
> Can you tell me where can we fit it?
>
>
> Thanks,
>
> Darshan...
>