Todd, I had it at one replica. Do I have to recreate?

Thanks,
Ben
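(For reference: if a table created with a single replica does have to be recreated, a minimal sketch of creating it with three replicas via the Java client might look like the following. The class and package names follow the pre-1.0 org.kududb client that appears in the stack trace below; the schema, hash partitioning, table name, and master address are placeholders.)

    import org.kududb.ColumnSchema.ColumnSchemaBuilder
    import org.kududb.{Schema, Type}
    import org.kududb.client.{CreateTableOptions, KuduClient}
    import scala.collection.JavaConverters._

    // Placeholder master address and a toy two-column schema.
    val client = new KuduClient.KuduClientBuilder("kudu-master-host:7051").build()
    val columns = List(
      new ColumnSchemaBuilder("id", Type.STRING).key(true).build(),
      new ColumnSchemaBuilder("value", Type.STRING).build()
    ).asJava
    val schema = new Schema(columns)

    // Three replicas per tablet instead of one, plus example hash partitioning.
    val opts = new CreateTableOptions()
      .setNumReplicas(3)
      .addHashPartitions(List("id").asJava, 4)

    client.createTable("user_profiles", schema, opts)
    client.shutdown()

With three replicas, losing a single tablet server still leaves a majority of each tablet's replicas available, so scans and writes can continue while the failed server is replaced.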
> On Jul 11, 2016, at 10:37 AM, Todd Lipcon <t...@cloudera.com> wrote:
>
> Hey Ben,
>
> Is the table that you're querying replicated? Or was it created with only one
> replica per tablet?
>
> -Todd
>
> On Mon, Jul 11, 2016 at 10:35 AM, Benjamin Kim <b...@amobee.com> wrote:
> Over the weekend, a tablet server went down. It's not coming back up. So, I
> decommissioned it and removed it from the cluster. Then, I restarted Kudu
> because I was getting a timeout exception trying to do counts on the table.
> Now, when I try again, I get the same error.
>
> 16/07/11 17:32:36 WARN scheduler.TaskSetManager: Lost task 468.3 in stage 0.0
> (TID 603, prod-dc1-datanode167.pdc1i.gradientx.com):
> com.stumbleupon.async.TimeoutException: Timed out after 30000ms when joining
> Deferred@712342716(state=PAUSED, result=Deferred@1765902299,
> callback=passthrough -> scanner opened -> wakeup thread Executor task launch worker-2,
> errback=openScanner errback -> passthrough -> wakeup thread Executor task launch worker-2)
> at com.stumbleupon.async.Deferred.doJoin(Deferred.java:1177)
> at com.stumbleupon.async.Deferred.join(Deferred.java:1045)
> at org.kududb.client.KuduScanner.nextRows(KuduScanner.java:57)
> at org.kududb.spark.kudu.RowResultIteratorScala.hasNext(KuduRDD.scala:99)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
> at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Does anyone know how to recover from this?
>
> Thanks,
> Benjamin Kim
> Data Solutions Architect, Amobee
>
>> On Jul 6, 2016, at 9:46 AM, Dan Burkert <d...@cloudera.com> wrote:
>>
>> On Wed, Jul 6, 2016 at 7:05 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> Over the weekend, the row count climbed to just under 500M. I will give it
>> another few days to get to 1B rows. I still get consistent times of ~15s for
>> row counts despite the amount of data growing.
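(For reference, the row counts discussed in this thread are along the lines of the following minimal kudu-spark sketch. The format and option names follow the 0.9.x-era connector (org.kududb.spark.kudu) seen in the stack trace above; the master address and table name are the same placeholders as before.)

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object KuduCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kudu-count"))
        val sqlContext = new SQLContext(sc)

        // Load the Kudu table as a DataFrame.
        val df = sqlContext.read
          .format("org.kududb.spark.kudu")
          .option("kudu.master", "kudu-master-host:7051")
          .option("kudu.table", "user_profiles")
          .load()

        // The count query discussed in this thread.
        println(df.count())
        sc.stop()
      }
    }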
>>
>> On another note, I got a solicitation email from SnappyData to evaluate
>> their product. They claim to be the "Spark Data Store" with tight
>> integration with Spark executors, and to be a combined OLTP and OLAP system
>> that works as an in-memory data store first and spills to disk. After going
>> to several Spark events, it would seem that this is the new "hot" area for
>> vendors. They all (MemSQL, Redis, Aerospike, Datastax, etc.) claim to be the
>> best "Spark Data Store". I'm wondering if Kudu will become this too. With
>> the performance I've seen so far, it would seem that it can be a contender.
>> All that is needed is a hardened Spark connector package, I would think. The
>> next evaluation I will be conducting is to see whether SnappyData's claims
>> hold up by doing my own tests.
>>
>> It's hard to compare Kudu against any other data store without a lot of
>> analysis and thorough benchmarking, but it is certainly a goal of Kudu to be
>> a great platform for ingesting and analyzing data through Spark. Up to this
>> point most of the Spark work has been community driven, but more thorough
>> integration testing of the Spark connector is going to be a focus going
>> forward.
>>
>> - Dan
>>
>> Cheers,
>> Ben
>>
>>> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>> Hi Benjamin,
>>>
>>> What workload are you using for benchmarks? Spark or something more
>>> custom? RDDs, DataFrames, SQL, etc.? Maybe you can share the schema and
>>> some queries.
>>>
>>> Todd
>>>
>>> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>> Hi Todd,
>>>
>>> Now that Kudu 0.9.0 is out, I have done some tests. Already, I am
>>> impressed. Compared to HBase, read and write performance are better. Write
>>> performance shows the greatest improvement (> 4x), while reads are > 1.5x
>>> faster. Albeit, these are only preliminary tests. Do you know of a way to
>>> really do some conclusive tests? I want to see if I can match your results
>>> on my 50 node cluster.
>>>
>>> Thanks,
>>> Ben
>>>
>>>> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>>>
>>>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>> Todd,
>>>>
>>>> It sounds like Kudu can possibly top or match those numbers put out by
>>>> Aerospike. Do you have any performance statistics published, or any
>>>> instructions on how to measure them myself as a good way to test? In
>>>> addition, this will be a test using Spark, so should I wait for Kudu
>>>> version 0.9.0, where support will be built in?
>>>>
>>>> We don't have a lot of benchmarks published yet, especially on the write
>>>> side. I've found that thorough cross-system benchmarks are very difficult
>>>> to do fairly and accurately, and oftentimes users end up misguided if
>>>> they pay too much attention to them :) So, given a finite number of
>>>> developers working on Kudu, I think we've tended to spend more time on the
>>>> project itself and less time focusing on "competition". I'm sure there are
>>>> use cases where Kudu will beat out Aerospike, and probably use cases where
>>>> Aerospike will beat Kudu as well.
>>>>
>>>> From my perspective, it would be great if you could share some details of
>>>> your workload, especially if there are areas where you're finding Kudu
>>>> lacking. Maybe we can spot some easy code changes we could make to improve
>>>> performance, or suggest a tuning variable you could change.
>>>>
>>>> -Todd
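(For reference, a write-side test of the kind discussed here usually batches inserts through a client session in background-flush mode. A minimal sketch against the same placeholder table and columns as the earlier create-table sketch, using the org.kududb Java client from Scala:)

    import org.kududb.client.{KuduClient, SessionConfiguration}

    val client = new KuduClient.KuduClientBuilder("kudu-master-host:7051").build()
    val table = client.openTable("user_profiles")
    val session = client.newSession()

    // Buffer operations and flush them to the tablet servers in the background
    // instead of flushing one operation at a time.
    session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND)
    session.setMutationBufferSpace(10000)

    for (i <- 0 until 1000000) {
      val insert = table.newInsert()
      val row = insert.getRow
      row.addString("id", s"user-$i")
      row.addString("value", s"attr-$i")
      session.apply(insert)
    }
    session.flush()   // wait for any remaining buffered operations
    session.close()
    client.shutdown()

A real benchmark would spread a loop like this across many clients or Spark executors; the point here is only the session flush mode.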
>>>>
>>>>> On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>>>
>>>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>> Hi Mike,
>>>>>
>>>>> First of all, thanks for the link. It looks like an interesting read. I
>>>>> checked that Aerospike is currently at version 3.8.2.3, and in the
>>>>> article they are evaluating version 3.5.4. The main thing that impressed
>>>>> me was their claim that they can beat Cassandra and HBase by 8x for
>>>>> writing and 25x for reading. Their big claim to fame is that Aerospike
>>>>> can write 1M records per second with only 50 nodes. I wanted to see if
>>>>> this is real.
>>>>>
>>>>> 1M records per second on 50 nodes is pretty doable by Kudu as well,
>>>>> depending on the size of your records and the insertion order. I've been
>>>>> playing with a ~70 node cluster recently and seen 1M+ writes/second
>>>>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, on
>>>>> pretty old HDD-only nodes. I think newer flash-based nodes could do
>>>>> better.
>>>>>
>>>>> To answer your questions, we have a DMP with user profiles that have many
>>>>> attributes. We create segmentation information off of these attributes to
>>>>> classify them. Then, we can target advertising appropriately for our
>>>>> sales department. Much of the data processing involves applying models to
>>>>> all, or at least most, of every profile's attributes to find similarities
>>>>> (nearest neighbor/clustering) over a large number of rows when batch
>>>>> processing, or over a small subset of rows for quick online scoring. So,
>>>>> our use case is a typical advanced analytics scenario. We have tried
>>>>> HBase, but it doesn't work well for these types of analytics.
>>>>>
>>>>> I read in the Aerospike release notes that they have made many
>>>>> improvements for batch and scan operations.
>>>>>
>>>>> I wonder what your thoughts are on using Kudu for this.
>>>>>
>>>>> Sounds like a good Kudu use case to me. I've heard great things about
>>>>> Aerospike for the low-latency random access portion, but I've also heard
>>>>> that it's _very_ expensive, and not particularly suited to the columnar
>>>>> scan workload. Lastly, I think the Apache license of Kudu is much more
>>>>> appealing than the AGPL3 used by Aerospike. But that's not really a
>>>>> direct answer to the performance question :)
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>> On May 27, 2016, at 6:21 PM, Mike Percy <mpe...@cloudera.com> wrote:
>>>>>>
>>>>>> Have you considered whether you have a scan-heavy or a random-access-heavy
>>>>>> workload? Have you considered whether you always access/update a whole
>>>>>> row vs. only a partial row? Kudu is a column store, so it has some
>>>>>> awesome performance characteristics when you are doing a lot of scanning
>>>>>> of just a couple of columns.
>>>>>>
>>>>>> I don't know the answer to your question, but if your concern is
>>>>>> performance then I would be interested in seeing comparisons from a perf
>>>>>> perspective on certain workloads.
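(For reference, the columnar-scan point above corresponds to column projection in the client: a scan that names only the columns it needs reads only those columns. A minimal sketch with the org.kududb Java client and the same placeholder table:)

    import org.kududb.client.KuduClient
    import scala.collection.JavaConverters._

    val client = new KuduClient.KuduClientBuilder("kudu-master-host:7051").build()
    val table = client.openTable("user_profiles")

    // Project just the two columns this scan actually needs.
    val scanner = client.newScannerBuilder(table)
      .setProjectedColumnNames(List("id", "value").asJava)
      .build()

    var rows = 0L
    while (scanner.hasMoreRows) {
      val batch = scanner.nextRows()
      while (batch.hasNext) {
        batch.next()
        rows += 1
      }
    }
    scanner.close()
    client.shutdown()
    println(rows)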
>>>>>>
>>>>>> Finally, a year ago Aerospike did quite poorly in a Jepsen test:
>>>>>> https://aphyr.com/posts/324-jepsen-aerospike
>>>>>>
>>>>>> I wonder if they have addressed any of those issues.
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> On Friday, May 27, 2016, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>> I am just curious. How will Kudu compare with Aerospike
>>>>>> (http://www.aerospike.com)? I went to a Spark Roadshow and found out
>>>>>> about this piece of software. It appears to fit our use case perfectly,
>>>>>> since we are an ad-tech company trying to leverage our user profile
>>>>>> data. Plus, it already has a Spark connector and a SQL-like client. The
>>>>>> tables can be accessed using Spark SQL DataFrames and also made into SQL
>>>>>> tables for direct use with the Spark SQL ODBC/JDBC Thriftserver. I see
>>>>>> from the work done here, http://gerrit.cloudera.org:8080/#/c/2992/, that
>>>>>> the Spark integration is well underway and, from the looks of it lately,
>>>>>> almost complete. I would prefer to use Kudu since we are already a
>>>>>> Cloudera shop, and Kudu is easy to deploy and configure using Cloudera
>>>>>> Manager. I also hope that some of Aerospike's speed optimization
>>>>>> techniques can make it into Kudu in the future, if they have not already
>>>>>> been thought of or included.
>>>>>>
>>>>>> Just some thoughts…
>>>>>>
>>>>>> Cheers,
>>>>>> Ben
>>>>>>
>>>>>> --
>>>>>> Mike Percy
>>>>>> Software Engineer, Cloudera
>>>>>
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
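(For reference, the Spark SQL usage described in Ben's original question above, exposing a Kudu-backed DataFrame as a table that can be queried with SQL, might look like the following sketch on the Spark 1.6-era API, with the same placeholder master address and table name as before.)

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object KuduSql {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kudu-sql"))
        val sqlContext = new SQLContext(sc)

        val df = sqlContext.read
          .format("org.kududb.spark.kudu")
          .option("kudu.master", "kudu-master-host:7051")
          .option("kudu.table", "user_profiles")
          .load()

        // Spark 1.6-style temp table; a persistent table or the Thriftserver
        // would build on the same DataFrame.
        df.registerTempTable("user_profiles")
        sqlContext.sql("SELECT value, COUNT(*) FROM user_profiles GROUP BY value").show()

        sc.stop()
      }
    }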