Hello again. I searched the list archives for "backport kafka" but found nothing except a post from the Spark 0.7.2 era. I was going to use an accumulator as a counter, but then noticed the Receiver Statistics on the Streaming tab of the UI. So I stripped out all other functionality except:
    // createStream(JavaStreamingContext jssc, Class<K> keyTypeClass,
    //     Class<V> valueTypeClass, Class<U> keyDecoderClass,
    //     Class<T> valueDecoderClass, java.util.Map<String,String> kafkaParams,
    //     java.util.Map<String,Integer> topics, StorageLevel storageLevel)
    JavaPairReceiverInputDStream<byte[], byte[]> dstream = KafkaUtils.createStream(
        jssc,
        byte[].class,
        byte[].class,
        kafka.serializer.DefaultDecoder.class,
        kafka.serializer.DefaultDecoder.class,
        kafkaParamsMap,
        topicMap,
        StorageLevel.MEMORY_AND_DISK_SER());

    dstream.print();

In the Receiver Statistics for that single receiver, I'm seeing around 380
records / second, even with print() as the only output operation. At that
rate I'd need around 21 receivers just to reach the 10% target from my
earlier message (8,000 records / second), and 200+ receivers to match the
80,000 records / second I saw writing into Kafka. This seems awfully high to
me, considering that the write path was a MapReduce job whose bottleneck was
likely HBase.

Even given the known issues with the 1.2 receivers, is this the expected way
to use the Kafka streaming API, or am I doing something terribly wrong?

My application looks like
https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877

On Mon, May 2, 2016 at 6:09 PM, Cody Koeninger <c...@koeninger.org> wrote:
> Have you tested for read throughput (without writing to HBase, just
> deserialize)?
>
> Are you limited to using Spark 1.2, or is upgrading possible? The
> Kafka direct stream is available starting with 1.3. If you're stuck
> on 1.2, I believe there have been some attempts to backport it; search
> the mailing list archives.
>
> On Mon, May 2, 2016 at 12:54 PM, Colin Kincaid Williams <disc...@uw.edu>
> wrote:
>> I've written an application to read from a Kafka topic with 1.7
>> billion entries, take the protobuf-serialized entries, and insert them
>> into HBase. The environment I'm currently running in is Spark 1.2.
>>
>> With 8 executors, 2 cores each, and 2 jobs, I'm only getting between
>> 0-2500 writes / second. At that rate it will take far too long to
>> consume the entries.
>>
>> I currently believe the Spark Kafka receiver is the bottleneck. I've
>> tried both 1.2 receivers, with the WAL and without, and didn't notice
>> any large performance difference. I've tried many different Spark
>> configuration options, but can't seem to get better performance.
>>
>> I saw 80,000 requests / second inserting these records into Kafka in a
>> bulk fashion using YARN / HBase / protobuf / Kafka.
>>
>> While HBase inserts might not deliver the same throughput, I'd like to
>> at least get 10% of that.
>>
>> My application looks like
>> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>>
>> This is my first Spark application. I'd appreciate any assistance.
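P.S. For anyone finding this thread in the archives later, here is roughly
what I plan to try next on 1.2: several receivers unioned into a single
DStream, with count() in place of print(). This is an untested sketch;
numReceivers is a placeholder, and jssc, kafkaParamsMap, and topicMap are
the same objects as in the snippet above.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.storage.StorageLevel;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    // Each createStream call starts one receiver, and each receiver pins
    // an executor core, so the practical ceiling is core count.
    int numReceivers = 4; // placeholder; ~21 would be needed for the 10% target

    List<JavaPairDStream<byte[], byte[]>> streams =
        new ArrayList<JavaPairDStream<byte[], byte[]>>();
    for (int i = 0; i < numReceivers; i++) {
      streams.add(KafkaUtils.createStream(
          jssc,
          byte[].class,
          byte[].class,
          kafka.serializer.DefaultDecoder.class,
          kafka.serializer.DefaultDecoder.class,
          kafkaParamsMap,
          topicMap,
          StorageLevel.MEMORY_AND_DISK_SER()));
    }

    // Union into one DStream so the rest of the pipeline is unchanged.
    JavaPairDStream<byte[], byte[]> unioned =
        jssc.union(streams.get(0), streams.subList(1, streams.size()));

    // count() forces every record in the batch to be read; print() only
    // materializes the first few records per batch.
    unioned.count().print();

One catch: with 8 executors and 2 cores each, there are only 16 cores total,
so 21 receivers plus processing won't fit in this cluster as-is. That alone
makes me think the receiver approach is the wrong shape for this job.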
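And if the upgrade Cody mentioned is possible, my understanding of the 1.3+
API is that the direct stream has no receivers at all: each Kafka partition
becomes an RDD partition, so parallelism comes from the topic itself.
Another untested sketch; the broker list and topic name are placeholders
for our setup, and note that it takes "metadata.broker.list" rather than
the ZooKeeper quorum:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    Map<String, String> kafkaParams = new HashMap<String, String>();
    // The direct stream reads from the brokers, not through ZooKeeper.
    kafkaParams.put("metadata.broker.list", "broker1:9092,broker2:9092"); // placeholder
    Set<String> topics = Collections.singleton("mytopic"); // placeholder

    JavaPairInputDStream<byte[], byte[]> direct = KafkaUtils.createDirectStream(
        jssc,
        byte[].class,
        byte[].class,
        kafka.serializer.DefaultDecoder.class,
        kafka.serializer.DefaultDecoder.class,
        kafkaParams,
        topics);

    // Count-only read test, per Cody's suggestion: no HBase writes.
    // Adding the protobuf parse in a map() would include deserialization.
    direct.foreachRDD(new Function<JavaPairRDD<byte[], byte[]>, Void>() {
      public Void call(JavaPairRDD<byte[], byte[]> rdd) {
        System.out.println("records this batch: " + rdd.count());
        return null;
      }
    });

If that per-batch count scales with the number of Kafka partitions, the
receiver really is the bottleneck.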