Hi David,

My current concern is that I'm using a Spark HBase bulk-put driver written for Spark 1.2 on the version of CDH my Spark/YARN cluster is running. Even if I were to run on another Spark cluster, I'm concerned that I might have issues making the put requests into HBase. However, I should give it a shot if I abandon Spark 1.2 and my current environment.
Thanks,
Colin Williams

On Mon, May 2, 2016 at 6:06 PM, Krieg, David <david.kr...@earlywarning.com> wrote:
> Spark 1.2 is a little old and busted. I think most of the advice you'll get
> is to try to use Spark 1.3 at least, which introduced a new Spark streaming
> mode (direct receiver). The 1.2 receiver-based implementation had a number
> of shortcomings. 1.3 is where the "direct streaming" interface was
> introduced, which is what we use. You'll get more joy the more you upgrade
> Spark, at least to some extent.
>
> David Krieg | Enterprise Software Engineer
> Early Warning
> Direct: 480.426.2171 | Fax: 480.483.4628 | Mobile: 859.227.6173
>
>
> -----Original Message-----
> From: Colin Kincaid Williams [mailto:disc...@uw.edu]
> Sent: Monday, May 02, 2016 10:55 AM
> To: user@spark.apache.org
> Subject: Improving performance of a kafka spark streaming app
>
> I've written an application to get content from a Kafka topic with 1.7
> billion entries, get the protobuf-serialized entries, and insert them into
> HBase. Currently the environment that I'm running in is Spark 1.2.
>
> With 8 executors and 2 cores, and 2 jobs, I'm only getting between
> 0-2500 writes/second. This will take much too long to consume the entries.
>
> I currently believe that the Spark Kafka receiver is the bottleneck.
> I've tried both 1.2 receivers, with the WAL and without, and didn't notice
> any large performance difference. I've tried many different Spark
> configuration options, but can't seem to get better performance.
>
> I saw 80000 requests/second inserting these records into kafka using
> yarn / hbase / protobuf / kafka in a bulk fashion.
>
> While HBase inserts might not deliver the same throughput, I'd like to at
> least get 10%.
> My application looks like
> https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877
>
> This is my first spark application. I'd appreciate any assistance.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> ---------------------------------------------------------------------
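
P.S. To put a number on "much too long" — a quick back-of-envelope using only the figures from the thread (1.7 billion entries, the observed ~2,500 writes/s, the 80,000 requests/s bulk rate, and the 10% target); the script itself is just my own illustration:

```python
# Back-of-envelope: how long draining the topic takes at various write rates.
# All rates below are taken from the thread; the helper is illustrative only.

TOTAL_ENTRIES = 1_700_000_000  # entries in the Kafka topic

def days_to_drain(writes_per_sec: float) -> float:
    """Days needed to consume the whole topic at a sustained write rate."""
    return TOTAL_ENTRIES / writes_per_sec / 86_400  # 86,400 seconds per day

print(f"at  2,500/s: {days_to_drain(2_500):.1f} days")   # current observed rate
print(f"at  8,000/s: {days_to_drain(8_000):.1f} days")   # the 10% target
print(f"at 80,000/s: {days_to_drain(80_000):.1f} days")  # the bulk-load rate
```

So the observed rate means roughly eight days to clear the backlog, while even the 10% target cuts that to about two and a half — which is the gap that makes moving off the 1.2 receiver worth considering.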