Hi mck,

I'm not familiar with this ticket, but my understanding was that performance of Hadoop jobs on C* clusters with vnodes was poor because a given Hadoop input split had to run many individual scans (one per vnode) rather than a single scan. I've run C* and Hadoop in production with a custom input format that supported vnodes (it simply combined multiple vnode token ranges into a single input split) and didn't see any issues there; the jobs had many other performance bottlenecks besides starting multiple scans against C*.
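To make the idea concrete, here's a minimal sketch of how that input format grouped vnode token ranges into splits. This is illustrative only: TokenRange and the greedy grouping are simplified stand-ins, not the actual Hadoop or Cassandra APIs.

```java
import java.util.ArrayList;
import java.util.List;

public class CombineVnodeSplits {
    // Simplified stand-in for a token range owned by one vnode
    // (not the real C* TokenRange type).
    record TokenRange(long start, long end) {}

    // Greedily pack the vnode ranges into at most maxSplits groups,
    // so each Hadoop split scans several contiguous ranges instead of one.
    static List<List<TokenRange>> combine(List<TokenRange> ranges, int maxSplits) {
        int perSplit = (int) Math.ceil((double) ranges.size() / maxSplits);
        List<List<TokenRange>> splits = new ArrayList<>();
        for (int i = 0; i < ranges.size(); i += perSplit) {
            splits.add(new ArrayList<>(
                ranges.subList(i, Math.min(i + perSplit, ranges.size()))));
        }
        return splits;
    }

    public static void main(String[] args) {
        // e.g. 256 vnodes per node, collapsed into 16 input splits
        List<TokenRange> vnodes = new ArrayList<>();
        for (int i = 0; i < 256; i++) {
            vnodes.add(new TokenRange(i * 100L, i * 100L + 99));
        }
        List<List<TokenRange>> splits = combine(vnodes, 16);
        System.out.println(splits.size());        // 16 splits
        System.out.println(splits.get(0).size()); // 16 ranges per split
    }
}
```

Each split still issues one scan per range it contains, but the number of splits (and thus task-scheduling overhead) stays manageable.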
This is one of the videos where I recall an off-hand mention of the Spark connector working with vnodes: https://www.youtube.com/watch?v=1NtnrdIUlg0

Best regards,
Clint

On Sat, Feb 21, 2015 at 2:58 PM, mck <m...@apache.org> wrote:
> At least the problem of hadoop and vnodes described in CASSANDRA-6091
> doesn't apply to spark.
> (Spark already allows multiple token ranges per split).
>
> If this is the reason why DSE hasn't enabled vnodes then fingers crossed
> that'll change soon.
>
>
> > Some of the DataStax videos that I watched discussed how the Cassandra
> > Spark connecter has optimizations to deal with vnodes.
>
> Are these videos public? if so got any link to them?
>
> ~mck