StateStoreSaveExec / StateStoreRestoreExec

2017-01-02 Thread Jeremy Smith
I have a question about state tracking in Structured Streaming. First, let me briefly explain my use case: given a mutable data source (e.g. an RDBMS) in which we assume we can retrieve a set of newly created row versions (that is, rows that were created or updated between two given `Offset`s,
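
A minimal sketch of the use case described above, using plain JDBC; the table and column names (`my_table`, `id`, `version`, `payload`) are hypothetical, and the two version numbers stand in for the streaming `Offset`s:

```scala
import java.sql.Connection

case class RowVersion(id: Long, version: Long, payload: String)

// Pull every row created or updated after startVersion and up to endVersion
// (the half-open interval between two offsets).
def rowVersionsBetween(conn: Connection, startVersion: Long, endVersion: Long): Vector[RowVersion] = {
  val stmt = conn.prepareStatement(
    "SELECT id, version, payload FROM my_table WHERE version > ? AND version <= ?")
  stmt.setLong(1, startVersion)
  stmt.setLong(2, endVersion)
  val rs = stmt.executeQuery()
  val rows = Vector.newBuilder[RowVersion]
  while (rs.next()) {
    rows += RowVersion(rs.getLong("id"), rs.getLong("version"), rs.getString("payload"))
  }
  rs.close()
  stmt.close()
  rows.result()
}
```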

Re: Spark has a compile dependency on scalatest

2016-10-30 Thread Jeremy Smith
[INFO] |  +- org.scala-lang:scala-reflect:jar:2.10.6:compile [INFO] |  \- com.fasterxml.jackson.module:jackson-module-paranamer:jar:2.6.5:compile [INFO] +- org.apache.ivy:ivy:jar:2.4.0:compile [INFO] +- oro:oro:jar:2.0.8:compile [INFO] +- net.razorvine:pyrolite:

Spark has a compile dependency on scalatest

2016-10-28 Thread Jeremy Smith
Hey everybody, just a heads up that Spark 2.0.1 currently has a compile dependency on ScalaTest 2.2.6. It comes from spark-core's dependency on spark-launcher, which has a transitive dependency on spark-tags, which in turn has a compile dependency on ScalaTest. This makes it impossible to use any other
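
One possible workaround, sketched in sbt syntax (the ScalaTest version shown is illustrative, and the Scala binary version suffix is an assumption): exclude the ScalaTest that leaks in through spark-tags so your project can bring its own.

```scala
// build.sbt (sketch): keep spark-core but drop the transitive ScalaTest it pulls in.
libraryDependencies += ("org.apache.spark" %% "spark-core" % "2.0.1")
  .exclude("org.scalatest", "scalatest_2.11")

// Then depend on whichever test framework/version you actually want.
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.0" % Test
```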

Re: Kafka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Jeremy Smith
+1 We're on CDH, and it will probably be a while before they support Kafka 0.10. At the same time, we don't use their Spark and we're looking forward to upgrading to 2.0.x and using Structured Streaming. I was just going to write our own Kafka Source implementation that uses the existing
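
For reference, a bare-bones skeleton of what such a custom source might look like against Spark's internal, version-dependent `Source` trait; the class name and the method bodies are placeholders, not a working Kafka 0.8/0.9 implementation:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.execution.streaming.{Offset, Source}

// Skeleton only: the real work (tracking consumer offsets, fetching records,
// turning them into a DataFrame) is elided.
class LegacyKafkaSource(sqlContext: SQLContext) extends Source {

  // Fixed schema for the records this source produces.
  override def schema: StructType = ???

  // Highest offset currently available, or None if nothing has arrived yet.
  override def getOffset: Option[Offset] = ???

  // All records between the two offsets, as a DataFrame.
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???

  override def stop(): Unit = ???
}
```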

Re: Not all KafkaReceivers processing the data. Why?

2016-09-14 Thread Jeremy Smith
Take a look at how the messages are actually distributed across the partitions. If the message keys have low cardinality, you might get poor distribution (e.g. all the messages actually land in only two of the five partitions, leading to what you see in Spark). If you take a look at the Kafka
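
A quick way to see this effect, as a rough sketch (using `hashCode` as a simplified stand-in for Kafka's actual key hashing, with hypothetical keys): with only two distinct keys, at most two of the five partitions ever receive data.

```scala
val numPartitions = 5
val messageKeys = Seq("tenantA", "tenantB", "tenantA", "tenantA", "tenantB", "tenantA")

// Partition assignment is roughly hash(key) % numPartitions, so the number of
// distinct keys, not message volume, determines how many partitions get used.
val perPartitionCounts = messageKeys
  .map(key => math.abs(key.hashCode) % numPartitions)
  .groupBy(identity)
  .mapValues(_.size)

println(perPartitionCounts)  // only the partitions hit by the two distinct keys show up
```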

Parquet partitioning / appends

2016-08-18 Thread Jeremy Smith
Hi, I'm running into an issue wherein Spark (both 1.6.1 and 2.0.0) fails with a GC overhead limit exceeded error when creating a DataFrame from a parquet-backed partitioned Hive table with a relatively large number of parquet files (~175 partitions, each containing many parquet files). If I
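
One mitigation that sometimes helps in this situation, sketched for a Spark 2.0 shell (the `spark` session is the shell's built-in one; the paths and the partition column `dt` are hypothetical): compact the many small parquet files in each partition into fewer, larger ones before querying.

```scala
// Read the partitioned table root directly, then rewrite it so each partition
// value ends up with far fewer parquet files than before.
val df = spark.read.parquet("/warehouse/events")

df.repartition(df("dt"))          // group rows by the partition column
  .write
  .partitionBy("dt")              // keep the same directory layout
  .mode("overwrite")
  .parquet("/warehouse/events_compacted")
```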

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Jeremy Smith
If you had a persistent, off-heap buffer of Arrow data on each executor, and you could get an iterator over that buffer from inside of a task, then you could conceivably define an RDD over it by just extending RDD and returning the iterator from the compute method. If you want to make a Dataset
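
A rough sketch of that idea (the `ArrowBufferHandle` trait and `MyRow` type are hypothetical stand-ins for however the executor-local, off-heap buffer is exposed to a task):

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

case class MyRow(id: Long, value: Double)

// Hypothetical handle that knows how to iterate an executor-local Arrow buffer;
// in practice it would look the buffer up locally rather than ship any data.
trait ArrowBufferHandle extends Serializable {
  def rowIterator(): Iterator[MyRow]
}

// One partition per buffer handle.
class ArrowPartition(val index: Int) extends Partition

class ArrowBackedRDD(sc: SparkContext, handles: IndexedSeq[ArrowBufferHandle])
    extends RDD[MyRow](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    handles.indices.map(i => new ArrowPartition(i): Partition).toArray

  // The iterator over the off-heap data is returned straight from compute,
  // so rows are consumed lazily inside the task.
  override def compute(split: Partition, context: TaskContext): Iterator[MyRow] =
    handles(split.index).rowIterator()
}
```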