I have a question about state tracking in Structured Streaming.
First, let me briefly explain my use case: given a mutable data source (e.g.
an RDBMS) from which we assume we can retrieve the set of newly created row
versions (a row version being a row that was created or updated between two
given `Offset`s,
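For concreteness, such a source could look roughly like the sketch below. It
targets Spark 2.0's internal `Source` trait (not a stable public API), and the
`version` column, the JDBC wiring, and the class name are illustrative
assumptions on my part:

import java.util.Properties

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
import org.apache.spark.sql.types.StructType

// Sketch only: assumes the table carries a monotonically increasing
// "version" column that is bumped on every insert and update.
class RowVersionSource(sqlContext: SQLContext, url: String, table: String)
    extends Source {

  private val props = new Properties

  override def schema: StructType =
    sqlContext.read.jdbc(url, table, props).schema

  // The highest row version currently visible in the table, if any.
  override def getOffset: Option[Offset] = {
    val row = sqlContext.read
      .jdbc(url, s"(SELECT MAX(version) AS v FROM $table) t", props)
      .collect().head
    if (row.isNullAt(0)) None else Some(LongOffset(row.getLong(0)))
  }

  // All rows created or updated between the two offsets.
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
    val lo = start.collect { case LongOffset(v) => v }.getOrElse(-1L)
    val hi = end.asInstanceOf[LongOffset].offset
    sqlContext.read.jdbc(
      url,
      s"(SELECT * FROM $table WHERE version > $lo AND version <= $hi) t",
      props)
  }

  override def stop(): Unit = ()
}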
>> [INFO] |  +- org.scala-lang:scala-reflect:jar:2.10.6:compile
>> [INFO] |  \- com.fasterxml.jackson.module:jackson-module-paranamer:jar:2.6.5:compile
>> [INFO] +- org.apache.ivy:ivy:jar:2.4.0:compile
>> [INFO] +- oro:oro:jar:2.0.8:compile
>> [INFO] +- net.razorvine:pyrolite:
Hey everybody,
Just a heads-up that Spark 2.0.1 currently has a compile-scope dependency on
ScalaTest 2.2.6. It comes in through spark-core's dependency on spark-launcher,
which depends transitively on spark-tags, which in turn has a compile
dependency on ScalaTest.
This makes it impossible to use any other
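A possible workaround, sketched here for sbt (the artifact name assumes Scala
2.11, and the version numbers are illustrative), is to exclude the leaked
ScalaTest and pin your own:

libraryDependencies ++= Seq(
  // Drop the ScalaTest 2.2.6 that leaks in through spark-tags.
  ("org.apache.spark" %% "spark-core" % "2.0.1")
    .exclude("org.scalatest", "scalatest_2.11"),
  // Pin whatever ScalaTest version the project actually wants.
  "org.scalatest" %% "scalatest" % "3.0.0" % Test
)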
+1
We're on CDH, and it will probably be a while before they support Kafka
0.10. At the same time, we don't use their Spark build, and we're looking
forward to upgrading to 2.0.x and using Structured Streaming.
I was just going to write our own Kafka Source implementation which uses
the existing
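For reference, in 2.0 a custom source gets wired into spark.readStream through
the StreamSourceProvider and DataSourceRegister interfaces (both internal-ish
and subject to change). A skeleton, with the actual Kafka-backed Source left as
a stub, might look like:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Source
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
import org.apache.spark.sql.types.{BinaryType, StructField, StructType}

class CustomKafkaProvider extends StreamSourceProvider with DataSourceRegister {

  // Name used as spark.readStream.format("custom-kafka").
  override def shortName(): String = "custom-kafka"

  private val fixedSchema = StructType(Seq(
    StructField("key", BinaryType),
    StructField("value", BinaryType)))

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (shortName(), fixedSchema)

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source =
    ??? // plug the actual Kafka-consuming Source implementation in here
}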
Take a look at how the messages are actually distributed across the
partitions. If the message keys have low cardinality, you might get poor
distribution (e.g. all the messages actually land in only two of the five
partitions, which would lead to what you see in Spark).
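A quick way to check is to diff the earliest and latest offsets per partition
with the 0.10 consumer; a sketch (broker address and topic name are
placeholders, and retention/compaction can skew the raw counts):

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

val props = new Properties
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "partition-skew-check")
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.ByteArrayDeserializer")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val parts = consumer.partitionsFor("my-topic").asScala
  .map(pi => new TopicPartition(pi.topic, pi.partition))
consumer.assign(parts.asJava)

// Record the earliest offset per partition...
consumer.seekToBeginning(parts.asJava)
val earliest = parts.map(tp => tp -> consumer.position(tp)).toMap
// ...then subtract it from the latest to get per-partition message counts.
consumer.seekToEnd(parts.asJava)
parts.foreach { tp =>
  println(s"$tp: ${consumer.position(tp) - earliest(tp)} messages")
}
consumer.close()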
If you take a look at the Kafka
Hi,
I'm running into an issue where Spark (both 1.6.1 and 2.0.0) fails with a
"GC overhead limit exceeded" error when creating a DataFrame from a
Parquet-backed partitioned Hive table with a relatively large number of
Parquet files (~175 partitions, and each partition contains many Parquet
files). If I
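(Not from the report above, but a common first diagnostic for this failure
mode is to bypass the metastore and read a single partition directory straight
from spark-shell; the path and partition column below are made up. If that
succeeds, driver-side file/partition listing is the likely culprit, and more
driver memory, e.g. spark-submit --driver-memory 8g, often gets the full-table
read through.)

// Illustrative path; substitute the table's actual warehouse location.
val onePartition = sqlContext.read.parquet("hdfs:///warehouse/my_table/ds=2016-09-01")
onePartition.count()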
If you had a persistent, off-heap buffer of Arrow data on each executor,
and you could get an iterator over that buffer from inside of a task, then
you could conceivably define an RDD over it by just extending RDD and
returning the iterator from the compute method. If you want to make a
Dataset
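Roughly, and with an entirely hypothetical registry standing in for the
executor-local Arrow buffer, such an RDD could look like:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical executor-local registry; a real one would expose the records
// in the off-heap Arrow buffer. Stubbed out so the sketch compiles.
object ArrowBufferRegistry {
  def iteratorFor(partitionIndex: Int): Iterator[Array[Byte]] = Iterator.empty
}

case class BufferPartition(index: Int) extends Partition

class ArrowBufferRDD(sc: SparkContext, numSlices: Int)
    extends RDD[Array[Byte]](sc, Nil) {

  // One RDD partition per slice of the buffer.
  override protected def getPartitions: Array[Partition] =
    (0 until numSlices).map(i => BufferPartition(i): Partition).toArray

  // The task body just hands back the iterator over the local buffer.
  override def compute(split: Partition, context: TaskContext): Iterator[Array[Byte]] =
    ArrowBufferRegistry.iteratorFor(split.index)
}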