I am trying to deduplicate streaming data using the dropDuplicates
function with a watermark. The problem I am facing currently is that I have
two timestamps for a given record:
1. One is the eventtimestamp - the timestamp of the record's creation at the
source.
2. Another is a transfer timestamp -
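One plausible shape for this, as a minimal sketch rather than the poster's actual job: the column names "id" and "eventtimestamp" are hypothetical, and only the event timestamp is used for the watermark and the dedup key.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StreamingDedup").getOrCreate()
import spark.implicits._

// Placeholder source for illustration; a real job would read from Kafka or files.
val events = spark.readStream
  .format("rate")
  .load()
  .withColumn("id", $"value" % 100)                 // hypothetical record key
  .withColumnRenamed("timestamp", "eventtimestamp") // hypothetical event time

// Keep dedup state for 10 minutes of event time; duplicates of the same
// (id, eventtimestamp) pair arriving within the watermark are dropped.
val deduped = events
  .withWatermark("eventtimestamp", "10 minutes")
  .dropDuplicates("id", "eventtimestamp")

deduped.writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()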
I want to write a Spark Structured Streaming Kafka consumer which reads data
from a one-partition Kafka topic, repartitions the incoming data by "key" into
3 Spark partitions while keeping the messages ordered per key, and writes
them to another Kafka topic with 3 partitions.
I used
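For what it's worth, a minimal sketch of such a pipeline; the topic names, broker address, and checkpoint path are all hypothetical. The Kafka sink hashes the record key to pick among the output partitions, which keeps all messages for a given key in one ordered partition.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("KafkaRekey").getOrCreate()

// Single-partition source topic, so the input is already globally ordered.
val in = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // hypothetical broker
  .option("subscribe", "in-topic")                      // hypothetical topic
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// Writing with the original key lets Kafka's default partitioner route each
// key to a stable partition of the 3-partition output topic.
in.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "out-topic")                          // 3-partition topic
  .option("checkpointLocation", "/tmp/rekey-checkpoint") // hypothetical path
  .start()
  .awaitTermination()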
James,
How do you create an instance of `RDD[Iterable[MyCaseClass]]`?
Is it in that first code snippet? > new SparkContext(sc).parallelize(seq)?
kr, Gerard
On Fri, Nov 30, 2018 at 3:02 PM James Starks wrote:
> When processing data, I create an instance of RDD[Iterable[MyCaseClass]]
> and
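For context, a minimal sketch of how parallelize can produce that type, with a hypothetical case class standing in:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

case class MyCaseClass(id: Long, name: String)

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("demo"))

// Parallelizing a Seq[Iterable[MyCaseClass]] yields an RDD[Iterable[MyCaseClass]].
val seq: Seq[Iterable[MyCaseClass]] = Seq(
  Seq(MyCaseClass(1L, "a"), MyCaseClass(2L, "b")),
  Seq(MyCaseClass(3L, "c"))
)
val rdd: RDD[Iterable[MyCaseClass]] = sc.parallelize(seq)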
Hello,
I'm trying to mount a local Ceph volume into my Mesos container.
My CephFS is mounted on all agents at /ceph.
I'm using Spark 2.4 with Hadoop 3.1.1, and I'm not using Docker to deploy Spark.
The only option I could find to mount a volume, though, is the following (which
is also a line I added
Hi James.
--num-executors is used to control the number of parallel tasks (one per
executor) running for your application. For reading and writing data in
parallel, data partitioning is employed. You can look here for a quick intro
to how data partitioning works:
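To make that concrete, a minimal sketch (the parquet path is hypothetical): the number of read tasks comes from the input partitions, and repartition adjusts downstream parallelism.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PartitionIntro").getOrCreate()

// One task per input partition; --num-executors only adds workers to run them.
val df = spark.read.parquet("/data/events")  // hypothetical path
println(s"input partitions: ${df.rdd.getNumPartitions}")

// Raise (or lower) downstream parallelism explicitly when needed.
val wider = df.repartition(48)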
Curious why you think this is not smart code?
On Mon, Dec 3, 2018 at 8:04 AM James Starks
wrote:
> By taking your advice to use flatMap, I can now convert the result from
> RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]. Basically I just perform
> flatMap at the end before converting the RDD
Reading the Spark doc
(https://spark.apache.org/docs/latest/sql-data-sources-parquet.html), it's not
mentioned how to read a Parquet file in parallel with SparkSession. Would
--num-executors just work? Do any additional parameters need to be added to
SparkSession as well?
Also, if I want to parallel
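A hedged sketch of the relevant knobs, with a hypothetical path: parquet reads are parallelized by input splits, so the split size setting matters at least as much as --num-executors.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ParallelParquet")
  // Smaller splits -> more read tasks (Spark SQL's file split-size knob).
  .config("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)
  .getOrCreate()

val df = spark.read.parquet("/data/warehouse/events")  // hypothetical path
println(s"read tasks: ${df.rdd.getNumPartitions}")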
By taking your advice to use flatMap, I can now convert the result from
RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]. Basically I just perform
flatMap at the end before converting the RDD object back to a DF (i.e.
SparkSession.createDataFrame(rddRecordsOfMyCaseClass)). For instance,
df.map {
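A minimal sketch of that flow end to end; the case class is hypothetical, and `nested` stands in for the RDD produced earlier in the job.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class MyCaseClass(id: Long, name: String)

val spark = SparkSession.builder().appName("FlattenToDF").getOrCreate()

val nested: RDD[Iterable[MyCaseClass]] =
  spark.sparkContext.parallelize(Seq(Seq(MyCaseClass(1L, "a"), MyCaseClass(2L, "b"))))

// flatMap flattens each Iterable into individual records...
val flat: RDD[MyCaseClass] = nested.flatMap(identity)

// ...which then go back to a DataFrame.
val df = spark.createDataFrame(flat)
df.show()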
Hi,
There are not many details in the mail, so it's hard to tell exactly. Maybe
you're facing
https://issues.apache.org/jira/browse/SPARK-19275
BR,
G
On Mon, Dec 3, 2018 at 10:32 AM JF Chen wrote:
> Some kafka consumer tasks throw "failed to get records for spark-executor
> after polling for ***"
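If it is that issue, one commonly cited mitigation is raising the consumer poll timeout in the spark-streaming-kafka-0-10 integration; a hedged sketch, since whether it helps depends on the actual root cause:

import org.apache.spark.SparkConf

// Raise the poll timeout used by the cached Kafka consumer on executors.
// When unset, it falls back to spark.network.timeout.
val conf = new SparkConf()
  .setAppName("KafkaPollTimeout")
  .set("spark.streaming.kafka.consumer.poll.ms", "10000")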
Some Kafka consumer tasks sometimes throw a "failed to get records for
spark-executor after polling for ***" error. In detail, some tasks take a
very long time and then throw this error. However, when the task restarts, it
recovers very quickly.
My Spark version is 2.2.0.
Regards,
Junfeng Chen