Spark Structured Streaming - dropDuplicates with watermark

2018-12-03 Thread Nirmal Manoharan
I am trying to deduplicate streaming data using the dropDuplicates function with a watermark. The problem I am currently facing is that I have two timestamps for a given record: 1. One is the eventtimestamp - the timestamp of the record's creation at the source. 2. The other is a transfer timestamp -
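A minimal sketch of the watermark-plus-dropDuplicates pattern the question is about. The column names (`id`, `eventTimestamp`, `transferTimestamp`) and the 10-minute watermark are assumptions for illustration, not taken from the original mail, and the `rate` source is only a stand-in for the real input:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch under assumptions: `events` is a streaming DataFrame that already
// exposes an `id` column plus the two timestamps described in the question.
val spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

val events: DataFrame = spark.readStream.format("rate").load()
  // ...replace with the real source and parse out
  // (id, eventTimestamp, transferTimestamp) columns...

val deduped = events
  .withWatermark("eventTimestamp", "10 minutes")
  // The event-time column must be part of the dedup key for the watermark
  // to bound the state; a duplicate arriving after the watermark has moved
  // past its event time may no longer be dropped.
  .dropDuplicates("id", "eventTimestamp")
```

Which of the two timestamps to watermark on is the crux of the question: the watermark must be on the column used in the dedup key, so deduplicating on the event timestamp while late arrival is governed by the transfer timestamp needs a watermark large enough to cover the transfer delay.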

How to preserve event order per key in Structured Streaming Repartitioning By Key?

2018-12-03 Thread pmatpadi
I want to write a Spark Structured Streaming Kafka consumer that reads data from a one-partition Kafka topic, repartitions the incoming data by "key" into 3 Spark partitions while keeping the messages ordered per key, and writes them to another Kafka topic with 3 partitions. I used
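A hedged sketch of the repartition-by-key step. Topic names, broker address, and the checkpoint path are placeholders; hash partitioning guarantees that all rows with a given key land in the same Spark partition, but Spark makes no cross-microbatch ordering promise, so the per-key ordering still has to be verified for the sink in question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()

val in = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "in-topic")
  .load() // exposes binary `key` and `value` columns

// Hash-partition by key: same key -> same Spark partition. Within one
// microbatch, rows read from a single source partition keep their order.
val out = in.repartition(3, col("key"))

out.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "out-topic")
  .option("checkpointLocation", "/tmp/checkpoint-sketch")
  .start()
```

Note that the Kafka sink partitions by the message key on the broker side as well, so the 3 output-topic partitions will also group by key independently of the Spark-side repartition.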

Re: Convert RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]

2018-12-03 Thread Gerard Maas
James, How do you create an instance of `RDD[Iterable[MyCaseClass]]` ? Is it in that first code snippet? > new SparkContext(sc).parallelize(seq)? kr, Gerard On Fri, Nov 30, 2018 at 3:02 PM James Starks wrote: > When processing data, I create an instance of RDD[Iterable[MyCaseClass]] > and

Using spark and mesos container with host_path volume

2018-12-03 Thread Antoine DUBOIS
Hello, I'm trying to mount a local Ceph volume in my Mesos container. My CephFS is mounted on all agents at /ceph. I'm using Spark 2.4 with Hadoop 3.11, and I'm not using Docker to deploy Spark. The only option I could find to mount a volume, though, is the following (which is also a line I added

Re: Parallel read parquet file, write to postgresql

2018-12-03 Thread Shahab Yunus
Hi James. --num-executors is used to control the number of parallel tasks (one per executor) running for your application. For reading and writing data in parallel, data partitioning is employed. You can look here for a quick intro to how data partitioning works:

Re: Convert RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]

2018-12-03 Thread Shahab Yunus
Curious why you think this is not smart code? On Mon, Dec 3, 2018 at 8:04 AM James Starks wrote: > By taking your advice to use flatMap, I can now convert the result from > RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]. Basically I just perform > flatMap at the end before starting to convert the RDD

Parallel read parquet file, write to postgresql

2018-12-03 Thread James Starks
Reading the Spark doc (https://spark.apache.org/docs/latest/sql-data-sources-parquet.html), it's not mentioned how to read a parquet file in parallel with SparkSession. Would --num-executors just work? Do any additional parameters need to be added to SparkSession as well? Also, if I want to parallel
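A sketch of the read-parquet-then-write-to-Postgres flow being asked about. Parquet reads are already split per file/row-group, so parallelism comes from the partition count rather than --num-executors alone; the JDBC URL, table, credentials, and the partition count of 8 below are all placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pq-to-pg").getOrCreate()

// Parquet input is split automatically; spark.sql.files.maxPartitionBytes
// controls the split size if finer tuning is needed.
val df = spark.read.parquet("/data/input.parquet")

df.repartition(8) // one JDBC connection is opened per partition on write
  .write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")
  .option("dbtable", "public.my_table")
  .option("user", "postgres")
  .option("password", "secret")
  .mode("append")
  .save()
```

The repartition count is a trade-off: more partitions mean more concurrent writers, but also more simultaneous connections for Postgres to absorb.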

Re: Convert RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]

2018-12-03 Thread James Starks
By taking your advice to use flatMap, I can now convert the result from RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]. Basically I just perform flatMap at the end, before converting the RDD object back to a DF (i.e. SparkSession.createDataFrame(rddRecordsOfMyCaseClass)). For instance, df.map {
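The flattening step the thread converged on can be shown in plain Scala, since flatMap behaves the same way on an RDD as on a local collection; the MyCaseClass field below is a stand-in for the poster's type:

```scala
// Flatten a nested collection of case-class instances with flatMap(identity);
// on an RDD, the equivalent call turns RDD[Iterable[MyCaseClass]] into
// RDD[MyCaseClass], ready for SparkSession.createDataFrame.
case class MyCaseClass(id: Int)

val nested: Seq[Iterable[MyCaseClass]] =
  Seq(Seq(MyCaseClass(1), MyCaseClass(2)), Seq(MyCaseClass(3)))

val flat: Seq[MyCaseClass] = nested.flatMap(identity)
// flat == Seq(MyCaseClass(1), MyCaseClass(2), MyCaseClass(3))
```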

unsubscribe

2018-12-03 Thread Kappaganthu, Sivaram (CORP)
From: Kappaganthu, Sivaram (CORP) Sent: Saturday, December 1, 2018 10:18 PM To: user@spark.apache.org Subject: unsubscribe

Re: "failed to get records for spark-executor after polling for ***" error

2018-12-03 Thread Gabor Somogyi
Hi, There are not many details in the mail, so it's hard to tell exactly. Maybe you're facing https://issues.apache.org/jira/browse/SPARK-19275 BR, G On Mon, Dec 3, 2018 at 10:32 AM JF Chen wrote: > Some kafka consumer tasks throw "failed to get records for spark-executor > after polling for ***"

"failed to get records for spark-executor after polling for ***" error

2018-12-03 Thread JF Chen
Some Kafka consumer tasks sometimes throw a "failed to get records for spark-executor after polling for ***" error. In detail, some tasks take a very long time and then throw this error. However, when the task restarts, it recovers very quickly. My Spark version is 2.2.0. Regards, Junfeng Chen