Spark 2.x duplicates output when task fails at "repartition" stage. Checkpointing is enabled before repartition.

2019-02-05 Thread Serega Sheypak
Hi, I have a Spark job that produces duplicates when one or more tasks from the repartition stage fail. Here is the simplified code: sparkContext.setCheckpointDir("hdfs://path-to-checkpoint-dir") val inputRDDs: List[RDD[String]] = List.empty // an RDD per input dir val updatedRDDs = inputRDDs.map{
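
The snippet is cut off above; a minimal sketch of the job shape being described (the input directories, the per-RDD transformation, and the output path are assumptions, not the original code) could look like this:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def runJob(sc: SparkContext, inputDirs: Seq[String]): Unit = {
      sc.setCheckpointDir("hdfs://path-to-checkpoint-dir")

      // One RDD per input dir, as in the original message.
      val inputRDDs: Seq[RDD[String]] = inputDirs.map(dir => sc.textFile(dir))

      val updatedRDDs = inputRDDs.map { rdd =>
        val updated = rdd.map(_.trim)   // placeholder transformation
        updated.checkpoint()            // checkpoint before the repartition stage
        updated
      }

      updatedRDDs.foreach { rdd =>
        // The stage whose failed-and-retried tasks reportedly produce duplicate output.
        rdd.repartition(100).saveAsTextFile("hdfs://path-to-output-dir")
      }
    }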

Re: DataSourceV2 producing wrong date value in Custom Data Writer

2019-02-05 Thread Ryan Blue
Shubham, DataSourceV2 passes Spark's internal representation to your source and expects Spark's internal representation back from the source. That's why you consume and produce InternalRow: "internal" indicates that Spark doesn't need to convert the values. Spark's internal representation for a
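
For DateType that internal representation is an Int holding the number of days since the Unix epoch, so "2019-02-05" reaches the DataWriter as 17932 rather than a string or java.sql.Date. A minimal sketch of converting it back (the column ordinal 0 is an assumption):

    import java.time.LocalDate
    import org.apache.spark.sql.catalyst.InternalRow

    // DateType is stored internally as days since 1970-01-01.
    def readDate(row: InternalRow): LocalDate =
      LocalDate.ofEpochDay(row.getInt(0).toLong)   // e.g. 17932 -> 2019-02-05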

Re: Structured streaming from Kafka by timestamp

2019-02-05 Thread Cody Koeninger
To be more explicit, the easiest thing to do in the short term is to use your own instance of KafkaConsumer to get the offsets for the timestamps you're interested in, using offsetsForTimes, and use those for the start / end offsets. See
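
A sketch of that short-term workaround using the plain Kafka consumer API (broker address, topic name, and timestamp are placeholders):

    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition
    import org.apache.kafka.common.serialization.StringDeserializer

    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    val partitions = consumer.partitionsFor("my-topic").asScala
      .map(info => new TopicPartition(info.topic, info.partition))

    // First offset at or after this timestamp (in milliseconds), per partition.
    val targetMillis: java.lang.Long = 1549324800000L
    val query = partitions.map(tp => tp -> targetMillis).toMap.asJava

    val startOffsets = consumer.offsetsForTimes(query).asScala.collect {
      case (tp, oat) if oat != null => tp -> oat.offset()
    }
    consumer.close()

    // startOffsets can then be serialized into the JSON expected by the
    // Kafka source's startingOffsets / endingOffsets options.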

Re: Back pressure not working on streaming

2019-02-05 Thread Cody Koeninger
That article is pretty old. If you click through to the JIRA it links, https://issues.apache.org/jira/browse/SPARK-18580 , you'll see it has been resolved. On Wed, Jan 2, 2019 at 12:42 AM JF Chen wrote: > > yes, 10 is a very low value for testing initial rate. > And from this article >
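
For reference, the knobs being discussed are the DStream backpressure settings; a hedged sketch (the values are illustrative, not recommendations):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Let the rate estimator adjust ingestion to processing speed.
      .set("spark.streaming.backpressure.enabled", "true")
      // Rate used for the first batches, before the estimator has feedback;
      // a very low value such as 10 only makes sense for testing.
      .set("spark.streaming.backpressure.initialRate", "1000")
      // Hard per-partition cap for the Kafka direct stream, applied even
      // when backpressure is enabled.
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")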

DataSourceV2 producing wrong date value in Custom Data Writer

2019-02-05 Thread Shubham Chaurasia
Hi All, I am using a custom DataSourceV2 implementation (Spark version 2.3.2). Here is how I am trying to pass in a date type from spark shell. scala> val df = sc.parallelize(Seq("2019-02-05")).toDF("datetype").withColumn("datetype", col("datetype").cast("date")) scala>
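
The message is truncated at the second prompt; a hedged continuation of the shell session (the format name com.example.MyV2Source is a placeholder for the custom source, and toDF relies on the implicits spark-shell imports automatically):

    import org.apache.spark.sql.functions.col

    val df = sc.parallelize(Seq("2019-02-05")).toDF("datetype")
      .withColumn("datetype", col("datetype").cast("date"))

    df.printSchema()   // datetype should now be DateType
    // Hand the DateType column to the custom DataSourceV2 writer.
    df.write.format("com.example.MyV2Source").save()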