Re: Spark Kafka Batch Write guarantees

2019-04-01 Thread hemant singh
Thanks Shixiong, I read in the documentation as well that duplicates might exist because of task retries. On Mon, 1 Apr 2019 at 9:43 PM, Shixiong(Ryan) Zhu wrote: > The Kafka source doesn't support transactions. You may see partial data or > duplicated data if a Spark task fails. > > On Wed, Mar 27,
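Since delivery to Kafka is at-least-once, the usual practical fix for retry-induced duplicates is consumer-side deduplication on a stable key. A minimal pure-Python sketch of the idea (the record shape and the `key` field name are illustrative, not from the thread):

```python
# Consumer-side deduplication: the Kafka sink is at-least-once, so a
# retried task can re-emit records. With a stable key per record
# (field name "key" is illustrative), the reader can drop repeats.
def dedupe(records):
    """Return each record once, keyed on record['key']."""
    seen = set()
    out = []
    for rec in records:
        if rec["key"] not in seen:
            seen.add(rec["key"])
            out.append(rec)
    return out

batch = [
    {"key": "a", "value": 1},
    {"key": "b", "value": 2},
    {"key": "a", "value": 1},  # re-delivered after a task retry
]
deduped = dedupe(batch)  # two unique records survive
```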

[Spark ML] [Pyspark] [Scenario Beginner] [Level Beginner]

2019-04-01 Thread Steve Pruitt
After following a tutorial on recommender systems using PySpark / Spark ML, I decided to jump in with my own dataset. I am specifically trying to predict video suggestions based on an implicit feedback signal: the time a video was watched. I wrote a generator to produce my dataset. I have a
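For watch-time data like this, implicit-feedback ALS (Spark ML's `ALS` with `implicitPrefs=True`) treats the raw signal not as a rating but as a confidence on a binary preference, conventionally c = 1 + alpha * r. A small pure-Python sketch of that preprocessing step (`ALPHA` is an assumed tuning value, not something from the thread):

```python
# Implicit-feedback preprocessing: watch time is not a rating, so
# implicit ALS (Spark ML's ALS with implicitPrefs=True) treats it as a
# confidence on a binary preference, conventionally c = 1 + alpha * r.
# ALPHA is an assumed tuning value, not something from the thread.
ALPHA = 40.0

def preference(watch_seconds: float) -> int:
    """Binary preference: any watching counts as positive."""
    return 1 if watch_seconds > 0 else 0

def confidence(watch_seconds: float) -> float:
    """Confidence in that preference, growing with watch time."""
    return 1.0 + ALPHA * watch_seconds

# Unwatched items keep baseline confidence 1.0; watched items are
# weighted up in proportion to watch time.
pairs = [(w, preference(w), confidence(w)) for w in (0.0, 2.0)]
```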

Re: Spark streaming error - Query terminated with exception: assertion failed: Invalid batch: a#660,b#661L,c#662,d#663, ... 26 more fields != b#1291L

2019-04-01 Thread Shixiong(Ryan) Zhu
Could you try to use $"a" rather than df("a")? The latter sometimes doesn't work. On Thu, Mar 21, 2019 at 10:41 AM kineret M wrote: > I try to read a stream using my custom data source (v2, using Spark 2.3), > and it fails *in the second iteration* with the following exception while >

Re: Spark Kafka Batch Write guarantees

2019-04-01 Thread Shixiong(Ryan) Zhu
The Kafka source doesn't support transactions. You may see partial data or duplicated data if a Spark task fails. On Wed, Mar 27, 2019 at 1:15 AM hemant singh wrote: > We are using Spark batch to write a DataFrame to a Kafka topic. The Spark > write function with write.format(source = "kafka"). > Does

Re: Spark SQL API taking longer time than DF API.

2019-04-01 Thread neeraj bhadani
In both cases, I am trying to create a Hive table based on a union of the same two queries. Not sure how the process of creating the Hive table differs internally. Regards, Neeraj On Sun, Mar 31, 2019 at 1:29 PM Jörn Franke wrote: > Is the select taking longer, or the saving to a file? You