Re: [Structured Streaming] [Kafka] How to repartition the data and distribute the processing among worker nodes

2018-04-20 Thread Bowden, Chris
The primary role of a sink is storing output tuples. Consider groupByKey and map/flatMapGroupsWithState instead. -Chris
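A minimal sketch of the suggested approach (assuming a hypothetical Event record type, a running-sum state, and placeholder Kafka broker/topic names; none of these come from the original thread):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

  // Hypothetical record and state types, for illustration only.
  case class Event(key: String, value: Long)
  case class RunningTotal(sum: Long)

  val spark = SparkSession.builder.appName("StatefulSketch").getOrCreate()
  import spark.implicits._

  val events = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
    .option("subscribe", "events")                       // placeholder topic
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .as[(String, String)]
    .map { case (k, v) => Event(k, v.toLong) }

  // groupByKey shuffles the data by key, so the per-group function below
  // runs in parallel across the executors rather than in the sink.
  val totals = events
    .groupByKey(_.key)
    .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout) {
      (key: String, rows: Iterator[Event], state: GroupState[RunningTotal]) =>
        val previous = state.getOption.getOrElse(RunningTotal(0L))
        val updated = RunningTotal(previous.sum + rows.map(_.value).sum)
        state.update(updated)
        Iterator((key, updated.sum))
    }
  // totals can then be written with writeStream in update output mode.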

Spark with Scala 2.12

2018-04-20 Thread Jatin Puri
Hello. I am wondering if there is any new update on the Spark upgrade to Scala 2.12: https://issues.apache.org/jira/browse/SPARK-14220. This is especially relevant given that Scala 2.13 is nearing release. I ask because there has been no recent update on the Jira and related ticket. Maybe someone is

Get application id when using SparkSubmit.main from java

2018-04-20 Thread Ron Gonzalez
Hi, I am trying to get the application id after I use SparkSubmit.main for a YARN submission. I am able to make it asynchronous using the spark.yarn.submit.waitAppCompletion=false configuration option, but I can't seem to figure out how to get the application id for this job. I read both
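The preview cuts off here, but a common alternative to invoking SparkSubmit.main directly is the org.apache.spark.launcher.SparkLauncher API, whose SparkAppHandle exposes the application id. A minimal sketch (the jar path, main class, and polling interval are placeholders):

  import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

  val handle: SparkAppHandle = new SparkLauncher()
    .setAppResource("/path/to/app.jar") // placeholder
    .setMainClass("com.example.Main")   // placeholder
    .setMaster("yarn")
    .setDeployMode("cluster")
    .startApplication()

  // getAppId returns null until YARN has accepted the application, so
  // poll (or register a SparkAppHandle.Listener) until it is available.
  while (handle.getAppId == null && !handle.getState.isFinal) {
    Thread.sleep(500)
  }
  println(s"YARN application id: ${handle.getAppId}")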

[Structured Streaming] [Kafka] How to repartition the data and distribute the processing among worker nodes

2018-04-20 Thread karthikjay
Any help appreciated. Please find the question at this link: https://stackoverflow.com/questions/49951022/spark-structured-streaming-with-kafka-how-to-repartition-the-data-and-distribu

[Structured Streaming] Restarting streaming query on exception/termination

2018-04-20 Thread Priyank Shrivastava
What's the right way to programmatically restart a structured streaming query that has terminated due to an exception? Example code or a reference would be appreciated. Could it be done from within the onQueryTerminated() event handler of the StreamingQueryListener class? Priyank
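One common pattern (a sketch under the assumption that the query can simply be rebuilt; startQuery below is a hypothetical factory function, not a Spark API) is a supervision loop around awaitTermination in the driver, rather than restarting from inside the listener callback:

  import org.apache.spark.sql.streaming.{StreamingQuery, StreamingQueryException}

  // startQuery is a hypothetical helper that defines and starts the query.
  def superviseForever(startQuery: () => StreamingQuery): Unit = {
    while (true) {
      val query = startQuery()
      try {
        query.awaitTermination() // blocks; throws if the query failed
      } catch {
        case e: StreamingQueryException =>
          println(s"Query terminated with error, restarting: ${e.getMessage}")
      }
    }
  }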

Spark read parquet with unnamed index

2018-04-20 Thread Lord, Jesse
When reading a Parquet file created from a pandas dataframe with an unnamed index, Spark creates a column named “__index_level_0__”, since Spark DataFrames do not support row indexing. This looks like it is probably a bug to me, since as a Spark user I would expect unnamed index columns to be dropped
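One workaround (an assumption, not a confirmed fix from this thread) is to drop the pandas-generated column explicitly after reading; on the pandas side, writing with to_parquet(..., index=False) avoids emitting the column in the first place:

  // Placeholder path; drop is a no-op if the column is absent.
  val df = spark.read.parquet("/path/to/pandas_output.parquet")
  val cleaned = df.drop("__index_level_0__")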

[Structured Streaming][Kafka] For a Kafka topic with 3 partitions, how does the parallelism work?

2018-04-20 Thread karthikjay
I have the following code to read data from a Kafka topic using Structured Streaming. The topic has 3 partitions:

  val spark = SparkSession
    .builder
    .appName("TestPartition")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  val dataFrame = spark
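The archive truncates the snippet at "val dataFrame = spark". A typical continuation for a Kafka source (the broker address and topic name are assumptions, not taken from the original mail) would be:

  val dataFrame = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
    .option("subscribe", "test-topic")                   // assumed topic
    .load()

  // The Kafka source creates one Spark partition per Kafka topic partition,
  // so a 3-partition topic yields at most 3 concurrent tasks unless the
  // data is repartitioned downstream.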

unsubscribe

2018-04-20 Thread varma dantuluri
unsubscribe -- Regards, Varma Dantuluri

--driver-memory allocation question

2018-04-20 Thread klrmowse
Newb question... say memory per node is 16GB for 6 nodes (a total of 96GB for the cluster). Is 16GB the maximum amount of memory that can be allocated to the driver (since it is, after all, 16GB per node)? Thanks
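For reference, --driver-memory is a per-process setting: the driver JVM runs on a single node, so it is bounded by that one node's memory (minus OS and YARN overhead), not by the 96GB cluster total. A sketch with illustrative values only:

  # The driver must fit on ONE 16GB node, so request less than 16g;
  # the other nodes' memory is only usable by executors.
  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 12g \
    --executor-memory 12g \
    my-app.jar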

Re: [Spark 2.x Core] Job writing out an extra empty part-0000* file

2018-04-20 Thread klrmowse
Well... it turns out that the extra part-* file goes away when I limit --num-executors to 1 or 2 (leaving it at the default maxes it out, which in turn produces an extra empty part file). I guess the test data I'm using only requires that many executors.
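A related workaround sometimes suggested for this symptom (an assumption here, not something stated in this thread) is to coalesce before writing, so the number of output files tracks the data rather than the executor count:

  // df and the partition count are hypothetical; coalesce merges partitions
  // so empty ones don't each produce an empty part-0000* file.
  val result = df.coalesce(2)
  result.write.parquet("/path/to/output") // placeholder path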