Re: [pyspark2.4+] When to choose RDD over Dataset, was: A lot of tasks failed, but job eventually completes

2020-01-06 Thread Enrico Minack
Hi Rishi, generally it is better to avoid RDDs if you can and use the Dataset API instead. With Datasets (formerly DataFrames), Spark can optimize your query / tree of transformations, whereas RDDs are opaque to the optimizer. Datasets have an optimized memory footprint. Pure Dataset operations provide you helpful information

Unsubscribe

2020-01-06 Thread Rishabh Pugalia
Unsubscribe

Fwd: [Spark Streaming]: Why my Spark Direct stream is sending multiple offset commits to Kafka?

2020-01-06 Thread Raghu B
Hi Spark Community. I need help with the following issue. I have been researching it for the last 2 weeks, and as a last and best resource I want to ask the Spark community. I am running the following code in Spark: val sparkConf = new SparkConf().setMaster("local[*]")

unsubscribe

2020-01-06 Thread Bruno S. de Barros
unsubscribe

Re: [pyspark2.4+] A lot of tasks failed, but job eventually completes

2020-01-06 Thread Rishi Shah
Thank you Hemant and Enrico. Much appreciated. Your input really got me closer to the issue: I realized that tasks were not getting enough memory, and hence tasks with large partitions kept failing. I increased executor memory and, at the same time, increased the number of partitions as well. This made the

Re: OrderBy Year and Month is not displaying correctly

2020-01-06 Thread Mich Talebzadeh
"The distinct transformation does not preserve order, you need to distinct first, then orderby." Thanks Enrico. You are correct. Worked fine! joint_accounts.select(year(col("transactiondate")).as("Year"), month(col("transactiondate")).as("Month"),

Re: OrderBy Year and Month is not displaying correctly

2020-01-06 Thread Gourav Sengupta
Or just use SQL, which is less verbose, easily readable, and takes care of all such scenarios. For some weird reason, I have found that people using the DataFrame APIs have a perception that using SQL is less intelligent. But I think that using less effort to get better output can be a measure of
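
To illustrate the SQL route suggested here, the following is a minimal sketch only. It uses SQLite from the Python standard library as a stand-in for Spark SQL, so it is not the code from the thread: the table and column names echo the thread, the rows are invented, and `strftime` replaces Spark's `year()` and `month()` functions, which SQLite does not have. The point is that `SELECT DISTINCT ... ORDER BY` deduplicates and sorts in a single statement.

```python
import sqlite3

# Stand-in for Spark SQL: an in-memory SQLite database.
# Table and column names echo the thread; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE joint_accounts (transactiondate TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO joint_accounts VALUES (?, ?)",
    [
        ("2019-11-03", 10.0),
        ("2019-02-14", 25.0),
        ("2018-12-30", 5.0),
        ("2019-11-20", 7.5),
        ("2019-02-01", 12.0),
    ],
)

# DISTINCT removes duplicate (Year, Month) pairs and ORDER BY sorts the
# deduplicated result. SQLite has no year()/month(), so strftime is used.
query = """
    SELECT DISTINCT
        CAST(strftime('%Y', transactiondate) AS INTEGER) AS Year,
        CAST(strftime('%m', transactiondate) AS INTEGER) AS Month
    FROM joint_accounts
    ORDER BY Year, Month
"""
rows = conn.execute(query).fetchall()
conn.close()

for year, month in rows:
    print(year, month)
```

In Spark SQL the same shape of query would presumably use `year(transactiondate)` and `month(transactiondate)` against a registered view; that variant is not shown or tested here.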

Re: OrderBy Year and Month is not displaying correctly

2020-01-06 Thread Enrico Minack
The distinct transformation does not preserve order; you need to apply distinct first, then orderBy. Enrico On 06.01.20 at 00:39, Mich Talebzadeh wrote: Hi, I am working out monthly outgoing etc. from an account and I am using the following code: import org.apache.spark.sql.expressions.Window

Re: [pyspark2.4+] A lot of tasks failed, but job eventually completes

2020-01-06 Thread Enrico Minack
Note that repartitioning helps to increase the number of partitions (and hence to reduce the size of partitions and required executor memory), but subsequent transformations like join will repartition data again with the configured number of partitions (spark.sql.shuffle.partitions),
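
The arithmetic behind this note can be sketched in plain Python. This is a back-of-the-envelope illustration, not Spark code: the helper `avg_partition_mb`, the 100 GB data size, and the 2000-partition repartition are hypothetical, and it assumes evenly distributed data and a join output about the same size as its input. The only Spark fact used is that `spark.sql.shuffle.partitions` defaults to 200.

```python
def avg_partition_mb(total_mb: float, num_partitions: int) -> float:
    """Average partition size, assuming data is spread evenly (hypothetical helper)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return total_mb / num_partitions


total_mb = 100_000                 # hypothetical dataset of roughly 100 GB
repartitioned = 2000               # hypothetical explicit repartition count
default_shuffle_partitions = 200   # Spark's default for spark.sql.shuffle.partitions

# After an explicit repartition, each task handles a small slice.
after_repartition = avg_partition_mb(total_mb, repartitioned)

# A later join shuffles again into spark.sql.shuffle.partitions partitions,
# so the per-task slice grows back unless that setting is raised as well.
after_join = avg_partition_mb(total_mb, default_shuffle_partitions)

print(f"after repartition: {after_repartition:.0f} MB per partition")
print(f"after join:        {after_join:.0f} MB per partition")
```

Under these assumptions the join leaves each task with ten times more data than the explicit repartition did, which is why the shuffle setting may need adjusting too, for example via `spark.conf.set("spark.sql.shuffle.partitions", "2000")` in PySpark.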