Hi Rishi,
generally it is better to avoid RDDs if you can and to use the Dataset API
instead. With Datasets (formerly DataFrames) Spark can optimize your query /
tree of transformations, whereas RDDs are opaque to the optimizer. Datasets
also have an optimized memory footprint. Pure Dataset operations provide you
helpful information
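To make the point concrete, here is a minimal sketch (the file path and
column name are assumptions, not from the thread) showing that a Dataset
filter is visible to the optimizer while the equivalent RDD operation is not:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Dataset/DataFrame: Catalyst sees the filter and can push it into the scan.
val df = spark.read.parquet("accounts.parquet")   // hypothetical input file
df.filter(col("amount") > 100).explain()          // plan shows the pushed filter

// RDD: the lambda is an opaque black box, so nothing can be pushed down.
df.rdd.filter(row => row.getAs[Double]("amount") > 100)
```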
Hi Spark Community,
I need help with the following issue. I have been researching it for the
last two weeks, and as a last resort I want to ask the Spark community.
I am running the following code in Spark:

val sparkConf = new SparkConf()
  .setMaster("local[*]")
Thank you Hemant and Enrico. Much appreciated.
Your input really got me closer to the issue: I realized that the tasks were
not getting enough memory, and hence tasks with large partitions kept
failing. I increased executor memory and, at the same time, increased the
number of partitions as well. This made the
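For reference, the two knobs mentioned above can be set together at submit
time; the values and application names below are placeholders, not the ones
used in the thread:

```shell
# Hypothetical values: raise per-executor memory and the shuffle partition count.
spark-submit \
  --executor-memory 8g \
  --conf spark.sql.shuffle.partitions=400 \
  --class com.example.MyApp \
  my-app.jar
```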
> The distinct transformation does not preserve order; you need to distinct
> first, then orderBy.
Thanks Enrico. You are correct. Worked fine!
joint_accounts
  .select(year(col("transactiondate")).as("Year"),
          month(col("transactiondate")).as("Month"),
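A minimal sketch of the distinct-first pattern from Enrico's advice (the
DataFrame and column names come from the snippet above; the rest is assumed):

```scala
import org.apache.spark.sql.functions.{col, year, month}

// distinct does not preserve order, so sort only after deduplicating.
val result = joint_accounts
  .select(year(col("transactiondate")).as("Year"),
          month(col("transactiondate")).as("Month"))
  .distinct()
  .orderBy(col("Year"), col("Month"))
```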
Or just use SQL, which is less verbose, easily readable, and takes care of
all such scenarios. For some reason I have found that people using the
DataFrame API have a perception that using SQL is less intelligent. But I
think that using less effort to get better output can be a measure of
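For comparison, the same distinct-plus-ordering expressed in SQL (a sketch;
it assumes the DataFrame has been registered as a temp view first):

```scala
// Register the DataFrame as a temp view, then let SQL handle the rest.
joint_accounts.createOrReplaceTempView("joint_accounts")
val result = spark.sql("""
  SELECT DISTINCT year(transactiondate)  AS Year,
                  month(transactiondate) AS Month
  FROM joint_accounts
  ORDER BY Year, Month
""")
```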
The distinct transformation does not preserve order; you need to
distinct first, then orderBy.
Enrico
On 06.01.20 at 00:39, Mich Talebzadeh wrote:
Hi,
I am working out monthly outgoing etc from an account and I am using
the following code
import org.apache.spark.sql.expressions.Window
Note that repartitioning helps to increase the number of partitions (and
hence to reduce the size of each partition and the executor memory
required), but subsequent transformations like join will repartition the
data again with the configured number of shuffle partitions
(spark.sql.shuffle.partitions),
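A short sketch of that behaviour (the DataFrame names and join key are
assumptions for illustration):

```scala
// An explicit repartition is overridden at the next shuffle stage.
spark.conf.set("spark.sql.shuffle.partitions", "400")
val wide = df1.repartition(1000)   // 1000 partitions here...
val joined = wide.join(df2, "id")  // ...but the join reshuffles to 400
                                   // (adaptive execution may coalesce further)
```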