[no subject]

2019-07-21 Thread Hieu Nguyen
Hi Spark community, I just found out that in https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.fullOuterJoin, the documentation says "Perform a right outer join of self and other." It should be a full outer join, not a right outer join, as shown in the example and the
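A minimal sketch of the correct semantics, using the equivalent Scala RDD API with made-up data (the report concerns the PySpark docstring, but the behavior is the same): keys from either side survive the join, which a right outer join would not allow.

    import org.apache.spark.sql.SparkSession

    object FullOuterJoinDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
        val sc = spark.sparkContext

        val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
        val right = sc.parallelize(Seq(("b", 3), ("c", 4)))

        // Prints ("a",(Some(1),None)), ("b",(Some(2),Some(3))), ("c",(None,Some(4))).
        // A right outer join would drop "a"; a full outer join keeps it.
        left.fullOuterJoin(right).collect().foreach(println)

        spark.stop()
      }
    }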

Spark dataset.cache is not thread-safe

2019-07-21 Thread Amit Sharma
Hi, I wrote code in a Future block which reads data from a dataset and caches it for use later in the code. I hit an issue where the data.cache() data gets replaced by a concurrently running thread. Is there any way we can avoid this condition? val dailyData = callDetailsDS.collect.toList val
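One way to avoid the race, sketched below under the assumption that callDetailsDS is an existing Dataset as in the original post (a stand-in is used here so the sketch runs): cache and materialize the dataset once on a single thread before any Future reads it, so concurrent threads share the same cached blocks instead of racing to (re)cache them.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.SparkSession

    object SafeCacheSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("cache-demo").getOrCreate()
        import spark.implicits._

        // Stand-in for the callDetailsDS from the original post.
        val callDetailsDS = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()

        // Cache and force materialization once, before any Future touches it.
        val cached = callDetailsDS.cache()
        cached.count()

        // Concurrent readers now reuse the already-materialized cache.
        val f1 = Future { cached.filter(_._2 > 1).count() }
        val f2 = Future { cached.map(_._1).distinct().count() }
        println(Await.result(f1, Duration.Inf))
        println(Await.result(f2, Duration.Inf))

        spark.stop()
      }
    }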

Re: Long-Running Spark application doesn't clean old shuffle data correctly

2019-07-21 Thread Keith Chapman
Hi Alex, Shuffle files in Spark are deleted when the object holding a reference to the shuffle file on disk goes out of scope (is garbage collected by the JVM). Could it be the case that you are keeping these objects alive? Regards, Keith. http://keith-chapman.com On Sun, Jul 21, 2019 at
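A minimal sketch of the anti-pattern Keith describes, with hypothetical names: holding finished RDDs in long-lived driver-side state keeps their lineage reachable, so the ContextCleaner never deletes the corresponding shuffle files.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import scala.collection.mutable

    object ShuffleLeakSketch {
      // Long-lived driver-side state: anything stored here stays reachable,
      // so the ContextCleaner never removes its shuffle files from disk.
      val completed = mutable.ListBuffer[RDD[(Int, Long)]]()

      def runQuickJob(sc: SparkContext): Long = {
        val counts = sc.parallelize(1 to 1000)
          .map(i => (i % 10, 1L))
          .reduceByKey(_ + _)     // this stage writes shuffle files
        val n = counts.count()
        // completed += counts    // anti-pattern: pins the shuffle files forever
        n                         // return only the result; `counts` becomes unreachable
      }
    }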

Re: Spark SaveMode

2019-07-21 Thread Mich Talebzadeh
I dug up some of my old work using Spark as an ETL tool. Regarding the question "Any reason why Spark's SaveMode doesn't have a mode that ignores any Primary Key/Unique constraint violations?": there is no way Spark can determine whether a PK constraint is violated until it receives such a message from Oracle through
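Since SaveMode offers no "ignore constraint violations" mode, one common workaround is to anti-join the incoming rows against the keys already present in the target table before appending. The sketch below assumes a hypothetical Oracle table target_table with primary key column id, and hypothetical connection details; note it is not atomic, so a small race window remains between the read and the write.

    import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

    def appendNewRowsOnly(spark: SparkSession, incoming: DataFrame): Unit = {
      val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCL"  // hypothetical
      val props = new java.util.Properties()
      props.setProperty("user", "scott")                    // hypothetical
      props.setProperty("password", "tiger")

      // Fetch only the primary-key column of the existing rows.
      val existingKeys = spark.read.jdbc(jdbcUrl, "(SELECT id FROM target_table) t", props)

      incoming
        .join(existingKeys, Seq("id"), "left_anti")         // drop rows whose PK already exists
        .write
        .mode(SaveMode.Append)
        .jdbc(jdbcUrl, "target_table", props)
    }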

Re: Long-Running Spark application doesn't clean old shuffle data correctly

2019-07-21 Thread Alex Landa
Thanks, I looked into these options. The cleaner's periodic interval is set to 30 min by default. The blocking option for shuffle - *spark.cleaner.referenceTracking.blocking.shuffle* - is set to false by default. What are the implications of setting it to true? Will it make the driver slower? Thanks,

Re: Long-Running Spark application doesn't clean old shuffle data correctly

2019-07-21 Thread Aayush Ranaut
This is the job of the ContextCleaner. There are a few properties that you can tweak to see if that helps: spark.cleaner.periodicGC.interval, spark.cleaner.referenceTracking, spark.cleaner.referenceTracking.blocking.shuffle. Regards, Prathmesh Ranaut > On Jul 21, 2019, at 11:36 AM, Prathmesh Ranaut
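For reference, a minimal sketch of setting these properties when building the session; the values are illustrative, and the defaults noted in the comments match the ones discussed in this thread.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("long-running-app")
      .config("spark.cleaner.periodicGC.interval", "15min")                // default: 30min
      .config("spark.cleaner.referenceTracking", "true")                   // default: true
      .config("spark.cleaner.referenceTracking.blocking.shuffle", "true")  // default: false
      .getOrCreate()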

Long-Running Spark application doesn't clean old shuffle data correctly

2019-07-21 Thread Alex Landa
Hi, We are running a long-running Spark application (which executes lots of quick jobs using our scheduler) on a Spark 2.4.0 standalone cluster. We see that old shuffle files (a week old, for example) are not deleted during the execution of the application, which leads to out of disk space