I have looked through the logs and do not see any WARNING or ERROR messages; the executors just seem to stop logging.
I am running Spark 1.5.2 on YARN.

On Dec 7, 2015, at 1:20 PM, Ted Yu <[email protected]> wrote:

bq. complete a shuffle stage due to lost executors

Have you taken a look at the log for the lost executor(s)?
Which release of Spark are you using?

Cheers

On Mon, Dec 7, 2015 at 10:12 AM, <[email protected]> wrote:

I have a pyspark app loading a large-ish (100GB) dataframe from JSON files, and it turns out there are a number of duplicate JSON objects in the source data. I am trying to find the best way to remove these duplicates before using the dataframe.

With both df.dropDuplicates() and df.sqlContext.sql('''SELECT DISTINCT *…''') the application is not able to complete a shuffle stage due to lost executors.

Is there a more efficient way to remove these duplicate rows? If not, what settings can I tweak to help this succeed? I have tried both increasing and decreasing the number of default shuffle partitions (to 500 and 100, respectively) but neither changes the behavior.
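
For illustration only, here is a minimal sketch of one alternative to deduplicating on every column: reduce each JSON object to a fixed-size digest of a canonical form and deduplicate on that key. The function names canonical_key and dedupe_lines are hypothetical, the sketch assumes one JSON object per line, and it runs locally in plain Python. Whether the same idea avoids the lost executors on the 100GB dataset is untested.

```python
import hashlib
import json


def canonical_key(line):
    """Return a fixed-size digest that identifies a JSON object regardless of key order."""
    obj = json.loads(line)
    # sort_keys plus compact separators give one canonical text form per object
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def dedupe_lines(lines):
    """Keep the first occurrence of each distinct JSON object (local, in-memory illustration)."""
    seen = set()
    unique = []
    for line in lines:
        key = canonical_key(line)
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique
```

On a cluster, the same key function could in principle be applied to the raw text before building the dataframe, for example sc.textFile(path).map(lambda l: (canonical_key(l), l)).reduceByKey(lambda a, b: a).values(), where path is a placeholder for the source location. That still shuffles the data, so it is a variation to try rather than a guaranteed fix.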
