Here is the trace I get from the command line:

[Stage 4:================> (60 + 60) / 200]
15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 10.0.0.138:33822
15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 10.0.0.138:33822
15/12/07 18:59:40 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkYarnAM@10.0.0.138:33822] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/12/07 18:59:41 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@ip-10-0-0-138.ec2.internal:54951] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/12/07 18:59:41 ERROR YarnScheduler: Lost executor 3 on ip-10-0-0-138.ec2.internal: remote Rpc client disassociated
15/12/07 18:59:41 WARN TaskSetManager: Lost task 62.0 in stage 4.0 (TID 2003, ip-10-0-0-138.ec2.internal): ExecutorLostFailure (executor 3 lost)
15/12/07 18:59:41 WARN TaskSetManager: Lost task 65.0 in stage 4.0 (TID 2006, ip-10-0-0-138.ec2.internal): ExecutorLostFailure (executor 3 lost)
…
On Dec 7, 2015, at 1:33 PM, Cramblit, Ross (Reuters News) <ross.cramb...@thomsonreuters.com> wrote:

I have looked through the logs and do not see any WARNINGs or ERRORs -- the executors just seem to stop logging. I am running Spark 1.5.2 on YARN.

On Dec 7, 2015, at 1:20 PM, Ted Yu <yuzhih...@gmail.com> wrote:

bq. complete a shuffle stage due to lost executors

Have you taken a look at the log for the lost executor(s)? Which release of Spark are you using?

Cheers

On Mon, Dec 7, 2015 at 10:12 AM, <ross.cramb...@thomsonreuters.com> wrote:

I have a pyspark app loading a large-ish (100GB) dataframe from JSON files, and it turns out there are a number of duplicate JSON objects in the source data. I am trying to find the best way to remove these duplicates before using the dataframe. With both df.dropDuplicates() and df.sqlContext.sql('''SELECT DISTINCT * …''') the application is not able to complete a shuffle stage due to lost executors. Is there a more efficient way to remove these duplicate rows? If not, what settings can I tweak to help this succeed? I have tried both increasing and decreasing the number of default shuffle partitions (to 500 and 100, respectively), but neither changed the behavior.