[jira] [Commented] (SPARK-26907) Does ShuffledRDD Replication Work With External Shuffle Service

2019-02-24 Thread Han Altae-Tran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776150#comment-16776150
 ] 

Han Altae-Tran commented on SPARK-26907:


Ok, thank you, I will try the mailing list.

I think my main point is that Spark could be improved for use with preemptible 
virtual machines if shuffle files could be replicated across the cluster. In my 
experience, whenever a job includes a shuffle map stage, a single preempted node 
can cause the entire stage to be retried, which costs a large amount of uptime 
because all remaining tasks fail until the retry is initiated. Using persist with 
replication doesn't seem to help here, so I figured there is an optimization 
around shuffle files that could be made for this use case.
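
For reference, here is a minimal PySpark sketch of the persist-with-replication 
pattern described above (the RDD and key function are placeholders, not code from 
the original jobs):
{code:python}
import pyspark

# Any pipeline with a wide dependency: groupBy introduces a shuffle map stage.
events = sc.range(10**8)
grouped = events.groupBy(lambda x: x % 1000)

# Replicate the *post-shuffle* blocks on disk (2x). This protects the grouped
# output once it is materialized, but the intermediate shuffle files written by
# the map tasks are not replicated, so a preempted mapper node can still force
# a stage retry while the reduce side is fetching.
grouped.persist(pyspark.StorageLevel.DISK_ONLY_2)
grouped.count()
{code}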

> Does ShuffledRDD Replication Work With External Shuffle Service
> ---
>
> Key: SPARK-26907
> URL: https://issues.apache.org/jira/browse/SPARK-26907
> Project: Spark
>  Issue Type: Question
>  Components: Block Manager, YARN
>Affects Versions: 2.3.2
>Reporter: Han Altae-Tran
>Priority: Major
>
> I am interested in working with high-replication environments for extreme 
> fault tolerance (e.g. 10x replication), but have noticed that when using 
> groupBy or groupWith followed by persist (with 10x replication), losing even 
> a single node can cause the entire stage to fail with FetchFailedException.
>  
> Is this because the External Shuffle Service writes and serves intermediate 
> shuffle data only to/from the local disk attached to the executor that 
> generated it, causing Spark to ignore possibly replicated shuffle data (from 
> the persist) that may be served elsewhere? If so, is there any way to 
> increase the replication factor of the External Shuffle Service to make it 
> fault-tolerant?
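
For context, the external shuffle service is enabled with standard Spark 2.3 
properties such as the ones sketched below; as far as I can tell there is no 
companion property that replicates the shuffle files themselves, which is exactly 
what this question is about:
{code:python}
# Sketch only: standard properties for running with the external shuffle
# service on YARN (the NodeManager aux-service must also be configured).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shuffle-replication-question")
        # Executors register their map output with the node-local shuffle
        # service, so the files stay readable after an executor exits...
        .set("spark.shuffle.service.enabled", "true")
        # ...which is what dynamic allocation relies on.
        .set("spark.dynamicAllocation.enabled", "true"))

sc = SparkContext(conf=conf)

# The shuffle files still live only on the local disks of the node that wrote
# them; losing that node invalidates its map output and triggers a stage retry.
{code}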






[jira] [Created] (SPARK-26907) Does ShuffledRDD Replication Work With External Shuffle Service

2019-02-17 Thread Han Altae-Tran (JIRA)
Han Altae-Tran created SPARK-26907:
--

 Summary: Does ShuffledRDD Replication Work With External Shuffle 
Service
 Key: SPARK-26907
 URL: https://issues.apache.org/jira/browse/SPARK-26907
 Project: Spark
  Issue Type: Question
  Components: Block Manager, YARN
Affects Versions: 2.3.2
Reporter: Han Altae-Tran


I am interested in working with high-replication environments for extreme fault 
tolerance (e.g. 10x replication), but have noticed that when using groupBy or 
groupWith followed by persist (with 10x replication), losing even a single node 
can cause the entire stage to fail with FetchFailedException.

 

Is this because the External Shuffle Service writes and serves intermediate 
shuffle data only to/from the local disk attached to the executor that 
generated it, causing Spark to ignore possibly replicated shuffle data (from 
the persist) that may be served elsewhere? If so, is there any way to 
increase the replication factor of the External Shuffle Service to make it 
fault-tolerant?
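
As an illustration of the setup described above (a sketch; the RDD and key 
function are placeholders), a 10x-replicated storage level has to be built by 
hand, since the built-in levels only go up to 2x:
{code:python}
import pyspark

# StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
DISK_ONLY_10 = pyspark.StorageLevel(True, False, False, False, 10)

rdd = sc.range(10**8)
grouped = rdd.groupBy(lambda x: x % 1000)   # groupBy introduces a shuffle
grouped.persist(DISK_ONLY_10)               # replicates the persisted blocks only
grouped.count()

# Even with 10x-replicated persisted blocks, the map-side shuffle files exist
# only on the disks of the executors that wrote them, so losing one of those
# nodes during the fetch can still surface as a FetchFailedException.
{code}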






[jira] [Updated] (SPARK-26906) Pyspark RDD Replication Potentially Not Working

2019-02-16 Thread Han Altae-Tran (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Altae-Tran updated SPARK-26906:
---
Attachment: spark_ui.png

> Pyspark RDD Replication Potentially Not Working
> ---
>
> Key: SPARK-26906
> URL: https://issues.apache.org/jira/browse/SPARK-26906
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 2.3.2
> Environment: I am using Google Cloud Dataproc [1.3.19-deb9 
> 2018/12/14|https://cloud.google.com/dataproc/docs/release-notes#december_14_2018]
>  (Spark 2.3.2, Hadoop 2.9.0) on Debian 9 with Python 3.7. The PySpark shell is 
> started with pyspark --num-executors=100
>Reporter: Han Altae-Tran
>Priority: Minor
> Attachments: spark_ui.png
>
>
> PySpark RDD replication doesn't seem to be functioning properly. Even with a 
> simple example, the UI reports only 1x replication, despite requesting a 
> storage level with 2x replication:
> {code:python}
> import pyspark
> rdd = sc.range(10**9)
> mapped = rdd.map(lambda x: x)
> mapped.persist(pyspark.StorageLevel.DISK_ONLY_2)  # PythonRDD[1] at RDD at PythonRDD.scala:52
> mapped.count()
> {code}
> Interestingly, if you catch the UI page at just the right time, you see that 
> it starts off 2x replicated but ends up 1x replicated afterward. Perhaps the 
> RDD is replicated and it is just the UI that fails to register this.






[jira] [Updated] (SPARK-26906) Pyspark RDD Replication Potentially Not Working

2019-02-16 Thread Han Altae-Tran (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Han Altae-Tran updated SPARK-26906:
---
   Priority: Minor  (was: Major)
Description: 
PySpark RDD replication doesn't seem to be functioning properly. Even with a 
simple example, the UI reports only 1x replication, despite requesting a storage 
level with 2x replication:
{code:python}
import pyspark

rdd = sc.range(10**9)
mapped = rdd.map(lambda x: x)
mapped.persist(pyspark.StorageLevel.DISK_ONLY_2)  # PythonRDD[1] at RDD at PythonRDD.scala:52
mapped.count()
{code}

Interestingly, if you catch the UI page at just the right time, you see that it 
starts off 2x replicated but ends up 1x replicated afterward. Perhaps the RDD 
is replicated and it is just the UI that fails to register this.

  was:
PySpark RDD replication doesn't seem to be functioning properly. Even with a 
simple example, the UI reports only 1x replication, despite requesting a storage 
level with 2x replication:
{code:python}
import pyspark

rdd = sc.range(10**9)
mapped = rdd.map(lambda x: x)
mapped.persist(pyspark.StorageLevel.DISK_ONLY_2)  # PythonRDD[1] at RDD at PythonRDD.scala:52
mapped.count()
{code}
 

resulting in the following:

!image-2019-02-17-01-33-08-551.png!

 

Interestingly, if you catch the UI page at just the right time, you see that it 
starts off 2x replicated:

 

!image-2019-02-17-01-35-37-034.png!

 

but ends up going back to 1x replicated once the RDD is fully materialized. 
This is likely not a UI bug because the cached partitions page also shows only 
1x replication:

 

!image-2019-02-17-01-36-55-418.png!

 

This could result from some type of optimization around replication, but it is 
undesirable for users who want a specific level of replication for fault 
tolerance.

Summary: Pyspark RDD Replication Potentially Not Working  (was: Pyspark 
RDD Replication Not Working)

> Pyspark RDD Replication Potentially Not Working
> ---
>
> Key: SPARK-26906
> URL: https://issues.apache.org/jira/browse/SPARK-26906
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Web UI
>Affects Versions: 2.3.2
> Environment: I am using Google Cloud Dataproc [1.3.19-deb9 
> 2018/12/14|https://cloud.google.com/dataproc/docs/release-notes#december_14_2018]
>  (Spark 2.3.2, Hadoop 2.9.0) on Debian 9 with Python 3.7. The PySpark shell is 
> started with pyspark --num-executors=100
>Reporter: Han Altae-Tran
>Priority: Minor
>
> PySpark RDD replication doesn't seem to be functioning properly. Even with a 
> simple example, the UI reports only 1x replication, despite requesting a 
> storage level with 2x replication:
> {code:python}
> import pyspark
> rdd = sc.range(10**9)
> mapped = rdd.map(lambda x: x)
> mapped.persist(pyspark.StorageLevel.DISK_ONLY_2)  # PythonRDD[1] at RDD at PythonRDD.scala:52
> mapped.count()
> {code}
> Interestingly, if you catch the UI page at just the right time, you see that 
> it starts off 2x replicated but ends up 1x replicated afterward. Perhaps the 
> RDD is replicated and it is just the UI that fails to register this.






[jira] [Created] (SPARK-26906) Pyspark RDD Replication Not Working

2019-02-16 Thread Han Altae-Tran (JIRA)
Han Altae-Tran created SPARK-26906:
--

 Summary: Pyspark RDD Replication Not Working
 Key: SPARK-26906
 URL: https://issues.apache.org/jira/browse/SPARK-26906
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Web UI
Affects Versions: 2.3.2
 Environment: I am using Google Cloud Dataproc [1.3.19-deb9 
2018/12/14|https://cloud.google.com/dataproc/docs/release-notes#december_14_2018]
 (Spark 2.3.2, Hadoop 2.9.0) on Debian 9 with Python 3.7. The PySpark shell is 
started with pyspark --num-executors=100
Reporter: Han Altae-Tran


PySpark RDD replication doesn't seem to be functioning properly. Even with a 
simple example, the UI reports only 1x replication, despite requesting a storage 
level with 2x replication:
{code:python}
import pyspark

rdd = sc.range(10**9)
mapped = rdd.map(lambda x: x)
mapped.persist(pyspark.StorageLevel.DISK_ONLY_2)  # PythonRDD[1] at RDD at PythonRDD.scala:52
mapped.count()
{code}
 

resulting in the following:

!image-2019-02-17-01-33-08-551.png!

 

Interestingly, if you catch the UI page at just the right time, you see that it 
starts off 2x replicated:

 

!image-2019-02-17-01-35-37-034.png!

 

but ends up going back to 1x replicated once the RDD is fully materialized. 
This is likely not a UI bug because the cached partitions page also shows only 
1x replication:

 

!image-2019-02-17-01-36-55-418.png!

 

This could result from some type of optimization around replication, but it is 
undesirable for users who want a specific level of replication for fault 
tolerance.
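
As a sanity check, one can also ask the RDD for its storage level from the driver; 
note that this sketch only confirms the level that was requested on the Python 
side, not what the block manager actually stored:
{code:python}
import pyspark

rdd = sc.range(10**9)
mapped = rdd.map(lambda x: x)
mapped.persist(pyspark.StorageLevel.DISK_ONLY_2)
mapped.count()

# getStorageLevel() reports the level attached to the RDD and should show
# 2x replication here; the Storage tab of the UI, by contrast, reflects the
# blocks the block manager reports, which is where the 1x vs. 2x discrepancy
# appears.
print(mapped.getStorageLevel())
{code}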


