[jira] [Updated] (SPARK-3562) Periodic cleanup event logs

2014-09-17 Thread xukun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xukun updated SPARK-3562:
-
Summary: Periodic cleanup event logs  (was: Periodic cleanup)

 Periodic cleanup event logs
 ---

 Key: SPARK-3562
 URL: https://issues.apache.org/jira/browse/SPARK-3562
 Project: Spark
  Issue Type: New Feature
  Components: core
Affects Versions: 1.1.0
Reporter: xukun

  If we run Spark applications frequently, many Spark event logs are written 
 into spark.eventLog.dir. After a long time, spark.eventLog.dir will contain 
 many event logs that we no longer care about. A periodic cleanup would ensure 
 that logs older than a configured duration are forgotten, so there is no need 
 to clean up the logs by hand.
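
To make the proposal concrete, here is a minimal sketch of the kind of periodic 
cleanup intended. The directory path and the seven-day retention below are 
made-up placeholders; a real implementation would presumably read 
spark.eventLog.dir and a TTL from SparkConf and also handle HDFS paths.

{code}
import java.io.File
import java.util.concurrent.{Executors, TimeUnit}

object EventLogCleanerSketch {
  // Hypothetical settings; the actual feature would take them from Spark configuration.
  val eventLogDir = new File("/tmp/spark-events")   // stands in for spark.eventLog.dir
  val maxAgeMs = TimeUnit.DAYS.toMillis(7)          // logs older than this are forgotten

  // Delete event logs whose last modification time is older than the retention period.
  def cleanOldLogs(): Unit = {
    val cutoff = System.currentTimeMillis() - maxAgeMs
    Option(eventLogDir.listFiles()).getOrElse(Array.empty[File])
      .filter(_.lastModified() < cutoff)
      .foreach(_.delete())
  }

  def main(args: Array[String]): Unit = {
    // Run the cleanup once a day so nobody has to clean the directory by hand.
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable { def run(): Unit = cleanOldLogs() },
      0L, 1L, TimeUnit.DAYS)
  }
}
{code}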






[jira] [Updated] (SPARK-3563) Shuffle data not always be cleaned

2014-09-17 Thread shenhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shenhong updated SPARK-3563:

Affects Version/s: 1.0.2

 Shuffle data not always be cleaned
 --

 Key: SPARK-3563
 URL: https://issues.apache.org/jira/browse/SPARK-3563
 Project: Spark
  Issue Type: Bug
  Components: core
Affects Versions: 1.0.2
Reporter: shenhong








[jira] [Updated] (SPARK-3563) Shuffle data not always be cleaned

2014-09-17 Thread shenhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shenhong updated SPARK-3563:

Description: 
In our cluster, when we run a Spark Streaming job for many hours, not all of 
the shuffle data seems to be cleaned up. Here is the shuffle data:
-rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
-rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
-rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
-rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
-rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
-rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
-rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
-rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
-rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
-rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
-rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
-rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
-rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1

The shuffle data of stage 63, 65, 84, 132... are not cleaned.


 Shuffle data not always be cleaned
 --

 Key: SPARK-3563
 URL: https://issues.apache.org/jira/browse/SPARK-3563
 Project: Spark
  Issue Type: Bug
  Components: core
Affects Versions: 1.0.2
Reporter: shenhong

 In our cluster, when we run a Spark Streaming job for many hours, not all of 
 the shuffle data seems to be cleaned up. Here is the shuffle data:
 -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
 -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
 -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
 -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
 -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
 -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
 -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
 -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
 -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
 -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
 -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
 -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
 -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1
 The shuffle data of stage 63, 65, 84, 132... are not cleaned.






[jira] [Updated] (SPARK-3563) Shuffle data not always be cleaned

2014-09-17 Thread shenhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shenhong updated SPARK-3563:

Description: 
In our cluster, when we run a Spark Streaming job for many hours, not all of 
the shuffle data seems to be cleaned up. Here is the shuffle data:
-rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
-rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
-rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
-rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
-rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
-rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
-rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
-rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
-rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
-rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
-rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
-rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
-rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1

The shuffle data of stage 63, 65, 84, 132... are not cleaned.

ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and 
Broadcast of interest, to be processed when the associated object goes out of 
scope of the application. The actual cleanup is performed in a separate daemon 
thread.

There must still be some reference to the ShuffleDependency, and it is hard to 
find out where it is held.


  was:
In our cluster, when we run a spark streaming job, after running for many 
hours, the shuffle data seems not all be clear, there is the shuffle data:
-rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
-rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
-rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
-rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
-rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
-rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
-rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
-rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
-rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
-rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
-rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
-rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
-rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1

The shuffle data of stage 63, 65, 84, 132... are not cleaned.



 Shuffle data not always be cleaned
 --

 Key: SPARK-3563
 URL: https://issues.apache.org/jira/browse/SPARK-3563
 Project: Spark
  Issue Type: Bug
  Components: core
Affects Versions: 1.0.2
Reporter: shenhong

 In our cluster, when we run a Spark Streaming job for many hours, not all of 
 the shuffle data seems to be cleaned up. Here is the shuffle data:
 -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
 -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
 -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
 -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
 -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
 -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
 -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
 -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
 -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
 -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
 -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
 -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
 -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1
 The shuffle data of stage 63, 65, 84, 132... are not cleaned.
 ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and 
 Broadcast of interest, to be processed when the associated object goes out of 
 scope of the application. The actual cleanup is performed in a separate daemon 
 thread.
 There must still be some reference to the ShuffleDependency, and it is hard to 
 find out where it is held.
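
For readers unfamiliar with the mechanism described above, here is a minimal 
sketch of the weak-reference-plus-daemon-thread pattern. It is illustrative 
only: the names are invented and this is not Spark's actual ContextCleaner. It 
also shows why a surviving strong reference to a ShuffleDependency keeps its 
files alive: the reference is never enqueued, so cleanup never runs for that 
shuffle.

{code}
import java.lang.ref.{ReferenceQueue, WeakReference}
import scala.collection.mutable

// A WeakReference that remembers which shuffle it tracks, since the referent
// itself is already gone by the time the reference is dequeued.
class CleanupRef(val shuffleId: Int, referent: AnyRef, q: ReferenceQueue[AnyRef])
  extends WeakReference[AnyRef](referent, q)

object SketchCleaner {
  private val queue = new ReferenceQueue[AnyRef]()
  // The references themselves must stay strongly reachable, or they would be collected too.
  private val refs = mutable.Set[CleanupRef]()

  def registerShuffle(dep: AnyRef, shuffleId: Int): Unit =
    refs.synchronized { refs += new CleanupRef(shuffleId, dep, queue) }

  // Separate daemon thread: blocks until the GC decides a registered object is
  // unreachable; in the real cleaner the shuffle files would then be removed.
  private val cleaner = new Thread("sketch-cleanup-thread") {
    override def run(): Unit = while (true) {
      val ref = queue.remove().asInstanceOf[CleanupRef]
      refs.synchronized { refs -= ref }
      println(s"would remove shuffle files for shuffle ${ref.shuffleId}")
    }
  }
  cleaner.setDaemon(true)
  cleaner.start()
}
{code}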






[jira] [Commented] (SPARK-3560) In yarn-cluster mode, jars are distributed through multiple mechanisms.

2014-09-17 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136882#comment-14136882
 ] 

Sandy Ryza commented on SPARK-3560:
---

Right.  I believe Min from LinkedIn who discovered the issue is planning to 
submit a patch.

 In yarn-cluster mode, jars are distributed through multiple mechanisms.
 ---

 Key: SPARK-3560
 URL: https://issues.apache.org/jira/browse/SPARK-3560
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Sandy Ryza
Priority: Critical

 In yarn-cluster mode, jars given to spark-submit's --jars argument should be 
 distributed to executors through the distributed cache, not through fetching.
 Currently, Spark tries to distribute the jars both ways, which can cause 
 executor errors related to trying to overwrite symlinks without write 
 permissions.
 It looks like this was introduced by SPARK-2260, which sets spark.jars in 
 yarn-cluster mode.  Setting spark.jars is necessary for standalone cluster 
 deploy mode, but harmful for yarn cluster deploy mode.






[jira] [Updated] (SPARK-3563) Shuffle data not always be cleaned

2014-09-17 Thread shenhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shenhong updated SPARK-3563:

Description: 
In our cluster, when we run a Spark Streaming job for many hours, not all of 
the shuffle data seems to be cleaned up. Here is the shuffle data:
-rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
-rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
-rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
-rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
-rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
-rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
-rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
-rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
-rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
-rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
-rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
-rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
-rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1

The shuffle data of stage 63, 65, 84, 132... are not cleaned.

ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and 
Broadcast of interest, to be processed when the associated object goes out of 
scope of the application. The actual cleanup is performed in a separate daemon 
thread.

There must still be some reference to the ShuffleDependency, and it is hard to 
find out where it is held.


  was:
In our cluster, when we run a spark streaming job, after running for many 
hours, the shuffle data seems not all be clear, there is the shuffle data:
-rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
-rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
-rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
-rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
-rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
-rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
-rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
-rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
-rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
-rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
-rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
-rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
-rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1

The shuffle data of stage 63, 65, 84, 132... are not cleaned.

In ContextCleaner, it maintains a weak reference for each RDD, 
ShuffleDependency, and Broadcast of interest,  to be processed when the 
associated object goes out of scope of the application. Actual  cleanup is 
performed in a separate daemon thread. 

There must be some  reference for ShuffleDependency , and it's hard to find out.



 Shuffle data not always be cleaned
 --

 Key: SPARK-3563
 URL: https://issues.apache.org/jira/browse/SPARK-3563
 Project: Spark
  Issue Type: Bug
  Components: core
Affects Versions: 1.0.2
Reporter: shenhong

 In our cluster, when we run a Spark Streaming job for many hours, not all of 
 the shuffle data seems to be cleaned up. Here is the shuffle data:
 -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
 -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
 -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
 -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
 -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
 -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
 -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
 -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
 -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
 -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
 -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
 -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
 -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1
 The shuffle data of stage 63, 65, 84, 132... are not cleaned.
 ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and 
 Broadcast of interest, to be processed when the associated object goes out of 
 scope of the application. The actual cleanup is performed in a separate daemon 
 thread.
 There must still be some reference to the ShuffleDependency, and it is hard to 
 find out where it is held.






[jira] [Updated] (SPARK-3563) Shuffle data not always be cleaned

2014-09-17 Thread shenhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shenhong updated SPARK-3563:

Description: 
In our cluster, when we run a Spark Streaming job for many hours, not all of 
the shuffle data seems to be cleaned up. Here is the shuffle data:
-rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
-rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
-rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
-rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
-rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
-rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
-rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
-rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
-rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
-rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
-rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
-rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
-rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1

The shuffle data of stage 63, 65, 84, 132... are not cleaned.

ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and 
Broadcast of interest, to be processed when the associated object goes out of 
scope of the application. The actual cleanup is performed in a separate daemon 
thread.

There must still be some reference to the ShuffleDependency, and it is hard to 
find out where it is held.


  was:
In our cluster, when we run a spark streaming job, after running for many 
hours, the shuffle data seems not all be clear, here is the shuffle data:
-rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
-rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
-rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
-rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
-rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
-rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
-rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
-rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
-rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
-rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
-rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
-rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
-rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1

The shuffle data of stage 63, 65, 84, 132... are not cleaned.

In ContextCleaner, it maintains a weak reference for each RDD, 
ShuffleDependency, and Broadcast of interest,  to be processed when the 
associated object goes out of scope of the application. Actual  cleanup is 
performed in a separate daemon thread. 

There must be some  reference for ShuffleDependency , and it's hard to find out.



 Shuffle data not always be cleaned
 --

 Key: SPARK-3563
 URL: https://issues.apache.org/jira/browse/SPARK-3563
 Project: Spark
  Issue Type: Bug
  Components: core
Affects Versions: 1.0.2
Reporter: shenhong

 In our cluster, when we run a Spark Streaming job for many hours, not all of 
 the shuffle data seems to be cleaned up. Here is the shuffle data:
 -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
 -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
 -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
 -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
 -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
 -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
 -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
 -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
 -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
 -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
 -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
 -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
 -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1
 The shuffle data of stage 63, 65, 84, 132... are not cleaned.
 ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and 
 Broadcast of interest, to be processed when the associated object goes out of 
 scope of the application. The actual cleanup is performed in a separate daemon 
 thread.
 There must still be some reference to the ShuffleDependency, and it is hard to 
 find out where it is held.






[jira] [Commented] (SPARK-3563) Shuffle data not always be cleaned

2014-09-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136899#comment-14136899
 ] 

Sean Owen commented on SPARK-3563:
--

I am no expert, but I believe this is on purpose, in order to reuse the shuffle 
if the RDD partition needs to be recomputed. You may need to set a lower 
spark.cleaner.ttl?
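
For reference, a minimal example of enabling the TTL-based cleaner mentioned 
above; the one-hour value is only illustrative and should be longer than any 
window or retention the job needs:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Example only: turn on the periodic TTL cleaner (value is in seconds).
val conf = new SparkConf()
  .setAppName("streaming-job")
  .set("spark.cleaner.ttl", "3600")
val sc = new SparkContext(conf)
{code}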

 Shuffle data not always be cleaned
 --

 Key: SPARK-3563
 URL: https://issues.apache.org/jira/browse/SPARK-3563
 Project: Spark
  Issue Type: Bug
  Components: core
Affects Versions: 1.0.2
Reporter: shenhong

 In our cluster, when we run a Spark Streaming job for many hours, not all of 
 the shuffle data seems to be cleaned up. Here is the shuffle data:
 -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
 -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
 -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
 -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
 -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
 -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
 -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
 -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
 -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
 -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
 -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
 -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
 -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1
 The shuffle data of stage 63, 65, 84, 132... are not cleaned.
 ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and 
 Broadcast of interest, to be processed when the associated object goes out of 
 scope of the application. The actual cleanup is performed in a separate daemon 
 thread.
 There must still be some reference to the ShuffleDependency, and it is hard to 
 find out where it is held.






[jira] [Commented] (SPARK-3550) Disable automatic rdd caching in python api for relevant learners

2014-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136907#comment-14136907
 ] 

Apache Spark commented on SPARK-3550:
-

User 'OdinLin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2423

 Disable automatic rdd caching in python api for relevant learners
 -

 Key: SPARK-3550
 URL: https://issues.apache.org/jira/browse/SPARK-3550
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Aaron Staple

 The python mllib api automatically caches training rdds. However, the 
 NaiveBayes, ALS, and DecisionTree learners do not require external caching to 
 prevent repeated RDD re-evaluation during learning. NaiveBayes only evaluates 
 its input RDD once, while ALS and DecisionTree internally persist 
 transformations of their input RDDs. For these learners, we should disable 
 the automatic caching in the python mllib api.
 See discussion here:
 https://github.com/apache/spark/pull/2362#issuecomment-55637953






[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-09-17 Thread Evan Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136920#comment-14136920
 ] 

Evan Chan commented on SPARK-2593:
--

[~pwendell] I'd have to agree with Helena and Tupshin;  I'd like to see the
usage of Akka increase; it is underutilized.  Also note that it's not true
that Akka is only used for RPC;  there are Actors used in several places.

Performance and resilience would likely improve significantly in several
areas with more usage of Akka;  the current threads spun up by each driver
for various things like a file server are quite wasteful.



 Add ability to pass an existing Akka ActorSystem into Spark
 ---

 Key: SPARK-2593
 URL: https://issues.apache.org/jira/browse/SPARK-2593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Helena Edelson

 As a developer I want to pass an existing ActorSystem into StreamingContext 
 in load-time so that I do not have 2 actor systems running on a node in an 
 Akka application.
 This would mean having spark's actor system on its own named-dispatchers as 
 well as exposing the new private creation of its own actor system.
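
A hypothetical sketch of the entry point being requested. This does not exist 
in Spark today; the extra parameter is invented purely to illustrate the 
feature, and today's StreamingContext constructors do not accept an 
ActorSystem.

{code}
import akka.actor.ActorSystem
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ProposedApiSketch {
  // Imagined overload: reuse the caller's ActorSystem instead of creating a second one.
  def createStreamingContext(existing: ActorSystem): StreamingContext = {
    val conf = new SparkConf().setAppName("shared-actor-system-app")
    new StreamingContext(conf, Seconds(1) /*, existing */)
  }
}
{code}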
   
  






[jira] [Commented] (SPARK-3564) Display App ID on HistoryPage

2014-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136929#comment-14136929
 ] 

Apache Spark commented on SPARK-3564:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2424

 Display App ID on HistoryPage
 -

 Key: SPARK-3564
 URL: https://issues.apache.org/jira/browse/SPARK-3564
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.0
Reporter: Kousuke Saruta

 The current HistoryPage doesn't display the App ID, so if there are lots of 
 applications with the same name, it's difficult to find the application whose 
 status we'd like to know.






[jira] [Commented] (SPARK-3566) .gitignore and .rat-excludes should consider cmd file and Emacs' backup files

2014-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136947#comment-14136947
 ] 

Apache Spark commented on SPARK-3566:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2426

 .gitignore and .rat-excludes should consider cmd file and Emacs' backup files
 -

 Key: SPARK-3566
 URL: https://issues.apache.org/jira/browse/SPARK-3566
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Priority: Minor

 The current .gitignore and .rat-excludes do not consider spark-env.cmd.
 Also, .gitignore doesn't consider Emacs' meta files (backup files, which start 
 and end with #, and lock files, which start with .#) even though it considers 
 vi's meta file (*.swp).






[jira] [Commented] (SPARK-3565) make code consistent with document

2014-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136948#comment-14136948
 ] 

Apache Spark commented on SPARK-3565:
-

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/2427

 make code consistent with document
 --

 Key: SPARK-3565
 URL: https://issues.apache.org/jira/browse/SPARK-3565
 Project: Spark
  Issue Type: Bug
  Components: core
Reporter: WangTaoTheTonic
Priority: Minor

 The configuration item that represents the default number of retries when 
 binding to a port is spark.ports.maxRetries in the code, while the document 
 configuration.md calls it spark.port.maxRetries. We need to make them 
 consistent.
  
 In org.apache.spark.util.Utils.scala:
   /**
    * Default number of retries in binding to a port.
    */
   val portMaxRetries: Int = {
     if (sys.props.contains("spark.testing")) {
       // Set a higher number of retries for tests...
       sys.props.get("spark.port.maxRetries").map(_.toInt).getOrElse(100)
     } else {
       Option(SparkEnv.get)
         .flatMap(_.conf.getOption("spark.port.maxRetries"))
         .map(_.toInt)
         .getOrElse(16)
     }
   }
 In configuration.md:
 <tr>
   <td><code>spark.port.maxRetries</code></td>
   <td>16</td>
   <td>
     Maximum number of retries when binding to a port before giving up.
   </td>
 </tr>






[jira] [Commented] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs

2014-09-17 Thread guowei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136999#comment-14136999
 ] 

guowei commented on SPARK-3292:
---

[~saisai_shao]
I tested the scenario with windowing operators, and it seems OK.

WindowedDStream computes based on the sliced RDDs cached between the from and 
end times, so it seems it does not need to commit a job.

 Shuffle Tasks run incessantly even though there's no inputs
 ---

 Key: SPARK-3292
 URL: https://issues.apache.org/jira/browse/SPARK-3292
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2
Reporter: guowei

 This affects operators such as repartition, groupBy, join, and cogroup, for 
 example.
 If I want to save the shuffle outputs as a Hadoop file, many empty files are 
 generated even though there is no input. It's too expensive.
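
One possible workaround sketch for the empty-output side of this; it does not 
change the scheduler behaviour discussed here, and the element type and output 
path are assumptions for illustration:

{code}
import org.apache.spark.streaming.dstream.DStream

// Only write a batch's output when it actually contains data, so empty batches
// do not keep generating empty part files.
def saveNonEmpty(stream: DStream[(String, Int)], path: String): Unit = {
  stream.foreachRDD { (rdd, time) =>
    if (rdd.take(1).nonEmpty) {
      rdd.saveAsTextFile(s"$path/batch-${time.milliseconds}")
    }
  }
}
{code}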






[jira] [Commented] (SPARK-3563) Shuffle data not always be cleaned

2014-09-17 Thread shenhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137014#comment-14137014
 ] 

shenhong commented on SPARK-3563:
-

Thanks, Sean Owen!
I haven't set spark.cleaner.ttl; maybe it will work. But my point is: why has 
the shuffle data of some stages been cleaned, while that of the others has not?

 Shuffle data not always be cleaned
 --

 Key: SPARK-3563
 URL: https://issues.apache.org/jira/browse/SPARK-3563
 Project: Spark
  Issue Type: Bug
  Components: core
Affects Versions: 1.0.2
Reporter: shenhong

 In our cluster, when we run a Spark Streaming job for many hours, not all of 
 the shuffle data seems to be cleaned up. Here is the shuffle data:
 -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
 -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
 -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
 -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
 -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
 -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
 -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
 -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
 -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
 -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
 -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
 -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
 -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1
 The shuffle data of stage 63, 65, 84, 132... are not cleaned.
 ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and 
 Broadcast of interest, to be processed when the associated object goes out of 
 scope of the application. The actual cleanup is performed in a separate daemon 
 thread.
 There must still be some reference to the ShuffleDependency, and it is hard to 
 find out where it is held.






[jira] [Comment Edited] (SPARK-3563) Shuffle data not always be cleaned

2014-09-17 Thread shenhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137014#comment-14137014
 ] 

shenhong edited comment on SPARK-3563 at 9/17/14 9:55 AM:
--

Thanks, Sean Owen!
I haven't set spark.cleaner.ttl; maybe it will work. But my point is: why has 
the shuffle data of some stages been cleaned, while that of the others has not?


was (Author: shenhong):
Thanks Sean Owen!
I don‘t have set spark.cleaner.ttl, maybe it will work, but my point is why 
some shuffle stage data have been cleaned, but the other are not.

 Shuffle data not always be cleaned
 --

 Key: SPARK-3563
 URL: https://issues.apache.org/jira/browse/SPARK-3563
 Project: Spark
  Issue Type: Bug
  Components: core
Affects Versions: 1.0.2
Reporter: shenhong

 In our cluster, when we run a Spark Streaming job for many hours, not all of 
 the shuffle data seems to be cleaned up. Here is the shuffle data:
 -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
 -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
 -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
 -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
 -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
 -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
 -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
 -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
 -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
 -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
 -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
 -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
 -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1
 The shuffle data of stage 63, 65, 84, 132... are not cleaned.
 ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and 
 Broadcast of interest, to be processed when the associated object goes out of 
 scope of the application. The actual cleanup is performed in a separate daemon 
 thread.
 There must still be some reference to the ShuffleDependency, and it is hard to 
 find out where it is held.






[jira] [Comment Edited] (SPARK-3563) Shuffle data not always be cleaned

2014-09-17 Thread shenhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137014#comment-14137014
 ] 

shenhong edited comment on SPARK-3563 at 9/17/14 9:59 AM:
--

Thanks, Sean Owen!
I haven't set spark.cleaner.ttl; maybe it will work. But my point is: why has 
the shuffle data of some stages been cleaned, while that of the others has not?


was (Author: shenhong):
Thanks Sean Owen!
I don‘t have set spark.cleaner.ttl, maybe it will work, but my point is why 
some shuffle stages data have been cleaned, but the others are not.

 Shuffle data not always be cleaned
 --

 Key: SPARK-3563
 URL: https://issues.apache.org/jira/browse/SPARK-3563
 Project: Spark
  Issue Type: Bug
  Components: core
Affects Versions: 1.0.2
Reporter: shenhong

 In our cluster, when we run a Spark Streaming job for many hours, not all of 
 the shuffle data seems to be cleaned up. Here is the shuffle data:
 -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
 -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
 -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
 -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
 -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
 -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
 -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
 -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
 -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
 -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
 -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
 -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
 -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1
 The shuffle data of stage 63, 65, 84, 132... are not cleaned.
 ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and 
 Broadcast of interest, to be processed when the associated object goes out of 
 scope of the application. The actual cleanup is performed in a separate daemon 
 thread.
 There must still be some reference to the ShuffleDependency, and it is hard to 
 find out where it is held.






[jira] [Commented] (SPARK-3567) appId field in SparkDeploySchedulerBackend should be volatile

2014-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137033#comment-14137033
 ] 

Apache Spark commented on SPARK-3567:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2428

 appId field in SparkDeploySchedulerBackend should be volatile
 -

 Key: SPARK-3567
 URL: https://issues.apache.org/jira/browse/SPARK-3567
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta

 The appId field in SparkDeploySchedulerBackend is set by 
 AppClient.ClientActor#receiveWithLogging, and appId is read via 
 SparkDeploySchedulerBackend#applicationId.
 The thread that runs AppClient.ClientActor and a thread invoking 
 SparkDeploySchedulerBackend#applicationId can be different threads, so appId 
 should be volatile.






[jira] [Created] (SPARK-3567) appId field in SparkDeploySchedulerBackend should be volatile

2014-09-17 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3567:
-

 Summary: appId field in SparkDeploySchedulerBackend should be 
volatile
 Key: SPARK-3567
 URL: https://issues.apache.org/jira/browse/SPARK-3567
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta


The appId field in SparkDeploySchedulerBackend is set by 
AppClient.ClientActor#receiveWithLogging, and appId is read via 
SparkDeploySchedulerBackend#applicationId.

The thread that runs AppClient.ClientActor and a thread invoking 
SparkDeploySchedulerBackend#applicationId can be different threads, so appId 
should be volatile.
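
A minimal sketch of the kind of change being proposed (illustrative only, not 
the exact Spark source):

{code}
class SchedulerBackendSketch {
  // Written from the AppClient actor thread, read by callers of applicationId(),
  // so it is marked volatile to guarantee visibility across threads.
  @volatile private var appId: String = _

  def connected(id: String): Unit = { appId = id }  // actor thread
  def applicationId(): String = appId               // other threads
}
{code}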






[jira] [Commented] (SPARK-1719) spark.executor.extraLibraryPath isn't applied on yarn

2014-09-17 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137176#comment-14137176
 ] 

Wilfred Spiegelenburg commented on SPARK-1719:
--

oops, not sure how I missed that last file change :-(

 spark.executor.extraLibraryPath isn't applied on yarn
 -

 Key: SPARK-1719
 URL: https://issues.apache.org/jira/browse/SPARK-1719
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves
Assignee: Guoqiang Li
 Fix For: 1.2.0


 Looking through the code for spark on yarn I don't see that 
 spark.executor.extraLibraryPath is being properly applied when it launches 
 executors.  It is using the spark.driver.libraryPath in the ClientBase.
 Note I didn't actually test it, so it's possible I missed something.
 I also think it is better to use LD_LIBRARY_PATH rather than 
 -Djava.library.path. Once java.library.path is set, it doesn't search 
 LD_LIBRARY_PATH. In Hadoop we switched to use LD_LIBRARY_PATH instead of 
 java.library.path. See https://issues.apache.org/jira/browse/MAPREDUCE-4072. 
 I'll split this into a separate jira.






[jira] [Commented] (SPARK-3563) Shuffle data not always be cleaned

2014-09-17 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137246#comment-14137246
 ] 

Saisai Shao commented on SPARK-3563:


In my opinion, it relies on the JVM's GC strategy; it is unpredictable when the 
object will be GC-ed and the shuffle files cleaned. In any case, the shuffle 
files will eventually be cleaned when the application exits. Just my thought, 
please correct me if I'm wrong.

 Shuffle data not always be cleaned
 --

 Key: SPARK-3563
 URL: https://issues.apache.org/jira/browse/SPARK-3563
 Project: Spark
  Issue Type: Bug
  Components: core
Affects Versions: 1.0.2
Reporter: shenhong

 In our cluster, when we run a Spark Streaming job for many hours, not all of 
 the shuffle data seems to be cleaned up. Here is the shuffle data:
 -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
 -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
 -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
 -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
 -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
 -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
 -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
 -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
 -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
 -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
 -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
 -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
 -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1
 The shuffle data of stage 63, 65, 84, 132... are not cleaned.
 ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and 
 Broadcast of interest, to be processed when the associated object goes out of 
 scope of the application. The actual cleanup is performed in a separate daemon 
 thread.
 There must still be some reference to the ShuffleDependency, and it is hard to 
 find out where it is held.






[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-17 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137326#comment-14137326
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

Patrick, thanks for following up.

Indeed Spark does provide a first-class extensibility mechanism at many 
different levels (shuffle, rdd, readers/writers, etc.); however, we believe it 
is missing a crucial one, and that is the “execution context”. And while 
SparkContext itself could easily be extended or mixed in with a custom trait to 
achieve such customization, it is a less than ideal extension mechanism, since 
it would require code modification every time the user wants to swap an 
execution environment (e.g., from “local” in testing to “yarn” in prod). 
And in fact Spark already supports an externally configurable model where the 
target execution environment is managed through the “master URL”. However, the 
_nature_, _implementation_ and most importantly _customization_ of these 
environments are internal to Spark. 
{code}
master match {
  case "yarn-client" =>
  case mesosUrl @ MESOS_REGEX(_) =>
  . . .
}
{code}
Furthermore, any additional integration and/or customization work that may 
come in the future would require modification to the above _case_ statement, 
which I am sure you’d agree is a less than ideal integration style, since it 
would require a new release of Spark every time a new _case_ statement is 
added. So essentially what we’re proposing is to formalize what has always been 
supported by Spark into an externally configurable model so that customization 
around the _*native functionality*_ of the target execution environment could 
be handled in a flexible and pluggable way.

So in this model we are simply proposing a variation of the “chain of 
responsibility” pattern, where DAG execution could be delegated to an 
_execution context_ with no change to end user programs or semantics. 
Based on our investigation we’ve identified 4 core operations which you can see 
in _JobExecutionContext_.
Two of them provide access to source RDD creation thus allowing customization 
of data _sourcing_ (custom readers, direct block access etc.).  One for 
_broadcast_ to integrate with broadcast capabilities provided natively. And 
last but not least is the main _execution delegate_ for the job - “runJob”.

And while I am sure there will be more questions, I hope the above response 
clarifies the overall intention of this proposal.



 Native Hadoop/YARN integration for batch/ETL workloads
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide a custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well
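
A rough sketch of what the proposed trait might look like, based only on the 
four operations named above (hadoopFile, newAPIHadoopFile, broadcast, runJob); 
the signatures here are guesses, and the attached design doc and pull request 
are authoritative:

{code}
import scala.reflect.ClassTag

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical shape of the pluggable execution context; not the actual patch.
trait JobExecutionContextSketch {
  def hadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def newAPIHadoopFile[K, V](sc: SparkContext, path: String, conf: Configuration): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U]
}
{code}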






[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-09-17 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137338#comment-14137338
 ] 

Helena Edelson commented on SPARK-2593:
---

[~pwendell] I forgot to note this in my reply above, but I was offering to do 
the work myself - or at least start it.

What would be ideal is to have both of these:
- Expose the Spark actor system, which also requires ensuring all Spark actors 
are on specified dispatchers (very important)
- Optionally allow spark users to pass in their existing actor system

I feel both are incredibly important for users.

 Add ability to pass an existing Akka ActorSystem into Spark
 ---

 Key: SPARK-2593
 URL: https://issues.apache.org/jira/browse/SPARK-2593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Helena Edelson

 As a developer I want to pass an existing ActorSystem into StreamingContext 
 in load-time so that I do not have 2 actor systems running on a node in an 
 Akka application.
 This would mean having spark's actor system on its own named-dispatchers as 
 well as exposing the new private creation of its own actor system.
   
  






[jira] [Comment Edited] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-09-17 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137338#comment-14137338
 ] 

Helena Edelson edited comment on SPARK-2593 at 9/17/14 2:42 PM:


[~pwendell] I forgot to note this on my reply above but I was offering to do 
the work myself - or at least start it.

What would be ideal is to have both of these:
- Expose the Spark actor system, which also requires ensuring all Spark actors 
are on specified dispatchers (very important)
- Optionally allow spark users to pass in their existing actor system

I feel both are incredibly important for users.


was (Author: helena_e):
[~pwendell] I forgot to not this on my reply above but I was offering to do the 
work myself - or at least start it.

What would be ideal is to have both of these:
- Expose the spark actor system which also requires insuring all spark actors 
are on specified dispatchers (very important)
- Optionally allow spark users to pass in their existing actor system

I feel both are incredibly important for users.

 Add ability to pass an existing Akka ActorSystem into Spark
 ---

 Key: SPARK-2593
 URL: https://issues.apache.org/jira/browse/SPARK-2593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Helena Edelson

 As a developer I want to pass an existing ActorSystem into StreamingContext 
 in load-time so that I do not have 2 actor systems running on a node in an 
 Akka application.
 This would mean having spark's actor system on its own named-dispatchers as 
 well as exposing the new private creation of its own actor system.
   
  






[jira] [Comment Edited] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-09-17 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137338#comment-14137338
 ] 

Helena Edelson edited comment on SPARK-2593 at 9/17/14 2:44 PM:


[~pwendell] I forgot to note this on my reply above but I was offering to do 
the work myself - or at least start it.

What would be ideal is to have both of these:
- Expose the Spark actor system, which also requires ensuring all Spark actors 
are on specified dispatchers (very important)
- Optionally allow spark users to pass in their existing actor system
- Add a logical naming convention for spark streaming actors or a function to 
get it

I feel both are incredibly important for users.


was (Author: helena_e):
[~pwendell] I forgot to note this on my reply above but I was offering to do 
the work myself - or at least start it.

What would be ideal is to have both of these:
- Expose the spark actor system which also requires insuring all spark actors 
are on specified dispatchers (very important)
- Optionally allow spark users to pass in their existing actor system

I feel both are incredibly important for users.

 Add ability to pass an existing Akka ActorSystem into Spark
 ---

 Key: SPARK-2593
 URL: https://issues.apache.org/jira/browse/SPARK-2593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Helena Edelson

 As a developer I want to pass an existing ActorSystem into StreamingContext 
 in load-time so that I do not have 2 actor systems running on a node in an 
 Akka application.
 This would mean having spark's actor system on its own named-dispatchers as 
 well as exposing the new private creation of its own actor system.
   
  






[jira] [Commented] (SPARK-3563) Shuffle data not always be cleaned

2014-09-17 Thread shenhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137348#comment-14137348
 ] 

shenhong commented on SPARK-3563:
-

Thanks, Saisai. I think you are right, it depends on the JVM's GC. But in my 
streaming job there is one batch per minute and two stages per batch, including 
a shuffle stage. In the first hour (60 batches) the shuffle data was cleaned, 
but after that the shuffle data is not always cleaned, and the streaming job 
won't stop.

 Shuffle data not always be cleaned
 --

 Key: SPARK-3563
 URL: https://issues.apache.org/jira/browse/SPARK-3563
 Project: Spark
  Issue Type: Bug
  Components: core
Affects Versions: 1.0.2
Reporter: shenhong

 In our cluster, when we run a Spark Streaming job for many hours, not all of 
 the shuffle data seems to be cleaned up. Here is the shuffle data:
 -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
 -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
 -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
 -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
 -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
 -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
 -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
 -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
 -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
 -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
 -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
 -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
 -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1
 The shuffle data of stage 63, 65, 84, 132... are not cleaned.
 ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and 
 Broadcast of interest, to be processed when the associated object goes out of 
 scope of the application. The actual cleanup is performed in a separate daemon 
 thread.
 There must still be some reference to the ShuffleDependency, and it is hard to 
 find out where it is held.






[jira] [Comment Edited] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-17 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137326#comment-14137326
 ] 

Oleg Zhurakousky edited comment on SPARK-3561 at 9/17/14 2:55 PM:
--

Patrick, thanks for following up.

Indeed Spark does provide a first-class extensibility mechanism at many 
different levels (shuffle, rdd, readers/writers, etc.); however, we believe it 
is missing a crucial one, and that is the “execution context”. And while 
SparkContext itself could easily be extended or mixed in with a custom trait to 
achieve such customization, it is a less than ideal extension mechanism, since 
it would require code modification every time the user wants to swap an 
execution environment (e.g., from “local” in testing to “yarn” in prod) if such 
an environment is not supported.
And in fact Spark already supports an externally configurable model where the 
target execution environment is managed through the “master URL”. However, the 
_nature_, _implementation_ and most importantly _customization_ of these 
environments are internal to Spark. 
{code}
master match {
  case "yarn-client" =>
  case mesosUrl @ MESOS_REGEX(_) =>
  . . .
}
{code}
Furthermore, any additional integration and/or customization work that may 
come in the future would require modification to the above _case_ statement, 
which I am sure you’d agree is a less than ideal integration style, since it 
would require a new release of Spark every time a new _case_ statement is 
added. So essentially what we’re proposing is to formalize what has always been 
supported by Spark into an externally configurable model so that customization 
around the _*native functionality*_ of the target execution environment could 
be handled in a flexible and pluggable way.

So in this model we are simply proposing a variation of the "chain of 
responsibility" pattern, where DAG execution can be delegated to an _execution 
context_ with no change to end-user programs or semantics. 
Based on our investigation we've identified 4 core operations, which you can see 
in _JobExecutionContext_.
Two of them provide access to source RDD creation, thus allowing customization 
of data _sourcing_ (custom readers, direct block access, etc.). One is for 
_broadcast_, to integrate with natively provided broadcast capabilities. And 
last but not least is the main _execution delegate_ for the job - "runJob".

And while I am sure there will be more questions, I hope the above response 
clarifies the overall intention of this proposal.




was (Author: ozhurakousky):
Patrick, thanks for following up.

Indeed, Spark does provide first-class extensibility mechanisms at many different 
levels (shuffle, RDD, readers/writers, etc.); however, we believe it is missing 
a crucial one, and that is the "execution context". And while SparkContext 
itself could easily be extended or mixed in with a custom trait to achieve such 
customization, that is a less than ideal extension mechanism, since it would 
require code modification every time a user wants to swap an execution 
environment (e.g., from "local" in testing to "yarn" in prod). 
In fact, Spark already supports an externally configurable model where the 
target execution environment is managed through the "master" URL. However, the 
_nature_, _implementation_ and, most importantly, _customization_ of these 
environments are internal to Spark:
{code}
master match {
  case "yarn-client" =>
  case mesosUrl @ MESOS_REGEX(_) =>
  . . .
}
{code}
Furthermore, any additional integration and/or customization work that may 
come in the future would require modification to the above _case_ statement, 
which I am sure you'd agree is a less than ideal integration style, since it 
would require a new release of Spark every time a new _case_ statement is added. 
So essentially what we're proposing is to formalize what has always been 
supported by Spark into an externally configurable model, so that customization 
around the _*native functionality*_ of the target execution environment can be 
handled in a flexible and pluggable way.

So in this model we are simply proposing a variation of the "chain of 
responsibility" pattern, where DAG execution can be delegated to an _execution 
context_ with no change to end-user programs or semantics. 
Based on our investigation we've identified 4 core operations, which you can see 
in _JobExecutionContext_.
Two of them provide access to source RDD creation, thus allowing customization 
of data _sourcing_ (custom readers, direct block access, etc.). One is for 
_broadcast_, to integrate with natively provided broadcast capabilities. And 
last but not least is the main _execution delegate_ for the job - "runJob".

And while I am sure there will be more questions, I hope the above response 
clarifies the overall intention of this proposal.



 Native Hadoop/YARN integration for batch/ETL workloads
 

[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-17 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137378#comment-14137378
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

Patrick, sorry, I feel I missed the core emphasis of what we are trying 
to accomplish with this.
Our main goal is to expose Spark to native Hadoop features (i.e., stateless 
YARN shuffle, Tez, etc.), thus extending Spark's existing capabilities, such as 
interactive, in-memory and streaming workloads, to batch and ETL in shared, 
multi-tenant environments, and benefiting the Spark community considerably by 
allowing Spark to be applied to all use cases and capabilities on and in Hadoop.

 Native Hadoop/YARN integration for batch/ETL workloads
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to the Hadoop execution environment - as a non-public 
 API (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL such as 
 execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
 containing the existing code from SparkContext, thus allowing the current 
 (corresponding) methods of SparkContext to delegate to such an implementation. 
 An integrator will now have the option to provide a custom implementation of 
 DefaultExecutionContext, by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc for more details.
 A pull request will be posted shortly as well.
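 A minimal sketch of what such a trait could look like, inferred from the four 
 operations listed above; the type parameters, signatures, and the master-URL 
 dispatch below are placeholders for illustration only, not the actual proposal 
 (see the attached design doc and pull request for that).
 {code}
 import scala.language.higherKinds

 // Hypothetical shape of the proposed trait; RDDLike/BroadcastLike abstract over
 // Spark's RDD and Broadcast types purely to keep this sketch self-contained.
 trait JobExecutionContext[RDDLike[_], BroadcastLike[_]] {
   def hadoopFile[K, V](path: String): RDDLike[(K, V)]
   def newAPIHadoopFile[K, V](path: String): RDDLike[(K, V)]
   def broadcast[T](value: T): BroadcastLike[T]
   def runJob[T, U](rdd: RDDLike[T], func: Iterator[T] => U): Seq[U]
 }

 // Sketch of how a master URL of the form "execution-context:<class name>" could
 // select an implementation, mirroring the existing match on master strings.
 object ExecutionContextLoader {
   private val ExecutionContextUrl = """execution-context:(.+)""".r

   def loadClassName(master: String): Option[String] = master match {
     case ExecutionContextUrl(className) => Some(className)
     case _                              => None // fall back to built-in scheduling
   }
 }
 {code}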



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-17 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137378#comment-14137378
 ] 

Oleg Zhurakousky edited comment on SPARK-3561 at 9/17/14 3:24 PM:
--

Patrick, sorry, I feel I missed the core emphasis of what we are trying 
to accomplish with this.
As described in the attached design document, our main goal is to expose Spark 
to native Hadoop features (i.e., stateless YARN shuffle, Tez, etc.), thus 
extending Spark's existing capabilities, such as interactive, in-memory and 
streaming workloads, to batch and ETL in shared, multi-tenant environments, and 
benefiting the Spark community considerably by allowing Spark to be applied to 
all use cases and capabilities on and in Hadoop.


was (Author: ozhurakousky):
Patrick, sorry, I feel I missed the core emphasis of what we are trying 
to accomplish with this.
Our main goal is to expose Spark to native Hadoop features (i.e., stateless 
YARN shuffle, Tez, etc.), thus extending Spark's existing capabilities, such as 
interactive, in-memory and streaming workloads, to batch and ETL in shared, 
multi-tenant environments, and benefiting the Spark community considerably by 
allowing Spark to be applied to all use cases and capabilities on and in Hadoop.

 Native Hadoop/YARN integration for batch/ETL workloads
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to the Hadoop execution environment - as a non-public 
 API (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding method in the current version of 
 SparkContext. The JobExecutionContext implementation will be accessed by 
 SparkContext via a master URL such as 
 execution-context:foo.bar.MyJobExecutionContext, with the default implementation 
 containing the existing code from SparkContext, thus allowing the current 
 (corresponding) methods of SparkContext to delegate to such an implementation. 
 An integrator will now have the option to provide a custom implementation of 
 DefaultExecutionContext, by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc for more details.
 A pull request will be posted shortly as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3177) Yarn-alpha ClientBaseSuite Unit test failed

2014-09-17 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-3177.
--
   Resolution: Fixed
Fix Version/s: (was: 1.1.1)
   1.2.0

 Yarn-alpha ClientBaseSuite Unit test failed
 ---

 Key: SPARK-3177
 URL: https://issues.apache.org/jira/browse/SPARK-3177
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.1
Reporter: Chester
Priority: Minor
  Labels: test
 Fix For: 1.2.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 The yarn-alpha ClientBaseSuite unit test failed due to differences in the 
 MRJobConfig API between yarn-stable and yarn-alpha. 
 The class field
 MRJobConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH 
 returns a String array in yarn-alpha, but a String in yarn-stable. 
 The method below works for yarn-stable but fails on yarn-alpha, as it tries to 
 cast a String array to a String: 
 val knownDefMRAppCP: Seq[String] =
   getFieldValue[String, Seq[String]](classOf[MRJobConfig],
     "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH",
     Seq[String]())(a => a.split(","))
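 A small, self-contained sketch of how such a lookup could tolerate both shapes 
 of the field (a String on yarn-stable, a String array on yarn-alpha); the helper 
 name and the fallback are assumptions for illustration, not the suite's actual 
 getFieldValue:
 {code}
 object ClasspathFieldDemo {
   // Hypothetical helper: read a static field that may be a String or an
   // Array[String] and normalize both shapes to Seq[String].
   def classpathField(clazz: Class[_], fieldName: String): Seq[String] =
     try {
       clazz.getField(fieldName).get(null) match {
         case s: String        => s.split(",").toSeq
         case a: Array[String] => a.toSeq
         case _                => Seq.empty
       }
     } catch {
       case _: NoSuchFieldException => Seq.empty
     }

   def main(args: Array[String]): Unit = {
     // Example against a JDK class, just to show the call shape:
     println(classpathField(classOf[java.io.File], "pathSeparator")) // List(:) on Unix
   }
 }
 {code}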



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-09-17 Thread kannapiran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137529#comment-14137529
 ] 

kannapiran commented on SPARK-2593:
---

Is there a way to add an Akka actor system into the Spark streaming context and 
write the output to HDFS from Spark Streaming?

I tried 

ActorReceiver rec = new ActorReceiver(props, "TestActor", 
StorageLevel.MEMORY_AND_DISK_2(), SupervisorStrategy.defaultStrategy(), null);
ReceiverInputDStream<Object> receiverStream = context.receiverStream(rec, null);
receiverStream.print();

But the above code doesn't stream anything from the actor system.

 Add ability to pass an existing Akka ActorSystem into Spark
 ---

 Key: SPARK-2593
 URL: https://issues.apache.org/jira/browse/SPARK-2593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Helena Edelson

 As a developer I want to pass an existing ActorSystem into StreamingContext 
 in load-time so that I do not have 2 actor systems running on a node in an 
 Akka application.
 This would mean having spark's actor system on its own named-dispatchers as 
 well as exposing the new private creation of its own actor system.
   
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3074) support groupByKey() with hot keys in PySpark

2014-09-17 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3074:
--
Component/s: PySpark

 support groupByKey() with hot keys in PySpark
 -

 Key: SPARK-3074
 URL: https://issues.apache.org/jira/browse/SPARK-3074
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3371) Spark SQL: Renaming a function expression with group by gives error

2014-09-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3371:

Target Version/s: 1.2.0

 Spark SQL: Renaming a function expression with group by gives error
 ---

 Key: SPARK-3371
 URL: https://issues.apache.org/jira/browse/SPARK-3371
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Pei-Lun Lee

 {code}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val rdd = sc.parallelize(List("""{"foo":"bar"}"""))
  sqlContext.jsonRDD(rdd).registerAsTable("t1")
  sqlContext.registerFunction("len", (s: String) => s.length)
  sqlContext.sql("select len(foo) as a, count(1) from t1 group by len(foo)").collect()
 {code}
 running above code in spark-shell gives the following error
 {noformat}
 14/09/03 17:20:13 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 214)
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
 attribute, tree: foo#0
   at 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
   at 
 org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)
   at 
 org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:201)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:199)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 {noformat}
 Removing "as a" from the query causes no error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3537) Statistics for cached RDDs

2014-09-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3537:

Assignee: Cheng Lian

 Statistics for cached RDDs
 --

 Key: SPARK-3537
 URL: https://issues.apache.org/jira/browse/SPARK-3537
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian

 Right now we only have limited statistics for hive tables.  We could easily 
 collect this data when caching an RDD as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3377) Metrics can be accidentally aggregated against our intention

2014-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137669#comment-14137669
 ] 

Apache Spark commented on SPARK-3377:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2432

 Metrics can be accidentally aggregated against our intention
 

 Key: SPARK-3377
 URL: https://issues.apache.org/jira/browse/SPARK-3377
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Critical

 I'm using the Codahale-based MetricsSystem of Spark with JMX or Graphite, and I 
 saw the following 2 problems.
 (1) When applications that have the same spark.app.name run on the cluster at 
 the same time, some metric names are mixed up. For instance, if 2+ applications 
 are running on the cluster at the same time, each application emits a metric 
 with the same name, like SparkPi.DAGScheduler.stage.failedStages, and Graphite 
 cannot distinguish which application a metric belongs to.
 (2) When 2+ executors run on the same machine, the JVM metrics of the executors 
 are mixed up. For instance, 2+ executors running on the same node can emit the 
 same metric name, jvm.memory, and Graphite cannot distinguish which executor a 
 metric comes from.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3568) Add metrics for ranking algorithms

2014-09-17 Thread Shuo Xiang (JIRA)
Shuo Xiang created SPARK-3568:
-

 Summary: Add metrics for ranking algorithms
 Key: SPARK-3568
 URL: https://issues.apache.org/jira/browse/SPARK-3568
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Shuo Xiang


Include widely-used metrics for ranking algorithms, including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2902) Change default options to be more agressive

2014-09-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2902:

Summary: Change default options to be more agressive  (was: Enable 
compression for in-memory columnar storage by default)

 Change default options to be more agressive
 ---

 Key: SPARK-2902
 URL: https://issues.apache.org/jira/browse/SPARK-2902
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 1.0.1, 1.0.2
Reporter: Cheng Lian
Assignee: Cheng Lian

 Compression for in-memory columnar storage is disabled by default; it's time 
 to enable it. Also, it helps alleviate the OOM mentioned in SPARK-2650.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2902) Change default options to be more agressive

2014-09-17 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137698#comment-14137698
 ] 

Michael Armbrust commented on SPARK-2902:
-

Things we might consider changing: Parquet auto conversion, increasing the batch 
size, turning on compression.

As part of this it would be great if we read from spark defaults at startup so 
these could be turned off without recompiling.

 Change default options to be more agressive
 ---

 Key: SPARK-2902
 URL: https://issues.apache.org/jira/browse/SPARK-2902
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 1.0.1, 1.0.2
Reporter: Cheng Lian
Assignee: Cheng Lian

 Compression for in-memory columnar storage is disabled by default; it's time 
 to enable it. Also, it helps alleviate the OOM mentioned in SPARK-2650.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2271) Use Hive's high performance Decimal128 to replace BigDecimal

2014-09-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2271:

Target Version/s: 1.2.0

 Use Hive's high performance Decimal128 to replace BigDecimal
 

 Key: SPARK-2271
 URL: https://issues.apache.org/jira/browse/SPARK-2271
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Lian

 Hive JIRA: https://issues.apache.org/jira/browse/HIVE-6017



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2063) Creating a SchemaRDD via sql() does not correctly resolve nested types

2014-09-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2063.
-
   Resolution: Duplicate
Fix Version/s: 1.2.0

 Creating a SchemaRDD via sql() does not correctly resolve nested types
 --

 Key: SPARK-2063
 URL: https://issues.apache.org/jira/browse/SPARK-2063
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Cheng Lian
 Fix For: 1.2.0


 For example, from the typical twitter dataset:
 {code}
  scala> val popularTweets = sql("SELECT retweeted_status.text, 
  MAX(retweeted_status.retweet_count) AS s FROM tweets WHERE retweeted_status 
  is not NULL GROUP BY retweeted_status.text ORDER BY s DESC LIMIT 30")
  scala> popularTweets.toString
 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for 
 batch MultiInstanceRelations
 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for 
 batch CaseInsensitiveAttributeReferences
 org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
 qualifiers on unresolved object, tree: 'retweeted_status.text
   at 
 org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:51)
   at 
 org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:47)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:67)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:65)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:65)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:100)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:97)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:51)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:65)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:64)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:69)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:40)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:97)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:94)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:217)
   at 
 

[jira] [Resolved] (SPARK-1694) Simplify ColumnBuilder/Accessor class hierarchy

2014-09-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-1694.
-
Resolution: Fixed

 Simplify ColumnBuilder/Accessor class hierarchy
 ---

 Key: SPARK-1694
 URL: https://issues.apache.org/jira/browse/SPARK-1694
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Minor
 Fix For: 1.2.0


 The current {{ColumnBuilder/Accessor}} class hierarchy design was largely 
 refactored from the in-memory columnar storage component of Shark. Code 
 related to null values and compression was factored into 
 {{NullableColumnBuilder/Accessor}} and {{CompressibleColumnBuilder/Accessor}} 
 and then mixed in as stackable traits. The drawbacks are:
 # Interactions among these classes are unnecessarily complicated and error 
 prone.
 # The flexibility provided by this design now seems useless.
 To simplify this, we can merge {{CompressibleColumnBuilder/Accessor}} and 
 {{NullableColumnBuilder/Accessor}} into {{NativeColumnBuilder/Accessor}} and 
 simply hard-code the null value processing and compression logic together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms

2014-09-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3568:
-
Priority: Minor  (was: Major)

 Add metrics for ranking algorithms
 --

 Key: SPARK-3568
 URL: https://issues.apache.org/jira/browse/SPARK-3568
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Shuo Xiang
Assignee: Shuo Xiang
Priority: Minor

 Include widely-used metrics for ranking algorithms, including:
  - Mean Average Precision
  - Precision@n: top-n precision
  - Discounted cumulative gain (DCG) and NDCG 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms

2014-09-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3568:
-
Assignee: Shuo Xiang

 Add metrics for ranking algorithms
 --

 Key: SPARK-3568
 URL: https://issues.apache.org/jira/browse/SPARK-3568
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Shuo Xiang
Assignee: Shuo Xiang

 Include widely-used metrics for ranking algorithms, including:
  - Mean Average Precision
  - Precision@n: top-n precision
  - Discounted cumulative gain (DCG) and NDCG 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3530) Pipeline and Parameters

2014-09-17 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137719#comment-14137719
 ] 

Matei Zaharia commented on SPARK-3530:
--

To comment on the versioning stuff here, deprecated doesn't mean unsupported, 
it just means we encourage using something else. So the old MLlib API will 
remain in 1.x, and will continue getting tested and bug-fixed, but it will not 
get new features.

 Pipeline and Parameters
 ---

 Key: SPARK-3530
 URL: https://issues.apache.org/jira/browse/SPARK-3530
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical

 This part of the design doc is for pipelines and parameters. I put the design 
 doc at
 https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
 I will copy the proposed interfaces to this JIRA later. Some sample code can 
 be viewed at: https://github.com/mengxr/spark-ml/
 Please help review the design and post your comments here. Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2014-09-17 Thread Ezequiel Bella (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ezequiel Bella updated SPARK-3553:
--
Description: 
We have a spark streaming app deployed in a YARN ec2 cluster with 1 name node 
and 2 data nodes. We submit the app with 11 executors with 1 core and 588 MB of 
RAM each.
The app streams from a directory in S3 which is constantly being written; this 
is the line of code that achieves that:

val lines = ssc.fileStream[LongWritable, Text, 
TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true)

The purpose of using fileStream instead of textFileStream is to customize the 
way that spark handles existing files when the process starts. We want to 
process just the new files that are added after the process launched and omit 
the existing ones. We configured a batch duration of 10 seconds.
The process goes fine while we add a small number of files to s3, let's say 4 
or 5. We can see in the streaming UI how the stages are executed successfully 
in the executors, one for each file that is processed. But when we try to add a 
larger number of files, we face a strange behavior; the application starts 
streaming files that have already been streamed. 
For example, I add 20 files to s3. The files are processed in 3 batches. The 
first batch processes 7 files, the second 8 and the third 5. No more files are 
added to S3 at this point, but Spark starts repeating these phases endlessly 
with the same files.
Any thoughts what can be causing this?
Regards,
Easyb


  was:
We have a spark streaming app deployed in a YARN ec2 cluster with 1 name node 
and 2 data nodes. We submit the app with 11 executors with 1 core and 588 MB of 
RAM each.
The app streams from a directory in S3 which is constantly being written; this 
is the line of code that achieves that:

val lines = ssc.fileStream[LongWritable, Text, 
TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true)

The purpose of using fileStream instead of textFileStream is to customize the 
way that spark handles existing files when the process starts. We want to 
process just the new files that are added after the process launched and omit 
the existing ones. We configured a batch duration of 10 seconds.
The process goes fine while we add a small number of files to s3, let's say 4 
or 5. We can see in the streaming UI how the stages are executed successfully 
in the executors, one for each file that is processed. But when we try to add a 
larger number of files, we face a strange behavior; the application starts 
streaming files that have already being streamed. 
For example, I add 20 files to s3. The files are processed in 3 batches. The 
first batch processes 7 files, the second 8 and the third 5. No more files are 
added to S3 at this point, but spark start repeating these phases endlessly.
Any thoughts what can be causing this?
Regards,
Easyb



 Spark Streaming app streams files that have already been streamed in an 
 endless loop
 

 Key: SPARK-3553
 URL: https://issues.apache.org/jira/browse/SPARK-3553
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.1
 Environment: Ec2 cluster - YARN
Reporter: Ezequiel Bella
  Labels: S3, Streaming, YARN

 We have a spark streaming app deployed in a YARN ec2 cluster with 1 name node 
 and 2 data nodes. We submit the app with 11 executors with 1 core and 588 MB 
 of RAM each.
 The app streams from a directory in S3 which is constantly being written; 
 this is the line of code that achieves that:
 val lines = ssc.fileStream[LongWritable, Text, 
 TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true)
 The purpose of using fileStream instead of textFileStream is to customize the 
 way that spark handles existing files when the process starts. We want to 
 process just the new files that are added after the process launched and omit 
 the existing ones. We configured a batch duration of 10 seconds.
 The process goes fine while we add a small number of files to s3, let's say 4 
 or 5. We can see in the streaming UI how the stages are executed successfully 
 in the executors, one for each file that is processed. But when we try to add 
 a larger number of files, we face a strange behavior; the application starts 
 streaming files that have already been streamed. 
 For example, I add 20 files to s3. The files are processed in 3 batches. The 
 first batch processes 7 files, the second 8 and the third 5. No more files 
 are added to S3 at this point, but Spark starts repeating these phases 
 endlessly with the same files.
 Any thoughts what can be causing this?
 Regards,
 Easyb



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-3530) Pipeline and Parameters

2014-09-17 Thread Eustache (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137758#comment-14137758
 ] 

Eustache commented on SPARK-3530:
-

Great to see the design docs !

A few questions/remarks:

- Big +1 for Pipeline and Dataset as first-class abstractions. As a long-time 
sklearn user, I find Pipelines a very convenient way to think about many 
problems, e.g. implementing cascades of models, integrating unsupervised steps 
for feature transformation into a supervised task, etc.

- Isn't the "fit multiple models at once" part a bit of an early optimization? 
How many users would benefit from it? IMHO it complicates the API for most 
users.

- I'm also wondering whether a meta class wouldn't be capable of handling 
multiple models. AFAICT fitting multiple models at once resembles a parameter 
grid search, doesn't it? I assume the latter would return evaluation metrics for 
each parameter set as well as the model itself, right?

- It seems to me that multi-task learning would be a good example of fitting 
multiple models at once, but it is maybe not typical of what most users 
would want. Also, I'm not 100% sure the implementation would necessarily profit 
from such an API.

 Pipeline and Parameters
 ---

 Key: SPARK-3530
 URL: https://issues.apache.org/jira/browse/SPARK-3530
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical

 This part of the design doc is for pipelines and parameters. I put the design 
 doc at
 https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
 I will copy the proposed interfaces to this JIRA later. Some sample code can 
 be viewed at: https://github.com/mengxr/spark-ml/
 Please help review the design and post your comments here. Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format

2014-09-17 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137759#comment-14137759
 ] 

Zhan Zhang commented on SPARK-2883:
---

I am starting to prototype the last feature with OrcFile and saveAsOrcFile.

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Reporter: Zhan Zhang
Priority: Blocker
 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
 pm jobtracker.png


 Verify the support of OrcInputFormat in Spark, fix issues if they exist, and 
 add documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3

2014-09-17 Thread lee mighdoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137780#comment-14137780
 ] 

lee mighdoll commented on SPARK-2707:
-

Now that 1.1 is out, hopefully a solution for early 1.2 milestone builds? 
Perhaps offer a contract for someone to make a shader plugin for sbt.

I suspect some people still need scala 2.10.x and some need 2.11, so I'd 
recommend a cross build: 
http://www.scala-sbt.org/0.13.5/docs/Detailed-Topics/Cross-Build.html



 Upgrade to Akka 2.3
 ---

 Key: SPARK-2707
 URL: https://issues.apache.org/jira/browse/SPARK-2707
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Yardena

 Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray 
 features directly in the same project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3569) Add metadata field to StructField

2014-09-17 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3569:


 Summary: Add metadata field to StructField
 Key: SPARK-3569
 URL: https://issues.apache.org/jira/browse/SPARK-3569
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Xiangrui Meng


Want to add a metadata field to StructField that can be used by other 
applications like ML to embed more information about the column.

{code}
case class StructField(name: String, dataType: DataType, nullable: Boolean, 
  metadata: Map[String, Any] = Map.empty)
{code}

For ML, we can store feature information like categorical/continuous, number of 
categories, category-to-index map, etc.
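A hypothetical usage sketch of the proposal, assuming the metadata parameter 
lands as written above; DataType/StringType below are simplified stand-ins so the 
example compiles on its own, and the column name and map keys are made up for 
illustration.

{code}
// Self-contained sketch of a StructField carrying extra column metadata.
sealed trait DataType
case object StringType extends DataType

case class StructField(
    name: String,
    dataType: DataType,
    nullable: Boolean,
    metadata: Map[String, Any] = Map.empty)

object MetadataDemo extends App {
  // ML could attach column-level information without a side channel:
  val categoryCol = StructField(
    "category", StringType, nullable = false,
    metadata = Map("type" -> "categorical", "numCategories" -> 12))

  println(categoryCol.metadata("numCategories")) // 12
}
{code}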




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3534) Avoid running MLlib and Streaming tests when testing SQL PRs

2014-09-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3534.
-
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Nicholas Chammas

 Avoid running MLlib and Streaming tests when testing SQL PRs
 

 Key: SPARK-3534
 URL: https://issues.apache.org/jira/browse/SPARK-3534
 Project: Spark
  Issue Type: Bug
  Components: Project Infra, SQL
Reporter: Michael Armbrust
Assignee: Nicholas Chammas
Priority: Blocker
 Fix For: 1.2.0


 We are bumping up against the 120 minute time limit for tests pretty 
 regularly now.  Since we have decreased the number of shuffle partitions and 
 upped the parallelism, I don't think there is much low-hanging fruit to speed 
 up the SQL tests. (The tests that are listed as taking 2-3 minutes are 
 actually 100s of tests that I think are valuable).  Instead I propose we 
 avoid running tests that we don't need to.
 This will have the added benefit of eliminating failures in SQL due to flaky 
 streaming tests.
 Note that this won't fix the full builds that are run for every commit.  
 There I think we should just up the test timeout.
 cc: [~joshrosen] [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3569) Add metadata field to StructField

2014-09-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3569:
-
Component/s: MLlib
 ML

 Add metadata field to StructField
 -

 Key: SPARK-3569
 URL: https://issues.apache.org/jira/browse/SPARK-3569
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib, SQL
Reporter: Xiangrui Meng

 Want to add a metadata field to StructField that can be used by other 
 applications like ML to embed more information about the column.
 {code}
 case class StructField(name: String, dataType: DataType, nullable: Boolean, 
   metadata: Map[String, Any] = Map.empty)
 {code}
 For ML, we can store feature information like categorical/continuous, number of 
 categories, category-to-index map, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3570) Shuffle write time does not include time to open shuffle files

2014-09-17 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-3570:
-

 Summary: Shuffle write time does not include time to open shuffle 
files
 Key: SPARK-3570
 URL: https://issues.apache.org/jira/browse/SPARK-3570
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0, 1.0.2, 0.9.2
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout


Currently, the reported shuffle write time does not include time to open the 
shuffle files.  This time can be very significant when the disk is highly 
utilized and many shuffle files exist on the machine (I'm not sure how severe 
this is in 1.0 onward -- since shuffle files are automatically deleted, this 
may be less of an issue because there are fewer old files sitting around).  In 
experiments I did, in extreme cases, adding the time to open files can increase 
the shuffle write time from 5ms (of a 2 second task) to 1 second.  We should 
fix this for better performance debugging.

Thanks [~shivaram] for helping to diagnose this problem.  cc [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3570) Shuffle write time does not include time to open shuffle files

2014-09-17 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-3570:
--
Attachment: 3a_1410957857_0_job_log_waterfall.pdf

 Shuffle write time does not include time to open shuffle files
 --

 Key: SPARK-3570
 URL: https://issues.apache.org/jira/browse/SPARK-3570
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.2, 1.0.2, 1.1.0
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
 Attachments: 3a_1410854905_0_job_log_waterfall.pdf, 
 3a_1410943402_0_job_log_waterfall.pdf, 3a_1410957857_0_job_log_waterfall.pdf


 Currently, the reported shuffle write time does not include time to open the 
 shuffle files.  This time can be very significant when the disk is highly 
 utilized and many shuffle files exist on the machine (I'm not sure how severe 
 this is in 1.0 onward -- since shuffle files are automatically deleted, this 
 may be less of an issue because there are fewer old files sitting around).  
 In experiments I did, in extreme cases, adding the time to open files can 
 increase the shuffle write time from 5ms (of a 2 second task) to 1 second.  
 We should fix this for better performance debugging.
 Thanks [~shivaram] for helping to diagnose this problem.  cc [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3570) Shuffle write time does not include time to open shuffle files

2014-09-17 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-3570:
--
Attachment: (was: 3a_1410943402_0_job_log_waterfall.pdf)

 Shuffle write time does not include time to open shuffle files
 --

 Key: SPARK-3570
 URL: https://issues.apache.org/jira/browse/SPARK-3570
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.2, 1.0.2, 1.1.0
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
 Attachments: 3a_1410854905_0_job_log_waterfall.pdf, 
 3a_1410957857_0_job_log_waterfall.pdf


 Currently, the reported shuffle write time does not include time to open the 
 shuffle files.  This time can be very significant when the disk is highly 
 utilized and many shuffle files exist on the machine (I'm not sure how severe 
 this is in 1.0 onward -- since shuffle files are automatically deleted, this 
 may be less of an issue because there are fewer old files sitting around).  
 In experiments I did, in extreme cases, adding the time to open files can 
 increase the shuffle write time from 5ms (of a 2 second task) to 1 second.  
 We should fix this for better performance debugging.
 Thanks [~shivaram] for helping to diagnose this problem.  cc [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3571) Spark standalone cluster mode doesn't work.

2014-09-17 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3571:
-

 Summary: Spark standalone cluster mode doesn't work.
 Key: SPARK-3571
 URL: https://issues.apache.org/jira/browse/SPARK-3571
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Priority: Blocker


Recent changes to Master.scala cause Spark standalone cluster mode to stop working.

I think the loop in Master#schedule never assigns a worker to a driver: because 
startPos is set to curPos right after curPos is advanced, the while condition 
below is false on the very first check, so the loop body never runs.

{code}
for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
  // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
  // start from the last worker that was assigned a driver, and continue onwards until we have
  // explored all alive workers.
  curPos = (curPos + 1) % aliveWorkerNum
  val startPos = curPos
  var launched = false
  while (curPos != startPos && !launched) {
    val worker = shuffledAliveWorkers(curPos)
    if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
      launchDriver(worker, driver)
      waitingDrivers -= driver
      launched = true
    }
    curPos = (curPos + 1) % aliveWorkerNum
  }

{code}
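A small self-contained snippet that reproduces just the loop-control logic above 
and shows why the body is never entered; variable names mirror the original, and 
the worker selection itself is omitted.

{code}
object ScheduleLoopDemo extends App {
  val aliveWorkerNum = 4
  var curPos = 0

  // Mirrors the original: advance first, then remember the start position.
  curPos = (curPos + 1) % aliveWorkerNum
  val startPos = curPos

  var visited = 0
  while (curPos != startPos) {          // false immediately: curPos == startPos
    curPos = (curPos + 1) % aliveWorkerNum
    visited += 1
  }

  println(s"workers examined: $visited") // prints 0 - no worker is ever considered
}
{code}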



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3571) Spark standalone cluster mode doesn't work.

2014-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137884#comment-14137884
 ] 

Apache Spark commented on SPARK-3571:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2436

 Spark standalone cluster mode doesn't work.
 ---

 Key: SPARK-3571
 URL: https://issues.apache.org/jira/browse/SPARK-3571
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Priority: Blocker

 Recent changes to Master.scala cause Spark standalone cluster mode to stop 
 working.
 I think the loop in Master#schedule never assigns a worker to a driver.
 {code}
 for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
   // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
   // start from the last worker that was assigned a driver, and continue onwards until we have
   // explored all alive workers.
   curPos = (curPos + 1) % aliveWorkerNum
   val startPos = curPos
   var launched = false
   while (curPos != startPos && !launched) {
     val worker = shuffledAliveWorkers(curPos)
     if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
       launchDriver(worker, driver)
       waitingDrivers -= driver
       launched = true
     }
     curPos = (curPos + 1) % aliveWorkerNum
   }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3572) Support register UserType in SQL

2014-09-17 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3572:


 Summary: Support register UserType in SQL
 Key: SPARK-3572
 URL: https://issues.apache.org/jira/browse/SPARK-3572
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Xiangrui Meng


If a user knows how to map a class to a struct type in Spark SQL, they should be 
able to register this mapping through sqlContext, so that SQL can figure out 
the schema automatically.

{code}
trait RowSerializer[T] {
  def dataType: StructType
  def serialize(obj: T): Row
  def deserialize(row: Row): T
}

sqlContext.registerUserType[T](clazz: Class[T], serializer: Class[RowSerializer[T]])
{code}

In sqlContext, we can maintain a class-to-serializer map and use it for 
conversion. The serializer class can be embedded into the metadata, so when 
`select` is called, we know we want to deserialize the result.

{code}
sqlContext.registerUserType(classOf[Vector], classOf[VectorRowSerializer])
val points: RDD[LabeledPoint] = ...
val features: RDD[Vector] = points.select('features).map { case Row(v: Vector) => v }
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3573) Dataset

2014-09-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3573:
-
Shepherd: Michael Armbrust

 Dataset
 ---

 Key: SPARK-3573
 URL: https://issues.apache.org/jira/browse/SPARK-3573
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Priority: Critical

 This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra 
 ML-specific metadata embedded in its schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3574) Shuffle finish time always reported as -1

2014-09-17 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-3574:
-

 Summary: Shuffle finish time always reported as -1
 Key: SPARK-3574
 URL: https://issues.apache.org/jira/browse/SPARK-3574
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Kay Ousterhout


shuffleFinishTime is always reported as -1.  I think the way we fix this should 
be to set the shuffleFinishTime in each ShuffleWriteMetrics as the shuffles 
finish, but when aggregating the metrics, only report shuffleFinishTime as 
something other than -1 when *all* of the shuffles have completed.

[~sandyr], it looks like this was introduced in your recent patch to 
incrementally report metrics.  Any chance you can fix this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms

2014-09-17 Thread Shuo Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuo Xiang updated SPARK-3568:
--
Description: 
Include widely-used metrics for ranking algorithms, including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 

This implementation attempts to create a new class called `RankingMetrics` 
under `org.apache.spark.mllib.evaluation`, which accepts input (prediction and 
label pairs) as `RDD[Array[Double], Array[Double]]`. Methods of 
`meanAveragePrecision`, `topKPrecision` and `ndcg` will be included.

  was:
Include widely-used metrics for ranking algorithms, including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 


 Add metrics for ranking algorithms
 --

 Key: SPARK-3568
 URL: https://issues.apache.org/jira/browse/SPARK-3568
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Shuo Xiang
Assignee: Shuo Xiang
Priority: Minor

 Include widely-used metrics for ranking algorithms, including:
  - Mean Average Precision
  - Precision@n: top-n precision
  - Discounted cumulative gain (DCG) and NDCG 
 This implementation attempts to create a new class called `RankingMetrics` 
 under `org.apache.spark.mllib.evaluation`, which accepts input (prediction 
 and label pairs) as `RDD[Array[Double], Array[Double]]`. Methods of 
 `meanAveragePrecision`, `topKPrecision` and `ndcg` will be included.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms

2014-09-17 Thread Shuo Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuo Xiang updated SPARK-3568:
--
Description: 
Include widely-used metrics for ranking algorithms, including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 

This implementation attempts to create a new class called *RankingMetrics* 
under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and 
label pairs) as *RDD[Array[Double], Array[Double]]*. Methods of 
*meanAveragePrecision*, *topKPrecision* and *ndcg* will be included.

  was:
Include widely-used metrics for ranking algorithms, including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 

This implementation attempts to create a new class called `RankingMetrics` 
under `org.apache.spark.mllib.evaluation`, which accepts input (prediction and 
label pairs) as `RDD[Array[Double], Array[Double]]`. Methods of 
`meanAveragePrecision`, `topKPrecision` and `ndcg` will be included.


 Add metrics for ranking algorithms
 --

 Key: SPARK-3568
 URL: https://issues.apache.org/jira/browse/SPARK-3568
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Shuo Xiang
Assignee: Shuo Xiang
Priority: Minor

 Include widely-used metrics for ranking algorithms, including:
  - Mean Average Precision
  - Precision@n: top-n precision
  - Discounted cumulative gain (DCG) and NDCG 
 This implementation attempts to create a new class called *RankingMetrics* 
 under *org.apache.spark.mllib.evaluation*, which accepts input (prediction 
 and label pairs) as *RDD[Array[Double], Array[Double]]*. Methods of 
 *meanAveragePrecision*, *topKPrecision* and *ndcg* will be included.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms

2014-09-17 Thread Shuo Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuo Xiang updated SPARK-3568:
--
Description: 
Include widely-used metrics for ranking algorithms, including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 

This implementation attempts to create a new class called *RankingMetrics* 
under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and 
label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will 
be implemented:

 - *averagePrecision(val position: Int): Double*: this is the precision@n
 - *meanAveragePrecision*: the average of precision@n for all values of n
 - *ndcg*



  was:
Include widely-used metrics for ranking algorithms, including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 

This implementation attempts to create a new class called *RankingMetrics* 
under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and 
label pairs) as *RDD[Array[Double], Array[Double]]*. Methods of 
*meanAveragePrecision*, *topKPrecision* and *ndcg* will be included.


 Add metrics for ranking algorithms
 --

 Key: SPARK-3568
 URL: https://issues.apache.org/jira/browse/SPARK-3568
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Shuo Xiang
Assignee: Shuo Xiang
Priority: Minor

 Include widely-used metrics for ranking algorithms, including:
  - Mean Average Precision
  - Precision@n: top-n precision
  - Discounted cumulative gain (DCG) and NDCG 
 This implementation attempts to create a new class called *RankingMetrics* 
 under *org.apache.spark.mllib.evaluation*, which accepts input (prediction 
 and label pairs) as *RDD[Array[Double], Array[Double]]*. The following 
 methods will be implemented:
  - *averagePrecision(val position: Int): Double*: this is the precision@n
  - *meanAveragePrecision*: the average of precision@n for all values of n
  - *ndcg*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms

2014-09-17 Thread Shuo Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuo Xiang updated SPARK-3568:
--
Description: 
Include widely-used metrics for ranking algorithms, including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 

This implementation attempts to create a new class called *RankingMetrics* 
under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and 
label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will 
be implemented:

 - *averagePrecision(val position: Int): Double*: this is the precision@position
 - *meanAveragePrecision*: the average of precision@n for all values of n
 - *ndcg*



  was:
Include widely-used metrics for ranking algorithms, including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG 

This implementation attempts to create a new class called *RankingMetrics* 
under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and 
label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will 
be implemented:

 - *averagePrecision(val position: Int): Double*: this is the precision@n
 - *meanAveragePrecision*: the average of precision@n for all values of n
 - *ndcg*




 Add metrics for ranking algorithms
 --

 Key: SPARK-3568
 URL: https://issues.apache.org/jira/browse/SPARK-3568
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Shuo Xiang
Assignee: Shuo Xiang
Priority: Minor

 Include widely-used metrics for ranking algorithms, including:
  - Mean Average Precision
  - Precision@n: top-n precision
  - Discounted cumulative gain (DCG) and NDCG 
 This implementation attempts to create a new class called *RankingMetrics* 
 under *org.apache.spark.mllib.evaluation*, which accepts input (prediction 
 and label pairs) as *RDD[Array[Double], Array[Double]]*. The following 
 methods will be implemented:
  - *averagePrecision(val position: Int): Double*: this is the precision@position
  - *meanAveragePrecision*: the average of precision@n for all values of n
  - *ndcg*
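
For illustration, a minimal sketch of what such a class could look like (hypothetical signatures; the input is really an RDD of (prediction, label) array pairs, and the final implementation may differ):

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])])
  extends Serializable {

  // precision@position, averaged over all queries: the fraction of the top
  // `position` predicted items that appear in the ground-truth label set.
  def averagePrecision(position: Int): Double = {
    predictionAndLabels.map { case (pred, labels) =>
      val labelSet = labels.toSet
      pred.take(position).count(labelSet.contains).toDouble / position
    }.mean()
  }
}
{code}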



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3569) Add metadata field to StructField

2014-09-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3569:
-
Description: 
Want to add a metadata field to StructField that can be used by other 
applications like ML to embed more information about the column.

{code}
case class StructField(name: String, dataType: DataType, nullable: Boolean, metadata: Map[String, Any] = Map.empty)
{code}

For ML, we can store feature information like categorical/continuous, the number of categories, category-to-index map, etc.

One question is how to carry over the metadata in query execution. For example:

{code}
val features = schemaRDD.select('features)
val featuresDesc = features.schema("features").metadata
{code}

  was:
Want to add a metadata field to StructField that can be used by other 
applications like ML to embed more information about the column.

{code}
case class StructField(name: String, dataType: DataType, nullable: Boolean, metadata: Map[String, Any] = Map.empty)
{code}

For ML, we can store feature information like categorical/continuous, the number of categories, category-to-index map, etc.



 Add metadata field to StructField
 -

 Key: SPARK-3569
 URL: https://issues.apache.org/jira/browse/SPARK-3569
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib, SQL
Reporter: Xiangrui Meng

 Want to add a metadata field to StructField that can be used by other 
 applications like ML to embed more information about the column.
 {code}
 case class StructField(name: String, dataType: DataType, nullable: Boolean, metadata: Map[String, Any] = Map.empty)
 {code}
 For ML, we can store feature information like categorical/continuous, the number of categories, category-to-index map, etc.
 One question is how to carry over the metadata in query execution. For 
 example:
 {code}
 val features = schemaRDD.select('features)
 val featuresDesc = features.schema("features").metadata
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3051) Support looking-up named accumulators in a registry

2014-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138095#comment-14138095
 ] 

Apache Spark commented on SPARK-3051:
-

User 'nfergu' has created a pull request for this issue:
https://github.com/apache/spark/pull/2438

 Support looking-up named accumulators in a registry
 ---

 Key: SPARK-3051
 URL: https://issues.apache.org/jira/browse/SPARK-3051
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Neil Ferguson

 This is a proposed enhancement to Spark based on the following mailing list 
 discussion: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/quot-Dynamic-variables-quot-in-Spark-td7450.html.
 This proposal builds on SPARK-2380 (Support displaying accumulator values in 
 the web UI) to allow named accumulables to be looked up in a registry, as 
 opposed to having to be passed to every method that needs to access them.
 The use case was described well by [~shivaram], as follows:
 Let's say you have two functions you use 
 in a map call and want to measure how much time each of them takes. For 
 example, if you have a code block like the one below and you want to 
 measure how much time f1 takes as a fraction of the task. 
 {noformat}
 a.map { l => 
val f = f1(l) 
... some work here ... 
 } 
 {noformat}
 It would be really cool if we could do something like 
 {noformat}
 a.map { l => 
val start = System.nanoTime 
val f = f1(l) 
    TaskMetrics.get("f1-time").add(System.nanoTime - start) 
 } 
 {noformat}
 SPARK-2380 provides a partial solution to this problem -- however the 
 accumulables would still need to be passed to every function that needs them, 
 which I think would be cumbersome in any application of reasonable complexity.
  The proposal, as suggested by [~pwendell], is to have a registry of 
  accumulables that can be looked up by name. 
 Regarding the implementation details, I'd propose that we broadcast a 
 serialized version of all named accumulables in the DAGScheduler (similar to 
 what SPARK-2521 does for Tasks). These can then be deserialized in the 
 Executor. 
 Accumulables are already stored in thread-local variables in the Accumulators 
 object, so exposing these in the registry should be simply a matter of 
 wrapping this object, and keying the accumulables by name (they are currently 
 keyed by ID).
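
For illustration only, the kind of name-keyed registry being proposed might look roughly like this (entirely hypothetical API; the object and method names are made up for the sketch):

{code}
import java.util.concurrent.ConcurrentHashMap

import org.apache.spark.Accumulator

// Hypothetical thread-safe registry keyed by accumulator name.
object AccumulatorRegistry {
  private val byName = new ConcurrentHashMap[String, Accumulator[_]]()

  def register(name: String, acc: Accumulator[_]): Unit = byName.put(name, acc)

  def get(name: String): Option[Accumulator[_]] = Option(byName.get(name))
}
{code}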



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3161) Cache example-node map for DecisionTree training

2014-09-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3161:
-
Description: 
Improvement: worker computation

When training each level of a DecisionTree, each example needs to be mapped to 
a node in the current level (or to none if it does not reach that level).  This 
is currently done via the function predictNodeIndex(), which traces from the 
current tree’s root node to the given level.

Proposal: Cache this mapping.
* Pro: O(1) lookup instead of O(level).
* Con: Extra RDD which must share the same partitioning as the training data.

Design:
* (option 1) This could be done as in [Sequoia Forests | 
https://github.com/AlpineNow/SparkML2] where each instance is stored with an 
array of node indices (1 node per tree).
* (option 2) This could also be done by storing an RDD\[Array\[Map\[Int, 
Array\[TreePoint\]\]\]\], where each partition stores an array of maps from 
node indices to an array of instances.  This has more overhead in data 
structures but could be more efficient since not all nodes are split on each 
iteration.

  was:
Improvement: worker computation

When training each level of a DecisionTree, each example needs to be mapped to 
a node in the current level (or to none if it does not reach that level).  This 
is currently done via the function predictNodeIndex(), which traces from the 
current tree’s root node to the given level.

Proposal: Cache this mapping.
* Pro: O(1) lookup instead of O(level).
* Con: Extra RDD which must share the same partitioning as the training data.



 Cache example-node map for DecisionTree training
 

 Key: SPARK-3161
 URL: https://issues.apache.org/jira/browse/SPARK-3161
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor

 Improvement: worker computation
 When training each level of a DecisionTree, each example needs to be mapped 
 to a node in the current level (or to none if it does not reach that level).  
 This is currently done via the function predictNodeIndex(), which traces from 
 the current tree’s root node to the given level.
 Proposal: Cache this mapping.
 * Pro: O(1) lookup instead of O(level).
 * Con: Extra RDD which must share the same partitioning as the training data.
 Design:
 * (option 1) This could be done as in [Sequoia Forests | 
 https://github.com/AlpineNow/SparkML2] where each instance is stored with an 
 array of node indices (1 node per tree).
 * (option 2) This could also be done by storing an RDD\[Array\[Map\[Int, 
 Array\[TreePoint\]\]\]\], where each partition stores an array of maps from 
 node indices to an array of instances.  This has more overhead in data 
 structures but could be more efficient since not all nodes are split on each 
 iteration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3563) Shuffle data not always be cleaned

2014-09-17 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138100#comment-14138100
 ] 

Patrick Wendell commented on SPARK-3563:


Eventually the references should be garbage collected... what happens if you 
manually trigger GC? Does it work? It's also possible there is a bug somewhere, 
or that your program is storing other references to created RDDs.

 Shuffle data not always be cleaned
 --

 Key: SPARK-3563
 URL: https://issues.apache.org/jira/browse/SPARK-3563
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.2
Reporter: shenhong

 In our cluster, when we run a Spark Streaming job, after running for many 
 hours the shuffle data does not seem to be fully cleaned; here is the shuffle data:
 -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
 -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
 -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
 -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
 -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
 -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
 -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
 -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
 -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
 -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
 -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
 -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
 -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1
 The shuffle data of stage 63, 65, 84, 132... are not cleaned.
 ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and 
 Broadcast of interest, to be processed when the associated object goes out of 
 scope of the application. Actual cleanup is performed in a separate daemon 
 thread. There must be some lingering reference to the ShuffleDependency, and 
 it's hard to track down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3565) make code consistent with document

2014-09-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3565:
---
Component/s: (was: core)
 Spark Core

 make code consistent with document
 --

 Key: SPARK-3565
 URL: https://issues.apache.org/jira/browse/SPARK-3565
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: WangTaoTheTonic
Priority: Minor

 The configuration item representing the default number of retries in binding to a 
 port is spark.ports.maxRetries in the code, while the document configuration.md 
 uses spark.port.maxRetries. We need to make them consistent.
  
 In org.apache.spark.util.Utils.scala:
   /**
    * Default number of retries in binding to a port.
    */
   val portMaxRetries: Int = {
     if (sys.props.contains("spark.testing")) {
       // Set a higher number of retries for tests...
       sys.props.get("spark.port.maxRetries").map(_.toInt).getOrElse(100)
     } else {
       Option(SparkEnv.get)
         .flatMap(_.conf.getOption("spark.port.maxRetries"))
         .map(_.toInt)
         .getOrElse(16)
     }
   }
 In configuration.md:
 <tr>
   <td><code>spark.port.maxRetries</code></td>
   <td>16</td>
   <td>
     Maximum number of retries when binding to a port before giving up.
   </td>
 </tr>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3565) make code consistent with document

2014-09-17 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138101#comment-14138101
 ] 

Patrick Wendell commented on SPARK-3565:


Please use the "Spark Core" component and not "core".

 make code consistent with document
 --

 Key: SPARK-3565
 URL: https://issues.apache.org/jira/browse/SPARK-3565
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: WangTaoTheTonic
Priority: Minor

 The configuration item representing the default number of retries in binding to a 
 port is spark.ports.maxRetries in the code, while the document configuration.md 
 uses spark.port.maxRetries. We need to make them consistent.
  
 In org.apache.spark.util.Utils.scala:
   /**
    * Default number of retries in binding to a port.
    */
   val portMaxRetries: Int = {
     if (sys.props.contains("spark.testing")) {
       // Set a higher number of retries for tests...
       sys.props.get("spark.port.maxRetries").map(_.toInt).getOrElse(100)
     } else {
       Option(SparkEnv.get)
         .flatMap(_.conf.getOption("spark.port.maxRetries"))
         .map(_.toInt)
         .getOrElse(16)
     }
   }
 In configuration.md:
 <tr>
   <td><code>spark.port.maxRetries</code></td>
   <td>16</td>
   <td>
     Maximum number of retries when binding to a port before giving up.
   </td>
 </tr>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3161) Cache example-node map for DecisionTree training

2014-09-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3161:
-
Description: 
Improvement: worker computation

When training each level of a DecisionTree, each example needs to be mapped to 
a node in the current level (or to none if it does not reach that level).  This 
is currently done via the function predictNodeIndex(), which traces from the 
current tree’s root node to the given level.

Proposal: Cache this mapping.
* Pro: O(1) lookup instead of O(level).
* Con: Extra RDD which must share the same partitioning as the training data.

Design:
* (option 1) This could be done as in [Sequoia Forests | 
https://github.com/AlpineNow/SparkML2] where each instance is stored with an 
array of node indices (1 node per tree).
* (option 2) This could also be done by storing an RDD\[Array\[Map\[Int, 
Array\[TreePoint\]\]\]\], where each partition stores an array of maps from 
node indices to an array of instances.  This has more overhead in data 
structures but could be more efficient: not all nodes are split on each 
iteration, and this would allow each executor to ignore instances which are not 
used for the current node set.

  was:
Improvement: worker computation

When training each level of a DecisionTree, each example needs to be mapped to 
a node in the current level (or to none if it does not reach that level).  This 
is currently done via the function predictNodeIndex(), which traces from the 
current tree’s root node to the given level.

Proposal: Cache this mapping.
* Pro: O(1) lookup instead of O(level).
* Con: Extra RDD which must share the same partitioning as the training data.

Design:
* (option 1) This could be done as in [Sequoia Forests | 
https://github.com/AlpineNow/SparkML2] where each instance is stored with an 
array of node indices (1 node per tree).
* (option 2) This could also be done by storing an RDD\[Array\[Map\[Int, 
Array\[TreePoint\]\]\]\], where each partition stores an array of maps from 
node indices to an array of instances.  This has more overhead in data 
structures but could be more efficient since not all nodes are split on each 
iteration.


 Cache example-node map for DecisionTree training
 

 Key: SPARK-3161
 URL: https://issues.apache.org/jira/browse/SPARK-3161
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor

 Improvement: worker computation
 When training each level of a DecisionTree, each example needs to be mapped 
 to a node in the current level (or to none if it does not reach that level).  
 This is currently done via the function predictNodeIndex(), which traces from 
 the current tree’s root node to the given level.
 Proposal: Cache this mapping.
 * Pro: O(1) lookup instead of O(level).
 * Con: Extra RDD which must share the same partitioning as the training data.
 Design:
 * (option 1) This could be done as in [Sequoia Forests | 
 https://github.com/AlpineNow/SparkML2] where each instance is stored with an 
 array of node indices (1 node per tree).
 * (option 2) This could also be done by storing an RDD\[Array\[Map\[Int, 
 Array\[TreePoint\]\]\]\], where each partition stores an array of maps from 
 node indices to an array of instances.  This has more overhead in data 
 structures but could be more efficient: not all nodes are split on each 
 iteration, and this would allow each executor to ignore instances which are 
 not used for the current node set.
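
For concreteness, option 2's per-partition structure can be written out as a type sketch (TreePoint below is just a stand-in for MLlib's internal binned representation of a training instance):

{code}
object NodeMapSketch {
  // Stand-in for MLlib's internal binned representation of a training instance.
  type TreePoint = AnyRef

  // One entry per tree; each map goes from node index to the instances currently
  // assigned to that node within this partition.
  type NodeToInstances = Array[Map[Int, Array[TreePoint]]]
}
{code}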



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-901) UISuite jetty port increases under contention fails if startPort is in use

2014-09-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-901.
---
Resolution: Fixed

This is fixed by SPARK-3555 since we no longer choose a specific starting port.

 UISuite jetty port increases under contention fails if startPort is in use
 

 Key: SPARK-901
 URL: https://issues.apache.org/jira/browse/SPARK-901
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Core, Web UI
Affects Versions: 0.8.0
Reporter: Mark Hamstra
Priority: Minor

 Recent change of startPort to 3030 conflicts with IANA assignment for 
 arepa-cas.  If 3030 is already in use, the UISuite fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1739) Close PR's after 30 days of inactivity

2014-09-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1739.

Resolution: Won't Fix

We've introduced a different mechanism for manual closing, so I don't think 
this is necessary.

 Close PR's after 30 days of inactivity
 --

 Key: SPARK-1739
 URL: https://issues.apache.org/jira/browse/SPARK-1739
 Project: Spark
  Issue Type: Task
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.2.0


 Sometimes PR's get abandoned if people aren't responsive to feedback or it 
 just falls to a lower priority. We should automatically close stale PR's in 
 order to keep the queue from growing infinitely.
 I think we just want to do this with a friendly message that says "This seems 
 inactive; please re-open this if you are interested in contributing the 
 patch."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet

2014-09-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3575:

Component/s: SQL

 Hive Schema is ignored when using convertMetastoreParquet
 -

 Key: SPARK-3575
 URL: https://issues.apache.org/jira/browse/SPARK-3575
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian

 This can cause problems when, for example, one of the columns is defined as 
 TINYINT.  A class cast exception will be thrown since the Parquet table scan 
 produces INTs while the rest of the execution expects bytes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet

2014-09-17 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3575:
---

 Summary: Hive Schema is ignored when using convertMetastoreParquet
 Key: SPARK-3575
 URL: https://issues.apache.org/jira/browse/SPARK-3575
 Project: Spark
  Issue Type: Bug
Reporter: Michael Armbrust
Assignee: Cheng Lian


This can cause problems when, for example, one of the columns is defined as 
TINYINT.  A class cast exception will be thrown since the Parquet table scan 
produces INTs while the rest of the execution expects bytes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet

2014-09-17 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3575:

Target Version/s: 1.2.0

 Hive Schema is ignored when using convertMetastoreParquet
 -

 Key: SPARK-3575
 URL: https://issues.apache.org/jira/browse/SPARK-3575
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian

 This can cause problems when, for example, one of the columns is defined as 
 TINYINT.  A class cast exception will be thrown since the Parquet table scan 
 produces INTs while the rest of the execution expects bytes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-17 Thread Sean McNamara (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138130#comment-14138130
 ] 

Sean McNamara edited comment on SPARK-3561 at 9/17/14 10:45 PM:


We have some workload use-cases and this would potentially be very helpful in 
lieu of SPARK-3174.


was (Author: seanmcn):
We have some workload use-cases and would potentially be very helpful in lieu 
of SPARK-3174.

 Native Hadoop/YARN integration for batch/ETL workloads
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide a custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well
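
A rough sketch of what such a trait might look like, based only on the four operations listed above (hypothetical, simplified signatures; the actual proposal's parameter lists would mirror SparkContext's):

{code}
import scala.reflect.ClassTag

import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical gateway/delegate to the Hadoop execution environment.
trait JobExecutionContext {
  def hadoopFile[K: ClassTag, V: ClassTag](sc: SparkContext, path: String): RDD[(K, V)]
  def newAPIHadoopFile[K: ClassTag, V: ClassTag](sc: SparkContext, path: String): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](rdd: RDD[T], func: (TaskContext, Iterator[T]) => U): Array[U]
}
{code}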



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-09-17 Thread Sean McNamara (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138130#comment-14138130
 ] 

Sean McNamara edited comment on SPARK-3561 at 9/17/14 10:47 PM:


We have some workload use-cases and this would potentially be very helpful, 
especially in lieu of SPARK-3174.


was (Author: seanmcn):
We have some workload use-cases and this would potentially be very helpful in 
lieu of SPARK-3174.

 Native Hadoop/YARN integration for batch/ETL workloads
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define only 4 operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide a custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 from DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3576) Provide script for creating the Spark AMI from scratch

2014-09-17 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3576:
--

 Summary: Provide script for creating the Spark AMI from scratch
 Key: SPARK-3576
 URL: https://issues.apache.org/jira/browse/SPARK-3576
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Patrick Wendell
Assignee: Patrick Wendell






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3577) Shuffle write time incorrect for sort-based shuffle

2014-09-17 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-3577:
-

 Summary: Shuffle write time incorrect for sort-based shuffle
 Key: SPARK-3577
 URL: https://issues.apache.org/jira/browse/SPARK-3577
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Kay Ousterhout


After this change 
https://github.com/apache/spark/commit/4e982364426c7d65032e8006c63ca4f9a0d40470 
(cc [~sandyr] [~pwendell]) the ExternalSorter passes its own 
ShuffleWriteMetrics into ExternalSorter.  The write time recorded in those 
metrics is never used -- meaning that when someone is using sort-based shuffle, 
the shuffle write time will be recorded as 0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3270) Spark API for Application Extensions

2014-09-17 Thread Michal Malohlava (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michal Malohlava updated SPARK-3270:

Description: 
Any application should be able to enrich the Spark infrastructure with services 
that are not available by default.  

Hence, to support such application extensions (aka extensions/plugins), the Spark 
platform should provide:
  - an API to register an extension 
  - an API to register a service (meaning provided functionality)
  - well-defined points in the Spark infrastructure which can be enriched/hooked by an 
extension
  - a way of deploying an extension (for example, simply putting the extension on the 
classpath and using the Java service interface)
  - a way to access an extension from the application

Overall proposal is available here: 
https://docs.google.com/document/d/1dHF9zi7GzFbYnbV2PwaOQ2eLPoTeiN9IogUe4PAOtrQ/edit?usp=sharing

Note: In this context, I do not mean reinventing OSGi (or another plugin 
platform) but it can serve as a good starting point.


  was:
At the beginning, let's clarify my motivation - I would like to extend the Spark 
platform by an embedded application (e.g., monitoring network performance in 
the context of selected applications) which will be launched on particular 
nodes in cluster with their launch.
Nevertheless, I do not want to modify Spark code directly and hardcode my code 
in, but I would prefer to provide a jar which would be registered and launched 
by Spark itself. 

Hence, to support such 3rd party applications (aka extensions/plugins) the Spark 
platform should provide at least:
  - an API to register an extension 
  - an API to register a service (meaning provided functionality)
  - well-defined points in Spark infrastructure which can be enriched/hooked by 
extension
 - in master/worker lifecycle
 - in applications lifecycle
 - in RDDs lifecycle
 - monitoring/reporting
 - ...
  - a way of deploying extension (for example, simply putting the extension on 
classpath and using Java service interface)

In this context, I do not mean reinventing OSGi (or another plugin platform) 
but it can serve as a good starting point.




 Spark API for Application Extensions
 

 Key: SPARK-3270
 URL: https://issues.apache.org/jira/browse/SPARK-3270
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Michal Malohlava

 Any application should be able to enrich the Spark infrastructure with services 
 that are not available by default.  
 Hence, to support such application extensions (aka extensions/plugins), the 
 Spark platform should provide:
   - an API to register an extension 
   - an API to register a service (meaning provided functionality)
   - well-defined points in the Spark infrastructure which can be enriched/hooked 
  by an extension
   - a way of deploying an extension (for example, simply putting the extension 
  on the classpath and using the Java service interface)
   - a way to access an extension from the application
 Overall proposal is available here: 
 https://docs.google.com/document/d/1dHF9zi7GzFbYnbV2PwaOQ2eLPoTeiN9IogUe4PAOtrQ/edit?usp=sharing
 Note: In this context, I do not mean reinventing OSGi (or another plugin 
 platform) but it can serve as a good starting point.
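
As a sketch of the "Java service interface" deployment idea mentioned above (hypothetical contract; nothing like this exists in Spark today):

{code}
import java.util.ServiceLoader

import scala.collection.JavaConverters._

// Hypothetical extension contract an application jar would implement.
trait SparkExtension {
  def name: String
  def start(): Unit
}

object ExtensionLoader {
  // Discover implementations declared under META-INF/services on the classpath.
  def loadAll(): Seq[SparkExtension] =
    ServiceLoader.load(classOf[SparkExtension]).asScala.toSeq
}
{code}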



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3577) Shuffle write time incorrect for sort-based shuffle

2014-09-17 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-3577:
--
 Description: 
After this change 
https://github.com/apache/spark/commit/4e982364426c7d65032e8006c63ca4f9a0d40470 
(cc [~sandyr] [~pwendell]) the ExternalSorter passes its own 
ShuffleWriteMetrics into ExternalSorter.  The write time recorded in those 
metrics is never used -- meaning that when someone is using sort-based shuffle, 
the shuffle write time will be recorded as 0.

[~sandyr] do you have time to take a look at this?

  was:After this change 
https://github.com/apache/spark/commit/4e982364426c7d65032e8006c63ca4f9a0d40470 
(cc [~sandyr] [~pwendell]) the ExternalSorter passes its own 
ShuffleWriteMetrics into ExternalSorter.  The write time recorded in those 
metrics is never used -- meaning that when someone is using sort-based shuffle, 
the shuffle write time will be recorded as 0.

Priority: Blocker  (was: Major)
Target Version/s: 1.2.0

 Shuffle write time incorrect for sort-based shuffle
 ---

 Key: SPARK-3577
 URL: https://issues.apache.org/jira/browse/SPARK-3577
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Kay Ousterhout
Priority: Blocker

 After this change 
 https://github.com/apache/spark/commit/4e982364426c7d65032e8006c63ca4f9a0d40470
  (cc [~sandyr] [~pwendell]) the ExternalSorter passes its own 
 ShuffleWriteMetrics into ExternalSorter.  The write time recorded in those 
 metrics is never used -- meaning that when someone is using sort-based 
 shuffle, the shuffle write time will be recorded as 0.
 [~sandyr] do you have time to take a look at this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3578) GraphGenerators.sampleLogNormal sometimes returns too-large result

2014-09-17 Thread Ankur Dave (JIRA)
Ankur Dave created SPARK-3578:
-

 Summary: GraphGenerators.sampleLogNormal sometimes returns 
too-large result
 Key: SPARK-3578
 URL: https://issues.apache.org/jira/browse/SPARK-3578
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Ankur Dave
Assignee: Ankur Dave
Priority: Minor


GraphGenerators.sampleLogNormal is supposed to return an integer strictly less 
than maxVal. However, it violates this guarantee. It generates its return value 
as follows:

{code}
var X: Double = maxVal

while (X >= maxVal) {
  val Z = rand.nextGaussian()
  X = math.exp(mu + sigma*Z)
}
math.round(X.toFloat)
{code}

When X is sampled to be close to (but less than) maxVal, then it will pass the 
while loop condition, but the rounded result will be equal to maxVal, which 
will fail the test.

For example, if maxVal is 5 and X is 4.9, then X < maxVal, but 
math.round(X.toFloat) is 5.

A solution is to truncate X instead of rounding it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3578) GraphGenerators.sampleLogNormal sometimes returns too-large result

2014-09-17 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-3578:
--
Description: 
GraphGenerators.sampleLogNormal is supposed to return an integer strictly less 
than maxVal. However, it violates this guarantee. It generates its return value 
as follows:

{code}
var X: Double = maxVal

while (X >= maxVal) {
  val Z = rand.nextGaussian()
  X = math.exp(mu + sigma*Z)
}
math.round(X.toFloat)
{code}

When X is sampled to be close to (but less than) maxVal, then it will pass the 
while loop condition, but the rounded result will be equal to maxVal, which 
will fail the test.

For example, if maxVal is 5 and X is 4.9, then X < maxVal, but 
math.round(X.toFloat) is 5.

A solution is to round X down instead of to the nearest integer.

  was:
GraphGenerators.sampleLogNormal is supposed to return an integer strictly less 
than maxVal. However, it violates this guarantee. It generates its return value 
as follows:

{code}
var X: Double = maxVal

while (X >= maxVal) {
  val Z = rand.nextGaussian()
  X = math.exp(mu + sigma*Z)
}
math.round(X.toFloat)
{code}

When X is sampled to be close to (but less than) maxVal, then it will pass the 
while loop condition, but the rounded result will be equal to maxVal, which 
will fail the test.

For example, if maxVal is 5 and X is 4.9, then X < maxVal, but 
math.round(X.toFloat) is 5.

A solution is to truncate X instead of rounding it.


 GraphGenerators.sampleLogNormal sometimes returns too-large result
 --

 Key: SPARK-3578
 URL: https://issues.apache.org/jira/browse/SPARK-3578
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Ankur Dave
Assignee: Ankur Dave
Priority: Minor

 GraphGenerators.sampleLogNormal is supposed to return an integer strictly 
 less than maxVal. However, it violates this guarantee. It generates its 
 return value as follows:
 {code}
 var X: Double = maxVal
 while (X >= maxVal) {
   val Z = rand.nextGaussian()
   X = math.exp(mu + sigma*Z)
 }
 math.round(X.toFloat)
 {code}
 When X is sampled to be close to (but less than) maxVal, then it will pass 
 the while loop condition, but the rounded result will be equal to maxVal, 
 which will fail the test.
 For example, if maxVal is 5 and X is 4.9, then X < maxVal, but 
 math.round(X.toFloat) is 5.
 A solution is to round X down instead of to the nearest integer.
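
 A minimal sketch of that fix (the merged patch may differ in detail): truncate instead of rounding, so a sample of 4.9 with maxVal = 5 yields 4.
 {code}
 math.floor(X).toInt
 {code}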



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3578) GraphGenerators.sampleLogNormal sometimes returns too-large result

2014-09-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138188#comment-14138188
 ] 

Apache Spark commented on SPARK-3578:
-

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/2439

 GraphGenerators.sampleLogNormal sometimes returns too-large result
 --

 Key: SPARK-3578
 URL: https://issues.apache.org/jira/browse/SPARK-3578
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Ankur Dave
Assignee: Ankur Dave
Priority: Minor

 GraphGenerators.sampleLogNormal is supposed to return an integer strictly 
 less than maxVal. However, it violates this guarantee. It generates its 
 return value as follows:
 {code}
 var X: Double = maxVal
 while (X >= maxVal) {
   val Z = rand.nextGaussian()
   X = math.exp(mu + sigma*Z)
 }
 math.round(X.toFloat)
 {code}
 When X is sampled to be close to (but less than) maxVal, then it will pass 
 the while loop condition, but the rounded result will be equal to maxVal, 
 which will fail the test.
 For example, if maxVal is 5 and X is 4.9, then X < maxVal, but 
 math.round(X.toFloat) is 5.
 A solution is to round X down instead of to the nearest integer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3571) Spark standalone cluster mode doesn't work.

2014-09-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3571.

   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2436
[https://github.com/apache/spark/pull/2436]

 Spark standalone cluster mode doesn't work.
 ---

 Key: SPARK-3571
 URL: https://issues.apache.org/jira/browse/SPARK-3571
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Priority: Blocker
 Fix For: 1.2.0


 Recent changes to Master.scala cause Spark standalone cluster mode to stop 
 working.
 I think the loop in Master#schedule never assigns a worker to a waiting driver.
 {code}
 for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
   // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
   // start from the last worker that was assigned a driver, and continue onwards until we have
   // explored all alive workers.
   curPos = (curPos + 1) % aliveWorkerNum
   val startPos = curPos
   var launched = false
   while (curPos != startPos && !launched) {
     val worker = shuffledAliveWorkers(curPos)
     if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
       launchDriver(worker, driver)
       waitingDrivers -= driver
       launched = true
     }
     curPos = (curPos + 1) % aliveWorkerNum
   }
 {code}
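
 One possible correction, shown here only to make the bug concrete (not necessarily the fix that was merged in the pull request): count visited workers instead of comparing positions, so every alive worker is examined exactly once.
 {code}
 var launched = false
 var numVisited = 0
 while (numVisited < aliveWorkerNum && !launched) {
   val worker = shuffledAliveWorkers(curPos)
   if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
     launchDriver(worker, driver)
     waitingDrivers -= driver
     launched = true
   }
   curPos = (curPos + 1) % aliveWorkerNum
   numVisited += 1
 }
 {code}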



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3571) Spark standalone cluster mode doesn't work.

2014-09-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3571:
-
Assignee: Kousuke Saruta

 Spark standalone cluster mode doesn't work.
 ---

 Key: SPARK-3571
 URL: https://issues.apache.org/jira/browse/SPARK-3571
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Blocker
 Fix For: 1.2.0


 Recent changes to Master.scala cause Spark standalone cluster mode to stop 
 working.
 I think the loop in Master#schedule never assigns a worker to a waiting driver.
 {code}
 for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
   // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
   // start from the last worker that was assigned a driver, and continue onwards until we have
   // explored all alive workers.
   curPos = (curPos + 1) % aliveWorkerNum
   val startPos = curPos
   var launched = false
   while (curPos != startPos && !launched) {
     val worker = shuffledAliveWorkers(curPos)
     if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
       launchDriver(worker, driver)
       waitingDrivers -= driver
       launched = true
     }
     curPos = (curPos + 1) % aliveWorkerNum
   }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3564) Display App ID on HistoryPage

2014-09-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3564:
-
Assignee: Kousuke Saruta

 Display App ID on HistoryPage
 -

 Key: SPARK-3564
 URL: https://issues.apache.org/jira/browse/SPARK-3564
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
 Fix For: 1.1.1, 1.2.0


 The current HistoryPage doesn't display the App ID, so if there are lots of 
 applications with the same name, it's difficult to find the application whose 
 status we'd like to know.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3564) Display App ID on HistoryPage

2014-09-17 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3564.

   Resolution: Fixed
Fix Version/s: 1.1.1
   1.2.0

Issue resolved by pull request 2424
[https://github.com/apache/spark/pull/2424]

 Display App ID on HistoryPage
 -

 Key: SPARK-3564
 URL: https://issues.apache.org/jira/browse/SPARK-3564
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
 Fix For: 1.2.0, 1.1.1


 The current HistoryPage doesn't display the App ID, so if there are lots of 
 applications with the same name, it's difficult to find the application whose 
 status we'd like to know.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3567) appId field in SparkDeploySchedulerBackend should be volatile

2014-09-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3567:
-
Target Version/s: 1.2.0  (was: 1.1.1, 1.2.0)

 appId field in SparkDeploySchedulerBackend should be volatile
 -

 Key: SPARK-3567
 URL: https://issues.apache.org/jira/browse/SPARK-3567
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
 Fix For: 1.2.0


 The appId field in SparkDeploySchedulerBackend is set by 
 AppClient.ClientActor#receiveWithLogging, and appId is read via 
 SparkDeploySchedulerBackend#applicationId.
 The thread that runs AppClient.ClientActor and a thread invoking 
 SparkDeploySchedulerBackend#applicationId can be different threads, so appId 
 should be volatile.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3567) appId field in SparkDeploySchedulerBackend should be volatile

2014-09-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3567:
-
Fix Version/s: 1.2.0

 appId field in SparkDeploySchedulerBackend should be volatile
 -

 Key: SPARK-3567
 URL: https://issues.apache.org/jira/browse/SPARK-3567
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
 Fix For: 1.2.0


 The appId field in SparkDeploySchedulerBackend is set by 
 AppClient.ClientActor#receiveWithLogging, and appId is read via 
 SparkDeploySchedulerBackend#applicationId.
 The thread that runs AppClient.ClientActor and a thread invoking 
 SparkDeploySchedulerBackend#applicationId can be different threads, so appId 
 should be volatile.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3567) appId field in SparkDeploySchedulerBackend should be volatile

2014-09-17 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3567:
-
Assignee: Kousuke Saruta

 appId field in SparkDeploySchedulerBackend should be volatile
 -

 Key: SPARK-3567
 URL: https://issues.apache.org/jira/browse/SPARK-3567
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
 Fix For: 1.2.0


 The appId field in SparkDeploySchedulerBackend is set by 
 AppClient.ClientActor#receiveWithLogging, and appId is read via 
 SparkDeploySchedulerBackend#applicationId.
 The thread that runs AppClient.ClientActor and a thread invoking 
 SparkDeploySchedulerBackend#applicationId can be different threads, so appId 
 should be volatile.
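
 The change being asked for is essentially a one-liner; a sketch (field name as in the description, with the modifier added):
 {code}
 // Written from the AppClient.ClientActor thread, read by callers of applicationId().
 @volatile private var appId: String = _
 {code}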



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3579) Jekyll doc generation is different across environments

2014-09-17 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3579:
--

 Summary: Jekyll doc generation is different across environments
 Key: SPARK-3579
 URL: https://issues.apache.org/jira/browse/SPARK-3579
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Wendell


This can result in a lot of false changes when someone alters something with 
the docs. It is only relevant to the website subversion repo, but the fix might 
need to go into the Spark codebase.

There are at least two issues here. One is that the HTML character escaping can 
be different in certain cases. Another is that the highlighting output seems a 
bit different depending on (I think) what version of pygments is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API

2014-09-17 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138242#comment-14138242
 ] 

Josh Rosen commented on SPARK-2321:
---

I agree that this should be a pull API.  A pull-based API will be easier to 
expose in Python and Java and might be sufficient for many of the common 
use-cases.

With a push-based API, we might have to worry about things like callback 
processing speed, rate-limiting, etc. (the Rx frameworks have interesting ways 
of addressing these concerns, but I'm not sure whether we want to add those 
dependencies); the pull approach won't face these issues.
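
As a rough illustration of the pull style being discussed (trait and class names here are assumptions, not the eventual Spark API):

{code}
// Illustrative pull-based sketch: callers poll for progress on demand instead
// of registering callbacks, so callback processing speed and rate limiting are
// non-issues on the reporting side. Names are assumptions, not the real API.
case class JobProgressSnapshot(jobId: Int, completedTasks: Int, totalTasks: Int) {
  def fraction: Double =
    if (totalTasks == 0) 0.0 else completedTasks.toDouble / totalTasks
}

trait JobProgressSource {
  def activeJobIds: Seq[Int]
  def jobProgress(jobId: Int): Option[JobProgressSnapshot]
}
{code}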

 Design a proper progress reporting & event listener API
 ---

 Key: SPARK-2321
 URL: https://issues.apache.org/jira/browse/SPARK-2321
 Project: Spark
  Issue Type: Improvement
  Components: Java API, Spark Core
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Josh Rosen
Priority: Critical

 This is a ticket to track progress on redesigning the SparkListener and 
 JobProgressListener API.
 There are multiple problems with the current design, including:
 0. I'm not sure if the API is usable in Java (there are at least some enums 
 we used in Scala and a bunch of case classes that might complicate things).
 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of 
 attention to it yet. Something as important as progress reporting deserves a 
 more stable API.
 2. There is no easy way to connect jobs with stages. Similarly, there is no 
 easy way to connect job groups with jobs / stages.
 3. JobProgressListener itself has no encapsulation at all. States can be 
 arbitrarily mutated by external programs. Variable names are sort of randomly 
 decided and inconsistent. 
 We should just revisit these and propose a new, concrete design. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3574) Shuffle finish time always reported as -1

2014-09-17 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138246#comment-14138246
 ] 

Sandy Ryza commented on SPARK-3574:
---

On it

 Shuffle finish time always reported as -1
 -

 Key: SPARK-3574
 URL: https://issues.apache.org/jira/browse/SPARK-3574
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Kay Ousterhout

 shuffleFinishTime is always reported as -1.  I think the way we fix this 
 should be to set the shuffleFinishTime in each ShuffleWriteMetrics as the 
 shuffles finish, but when aggregating the metrics, only report 
 shuffleFinishTime as something other than -1 when *all* of the shuffles have 
 completed.
 [~sandyr], it looks like this was introduced in your recent patch to 
 incrementally report metrics.  Any chance you can fix this?
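
A sketch of the aggregation rule proposed above (the -1 sentinel follows the description; the class and method names are assumptions):

{code}
// Sketch of the proposed rule: each writer records its own finish time, and the
// aggregate only reports a real value once every writer has finished.
object ShuffleFinishTimeSketch {
  case class WriterMetricsSketch(shuffleFinishTime: Long) // -1 means "not finished"

  def aggregatedShuffleFinishTime(perWriter: Seq[WriterMetricsSketch]): Long =
    if (perWriter.nonEmpty && perWriter.forall(_.shuffleFinishTime != -1L))
      perWriter.map(_.shuffleFinishTime).max // all shuffles done: latest finish time
    else
      -1L // at least one shuffle still in flight
}
{code}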



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3577) Shuffle write time incorrect for sort-based shuffle

2014-09-17 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138245#comment-14138245
 ] 

Sandy Ryza commented on SPARK-3577:
---

On it

 Shuffle write time incorrect for sort-based shuffle
 ---

 Key: SPARK-3577
 URL: https://issues.apache.org/jira/browse/SPARK-3577
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Kay Ousterhout
Priority: Blocker

 After this change 
 https://github.com/apache/spark/commit/4e982364426c7d65032e8006c63ca4f9a0d40470
  (cc [~sandyr] [~pwendell]) the ExternalSorter gets its own 
  ShuffleWriteMetrics. The write time recorded in those metrics is never used 
  -- meaning that when someone is using sort-based shuffle, the shuffle write 
  time will be recorded as 0.
 [~sandyr] do you have time to take a look at this?
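
An illustrative sketch of the bug shape (simplified, assumed names, not the actual Spark classes):

{code}
// Simplified sketch: time recorded into a locally created metrics object is
// never read, so the reported shuffle write time stays 0. The fix direction is
// to accumulate into the task-level metrics that reporting actually reads.
class WriteMetricsSketch { var shuffleWriteTime: Long = 0L }

class SorterSketch(taskMetrics: WriteMetricsSketch) {
  private val localMetrics = new WriteMetricsSketch // buggy shape: never read

  def timedWrite(write: () => Unit): Unit = {
    val start = System.nanoTime()
    write()
    val elapsed = System.nanoTime() - start
    localMetrics.shuffleWriteTime += elapsed // lost: nothing reports this object
    taskMetrics.shuffleWriteTime += elapsed  // visible: what the fix should do
  }
}
{code}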



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3574) Shuffle finish time always reported as -1

2014-09-17 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138247#comment-14138247
 ] 

Kay Ousterhout commented on SPARK-3574:
---

Thanks Sandy!!

 Shuffle finish time always reported as -1
 -

 Key: SPARK-3574
 URL: https://issues.apache.org/jira/browse/SPARK-3574
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Kay Ousterhout

 shuffleFinishTime is always reported as -1.  I think the way we fix this 
 should be to set the shuffleFinishTime in each ShuffleWriteMetrics as the 
 shuffles finish, but when aggregating the metrics, only report 
 shuffleFinishTime as something other than -1 when *all* of the shuffles have 
 completed.
 [~sandyr], it looks like this was introduced in your recent patch to 
 incrementally report metrics.  Any chance you can fix this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-17 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138281#comment-14138281
 ] 

Matei Zaharia commented on SPARK-3129:
--

Great, it will be nice to see how fast this is. I also think the rate per node 
doesn't need to be enormous for this to be useful, since we can also 
parallelize receiving over multiple nodes.

 Prevent data loss in Spark Streaming
 

 Key: SPARK-3129
 URL: https://issues.apache.org/jira/browse/SPARK-3129
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan
 Attachments: SecurityFix.diff, StreamingPreventDataLoss.pdf


 Spark Streaming can lose small amounts of data when the driver goes down and 
 the sending system cannot re-send the data (or the data has already expired on 
 the sender side). The attached document has more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3580) Add Consistent Method For Number of RDD Partitions Across Differnet Languages

2014-09-17 Thread Pat McDonough (JIRA)
Pat McDonough created SPARK-3580:


 Summary: Add Consistent Method For Number of RDD Partitions Across 
Differnet Languages
 Key: SPARK-3580
 URL: https://issues.apache.org/jira/browse/SPARK-3580
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Pat McDonough


Programmatically retrieving the number of partitions is not consistent between 
Python and Scala. A consistent method should be defined and made public across 
both languages.

RDD.partitions.size is also used quite frequently throughout the internal code, 
so that might be worth refactoring as well once the new method is available.

What we have today is below.

In Scala:
{code}
scala> someRDD.partitions.size
res0: Int = 30
{code}

In Python:
{code}
In [2]: someRDD.getNumPartitions()
Out[2]: 30
{code}
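
One possible shape for a unified accessor on the Scala side (a sketch only; the name simply mirrors the existing PySpark method and is an assumption, not a decision made in this ticket):

{code}
import org.apache.spark.rdd.RDD

object PartitionCountSketch {
  // Sketch: mirror the PySpark accessor name on the Scala side.
  implicit class RichPartitionCount[T](val rdd: RDD[T]) {
    def getNumPartitions: Int = rdd.partitions.length
  }
}
{code}

With this in scope, someRDD.getNumPartitions would return the same value as someRDD.partitions.size above.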



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3580) Add Consistent Method To Get Number of RDD Partitions Across Differnet Languages

2014-09-17 Thread Pat McDonough (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat McDonough updated SPARK-3580:
-
Summary: Add Consistent Method To Get Number of RDD Partitions Across 
Differnet Languages  (was: Add Consistent Method For Number of RDD Partitions 
Across Differnet Languages)

 Add Consistent Method To Get Number of RDD Partitions Across Differnet 
 Languages
 

 Key: SPARK-3580
 URL: https://issues.apache.org/jira/browse/SPARK-3580
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Pat McDonough

 Programmatically retrieving the number of partitions is not consistent 
 between Python and Scala. A consistent method should be defined and made 
 public across both languages.
 RDD.partitions.size is also used quite frequently throughout the internal 
 code, so that might be worth refactoring as well once the new method is 
 available.
 What we have today is below.
 In Scala:
 {code}
 scala> someRDD.partitions.size
 res0: Int = 30
 {code}
 In Python:
 {code}
 In [2]: someRDD.getNumPartitions()
 Out[2]: 30
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3580) Add Consistent Method To Get Number of RDD Partitions Across Different Languages

2014-09-17 Thread Pat McDonough (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat McDonough updated SPARK-3580:
-
Summary: Add Consistent Method To Get Number of RDD Partitions Across 
Different Languages  (was: Add Consistent Method To Get Number of RDD 
Partitions Across Differnet Languages)

 Add Consistent Method To Get Number of RDD Partitions Across Different 
 Languages
 

 Key: SPARK-3580
 URL: https://issues.apache.org/jira/browse/SPARK-3580
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Pat McDonough

 Programmatically retrieving the number of partitions is not consistent 
 between Python and Scala. A consistent method should be defined and made 
 public across both languages.
 RDD.partitions.size is also used quite frequently throughout the internal 
 code, so that might be worth refactoring as well once the new method is 
 available.
 What we have today is below.
 In Scala:
 {code}
 scala> someRDD.partitions.size
 res0: Int = 30
 {code}
 In Python:
 {code}
 In [2]: someRDD.getNumPartitions()
 Out[2]: 30
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


