[jira] [Updated] (SPARK-2953) Allow using short names for io compression codecs

2014-08-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2953:
---

Issue Type: Improvement  (was: Bug)

 Allow using short names for io compression codecs
 -

 Key: SPARK-2953
 URL: https://issues.apache.org/jira/browse/SPARK-2953
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 Instead of requiring org.apache.spark.io.LZ4CompressionCodec, it is easier 
 to just accept lz4, lzf, snappy.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2953) Allow using short names for io compression codecs

2014-08-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2953:
---

Component/s: Spark Core

 Allow using short names for io compression codecs
 -

 Key: SPARK-2953
 URL: https://issues.apache.org/jira/browse/SPARK-2953
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 Instead of requiring org.apache.spark.io.LZ4CompressionCodec, it is easier 
 to just accept lz4, lzf, snappy.






[jira] [Updated] (SPARK-2953) Allow using short names for io compression codecs

2014-08-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2953:
---

Description: Instead of requiring 
org.apache.spark.io.LZ4CompressionCodec, it is easier for users if Spark just 
accepts lz4, lzf, snappy.  (was: Instead of requiring 
org.apache.spark.io.LZ4CompressionCodec, it is easier to just accept lz4, 
lzf, snappy.)

 Allow using short names for io compression codecs
 -

 Key: SPARK-2953
 URL: https://issues.apache.org/jira/browse/SPARK-2953
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 Instead of requiring org.apache.spark.io.LZ4CompressionCodec, it is easier 
 for users if Spark just accepts lz4, lzf, snappy.






[jira] [Commented] (SPARK-2953) Allow using short names for io compression codecs

2014-08-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092034#comment-14092034
 ] 

Apache Spark commented on SPARK-2953:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1873

 Allow using short names for io compression codecs
 -

 Key: SPARK-2953
 URL: https://issues.apache.org/jira/browse/SPARK-2953
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 Instead of requiring org.apache.spark.io.LZ4CompressionCodec, it is easier 
 for users if Spark just accepts lz4, lzf, snappy.
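 A minimal PySpark sketch of what the short names would let users write (this 
 assumes the existing {{spark.io.compression.codec}} property; the actual 
 short-name mapping is defined by the pull request above):
 {code}
 from pyspark import SparkConf, SparkContext

 # Today the codec has to be spelled out in full:
 #   .set("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
 # With short-name support the same configuration would become:
 conf = (SparkConf()
         .setMaster("local[*]")
         .setAppName("short-codec-names")
         .set("spark.io.compression.codec", "lz4"))   # or "lzf", "snappy"
 sc = SparkContext(conf=conf)
 {code}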






[jira] [Updated] (SPARK-2947) DAGScheduler scheduling infinite loop

2014-08-10 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2947:
---

Summary: DAGScheduler scheduling infinite loop  (was: DAGScheduler 
scheduling dead cycle)

 DAGScheduler scheduling infinite loop
 -

 Key: SPARK-2947
 URL: https://issues.apache.org/jira/browse/SPARK-2947
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.2
Reporter: Guoqiang Li
Priority: Blocker
 Fix For: 1.1.0, 1.0.3


 The stage is resubmitted more than 5 times.
 This seems to be caused by {{FetchFailed.bmAddress}} being null.
 I don't know how to reproduce it.
 master log:
 {noformat}
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as 
 TID 52334 on executor 82: sanshan (PROCESS_LOCAL)
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 
 3060 bytes in 0 ms
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as 
 TID 52335 on executor 78: tuan231 (PROCESS_LOCAL)
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 
 3060 bytes in 0 ms
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 
 1.189:141)
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
 failure from null
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
 failure from null
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
  -- 5 times ---
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
 1.189, whose tasks have all completed, from pool 
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 
 ms on jilin (progress: 280/280)
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 
 269)
 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
 2.1, whose tasks have all completed, from pool 
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at 
 DealCF.scala:207) finished in 129.544 s
 {noformat}
 worker: log
 {noformat}
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_57 not found, 
 computing it
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_191 not found, 
 computing it
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18017
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18017
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18151
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18151
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, 
 computing it
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, 
 computing it
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18285
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18285
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18419
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18419
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 

[jira] [Created] (SPARK-2954) PySpark MLlib serialization tests fail on Python 2.6

2014-08-10 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-2954:
-

 Summary: PySpark MLlib serialization tests fail on Python 2.6
 Key: SPARK-2954
 URL: https://issues.apache.org/jira/browse/SPARK-2954
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Josh Rosen


The PySpark MLlib tests currently fail on Python 2.6 due to problems unpacking 
data from bytearray using struct.unpack:

{code}
**
File pyspark/mllib/_common.py, line 181, in __main__._deserialize_double
Failed example:
_deserialize_double(_serialize_double(1L)) == 1.0
Exception raised:
Traceback (most recent call last):
  File 
/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
 line 1253, in __run
compileflags, 1) in test.globs
  File doctest __main__._deserialize_double[4], line 1, in module
_deserialize_double(_serialize_double(1L)) == 1.0
  File pyspark/mllib/_common.py, line 194, in _deserialize_double
return struct.unpack(d, ba[offset:])[0]
error: unpack requires a string argument of length 8
**
File pyspark/mllib/_common.py, line 184, in __main__._deserialize_double
Failed example:
_deserialize_double(_serialize_double(sys.float_info.max)) == x
Exception raised:
Traceback (most recent call last):
  File 
/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
 line 1253, in __run
compileflags, 1) in test.globs
  File doctest __main__._deserialize_double[6], line 1, in module
_deserialize_double(_serialize_double(sys.float_info.max)) == x
  File pyspark/mllib/_common.py, line 194, in _deserialize_double
return struct.unpack(d, ba[offset:])[0]
error: unpack requires a string argument of length 8
**
File pyspark/mllib/_common.py, line 187, in __main__._deserialize_double
Failed example:
_deserialize_double(_serialize_double(sys.float_info.max)) == y
Exception raised:
Traceback (most recent call last):
  File 
/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
 line 1253, in __run
compileflags, 1) in test.globs
  File doctest __main__._deserialize_double[8], line 1, in module
_deserialize_double(_serialize_double(sys.float_info.max)) == y
  File pyspark/mllib/_common.py, line 194, in _deserialize_double
return struct.unpack(d, ba[offset:])[0]
error: unpack requires a string argument of length 8
**
{code}

It looks like one solution is to wrap the {{bytearray}} with {{buffer()}}: 
http://stackoverflow.com/a/15467046/590203






[jira] [Assigned] (SPARK-2954) PySpark MLlib serialization tests fail on Python 2.6

2014-08-10 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-2954:
-

Assignee: Josh Rosen

 PySpark MLlib serialization tests fail on Python 2.6
 

 Key: SPARK-2954
 URL: https://issues.apache.org/jira/browse/SPARK-2954
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Josh Rosen
Assignee: Josh Rosen

 The PySpark MLlib tests currently fail on Python 2.6 due to problems 
 unpacking data from bytearray using struct.unpack:
 {code}
 **
 File pyspark/mllib/_common.py, line 181, in __main__._deserialize_double
 Failed example:
 _deserialize_double(_serialize_double(1L)) == 1.0
 Exception raised:
 Traceback (most recent call last):
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
  line 1253, in __run
 compileflags, 1) in test.globs
   File doctest __main__._deserialize_double[4], line 1, in module
 _deserialize_double(_serialize_double(1L)) == 1.0
   File pyspark/mllib/_common.py, line 194, in _deserialize_double
 return struct.unpack(d, ba[offset:])[0]
 error: unpack requires a string argument of length 8
 **
 File pyspark/mllib/_common.py, line 184, in __main__._deserialize_double
 Failed example:
 _deserialize_double(_serialize_double(sys.float_info.max)) == x
 Exception raised:
 Traceback (most recent call last):
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
  line 1253, in __run
 compileflags, 1) in test.globs
   File doctest __main__._deserialize_double[6], line 1, in module
 _deserialize_double(_serialize_double(sys.float_info.max)) == x
   File pyspark/mllib/_common.py, line 194, in _deserialize_double
 return struct.unpack(d, ba[offset:])[0]
 error: unpack requires a string argument of length 8
 **
 File pyspark/mllib/_common.py, line 187, in __main__._deserialize_double
 Failed example:
 _deserialize_double(_serialize_double(sys.float_info.max)) == y
 Exception raised:
 Traceback (most recent call last):
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
  line 1253, in __run
 compileflags, 1) in test.globs
   File doctest __main__._deserialize_double[8], line 1, in module
 _deserialize_double(_serialize_double(sys.float_info.max)) == y
   File pyspark/mllib/_common.py, line 194, in _deserialize_double
 return struct.unpack(d, ba[offset:])[0]
 error: unpack requires a string argument of length 8
 **
 {code}
 It looks like one solution is to wrap the {{bytearray}} with {{buffer()}}: 
 http://stackoverflow.com/a/15467046/590203






[jira] [Updated] (SPARK-2954) PySpark MLlib serialization tests fail on Python 2.6

2014-08-10 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2954:
--

Component/s: PySpark

 PySpark MLlib serialization tests fail on Python 2.6
 

 Key: SPARK-2954
 URL: https://issues.apache.org/jira/browse/SPARK-2954
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Josh Rosen

 The PySpark MLlib tests currently fail on Python 2.6 due to problems 
 unpacking data from bytearray using struct.unpack:
 {code}
 **
 File pyspark/mllib/_common.py, line 181, in __main__._deserialize_double
 Failed example:
 _deserialize_double(_serialize_double(1L)) == 1.0
 Exception raised:
 Traceback (most recent call last):
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
  line 1253, in __run
 compileflags, 1) in test.globs
   File doctest __main__._deserialize_double[4], line 1, in module
 _deserialize_double(_serialize_double(1L)) == 1.0
   File pyspark/mllib/_common.py, line 194, in _deserialize_double
 return struct.unpack(d, ba[offset:])[0]
 error: unpack requires a string argument of length 8
 **
 File pyspark/mllib/_common.py, line 184, in __main__._deserialize_double
 Failed example:
 _deserialize_double(_serialize_double(sys.float_info.max)) == x
 Exception raised:
 Traceback (most recent call last):
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
  line 1253, in __run
 compileflags, 1) in test.globs
   File doctest __main__._deserialize_double[6], line 1, in module
 _deserialize_double(_serialize_double(sys.float_info.max)) == x
   File pyspark/mllib/_common.py, line 194, in _deserialize_double
 return struct.unpack(d, ba[offset:])[0]
 error: unpack requires a string argument of length 8
 **
 File pyspark/mllib/_common.py, line 187, in __main__._deserialize_double
 Failed example:
 _deserialize_double(_serialize_double(sys.float_info.max)) == y
 Exception raised:
 Traceback (most recent call last):
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
  line 1253, in __run
 compileflags, 1) in test.globs
   File doctest __main__._deserialize_double[8], line 1, in module
 _deserialize_double(_serialize_double(sys.float_info.max)) == y
   File pyspark/mllib/_common.py, line 194, in _deserialize_double
 return struct.unpack(d, ba[offset:])[0]
 error: unpack requires a string argument of length 8
 **
 {code}
 It looks like one solution is to wrap the {{bytearray}} with {{buffer()}}: 
 http://stackoverflow.com/a/15467046/590203






[jira] [Commented] (SPARK-2954) PySpark MLlib serialization tests fail on Python 2.6

2014-08-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092044#comment-14092044
 ] 

Apache Spark commented on SPARK-2954:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1874

 PySpark MLlib serialization tests fail on Python 2.6
 

 Key: SPARK-2954
 URL: https://issues.apache.org/jira/browse/SPARK-2954
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Josh Rosen
Assignee: Josh Rosen

 The PySpark MLlib tests currently fail on Python 2.6 due to problems 
 unpacking data from bytearray using struct.unpack:
 {code}
 **
 File pyspark/mllib/_common.py, line 181, in __main__._deserialize_double
 Failed example:
 _deserialize_double(_serialize_double(1L)) == 1.0
 Exception raised:
 Traceback (most recent call last):
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
  line 1253, in __run
 compileflags, 1) in test.globs
   File doctest __main__._deserialize_double[4], line 1, in module
 _deserialize_double(_serialize_double(1L)) == 1.0
   File pyspark/mllib/_common.py, line 194, in _deserialize_double
 return struct.unpack(d, ba[offset:])[0]
 error: unpack requires a string argument of length 8
 **
 File pyspark/mllib/_common.py, line 184, in __main__._deserialize_double
 Failed example:
 _deserialize_double(_serialize_double(sys.float_info.max)) == x
 Exception raised:
 Traceback (most recent call last):
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
  line 1253, in __run
 compileflags, 1) in test.globs
   File doctest __main__._deserialize_double[6], line 1, in module
 _deserialize_double(_serialize_double(sys.float_info.max)) == x
   File pyspark/mllib/_common.py, line 194, in _deserialize_double
 return struct.unpack(d, ba[offset:])[0]
 error: unpack requires a string argument of length 8
 **
 File pyspark/mllib/_common.py, line 187, in __main__._deserialize_double
 Failed example:
 _deserialize_double(_serialize_double(sys.float_info.max)) == y
 Exception raised:
 Traceback (most recent call last):
   File 
 /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py,
  line 1253, in __run
 compileflags, 1) in test.globs
   File doctest __main__._deserialize_double[8], line 1, in module
 _deserialize_double(_serialize_double(sys.float_info.max)) == y
   File pyspark/mllib/_common.py, line 194, in _deserialize_double
 return struct.unpack(d, ba[offset:])[0]
 error: unpack requires a string argument of length 8
 **
 {code}
 It looks like one solution is to wrap the {{bytearray}} with {{buffer()}}: 
 http://stackoverflow.com/a/15467046/590203
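 A minimal Python 2 sketch of the suggested workaround (a hypothetical helper, 
 not the actual pyspark/mllib/_common.py code; {{buffer()}} only exists on 
 Python 2):
 {code}
 import struct

 def deserialize_double(ba, offset=0):
     # On Python 2.6, struct.unpack() rejects a bytearray slice with
     # "unpack requires a string argument of length 8"; reading through a
     # read-only buffer() view with unpack_from() avoids both the copy
     # and the error.
     return struct.unpack_from("d", buffer(ba), offset)[0]

 ba = bytearray(struct.pack("d", 1.0))
 assert deserialize_double(ba) == 1.0
 {code}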






[jira] [Commented] (SPARK-2948) PySpark doesn't work on Python 2.6

2014-08-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092045#comment-14092045
 ] 

Apache Spark commented on SPARK-2948:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1874

 PySpark doesn't work on Python 2.6
 --

 Key: SPARK-2948
 URL: https://issues.apache.org/jira/browse/SPARK-2948
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: CentOS 6.5 / Python 2.6.6
Reporter: Kousuke Saruta
Priority: Blocker

 In serializers.py, collections.namedtuple is redefined as follows.
 {code}
 def namedtuple(name, fields, verbose=False, rename=False):
     cls = _old_namedtuple(name, fields, verbose, rename)
     return _hack_namedtuple(cls)
 {code}
 The wrapper takes and forwards 4 arguments, but namedtuple in Python 2.6 only 
 accepts 3, so there is a signature mismatch.






[jira] [Assigned] (SPARK-2948) PySpark doesn't work on Python 2.6

2014-08-10 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-2948:
-

Assignee: Josh Rosen

 PySpark doesn't work on Python 2.6
 --

 Key: SPARK-2948
 URL: https://issues.apache.org/jira/browse/SPARK-2948
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: CentOS 6.5 / Python 2.6.6
Reporter: Kousuke Saruta
Assignee: Josh Rosen
Priority: Blocker

 In serializers.py, collections.namedtuple is redefined as follows.
 {code}
 def namedtuple(name, fields, verbose=False, rename=False):
     cls = _old_namedtuple(name, fields, verbose, rename)
     return _hack_namedtuple(cls)
 {code}
 The wrapper takes and forwards 4 arguments, but namedtuple in Python 2.6 only 
 accepts 3, so there is a signature mismatch.
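 One way to illustrate a version-tolerant wrapper (a sketch only, not 
 necessarily the fix adopted in Spark; {{_hack_namedtuple}} is stubbed out here 
 because the real helper lives in serializers.py):
 {code}
 import collections

 _old_namedtuple = collections.namedtuple

 def _hack_namedtuple(cls):
     # Stand-in for PySpark's real helper; its details do not matter here.
     return cls

 def namedtuple(*args, **kwargs):
     # Forward whatever arguments the caller passed instead of hard-coding the
     # Python 2.7 signature (name, fields, verbose, rename); Python 2.6's
     # collections.namedtuple only accepts (name, fields, verbose).
     cls = _old_namedtuple(*args, **kwargs)
     return _hack_namedtuple(cls)

 Point = namedtuple("Point", "x y")
 print(Point(1, 2))
 {code}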






[jira] [Commented] (SPARK-2101) Python unit tests fail on Python 2.6 because of lack of unittest.skipIf()

2014-08-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092048#comment-14092048
 ] 

Apache Spark commented on SPARK-2101:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1874

 Python unit tests fail on Python 2.6 because of lack of unittest.skipIf()
 -

 Key: SPARK-2101
 URL: https://issues.apache.org/jira/browse/SPARK-2101
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Uri Laserson
Assignee: Josh Rosen

 PySpark tests fail with Python 2.6 because they currently depend on 
 {{unittest.skipIf}}, which was only introduced in Python 2.7.
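 For illustration, one common way to keep such tests runnable on 2.6 is to fall 
 back to the unittest2 backport, which provides {{skipIf}} (a sketch under that 
 assumption, not necessarily what the linked pull request does):
 {code}
 import sys

 # unittest.skipIf() only appeared in Python 2.7; on 2.6 the unittest2
 # backport offers the same API under a different module name.
 if sys.version_info[:2] >= (2, 7):
     import unittest
 else:
     import unittest2 as unittest

 class ExampleSuite(unittest.TestCase):
     @unittest.skipIf(sys.platform.startswith("win"), "not supported on Windows")
     def test_addition(self):
         self.assertEqual(1 + 1, 2)

 if __name__ == "__main__":
     unittest.main()
 {code}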






[jira] [Commented] (SPARK-2910) Test with Python 2.6 on Jenkins

2014-08-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092046#comment-14092046
 ] 

Apache Spark commented on SPARK-2910:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1874

 Test with Python 2.6 on Jenkins
 ---

 Key: SPARK-2910
 URL: https://issues.apache.org/jira/browse/SPARK-2910
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra, PySpark
Reporter: Josh Rosen

 As long as we continue to support Python 2.6 in PySpark, Jenkins should test  
 with Python 2.6.
 We could downgrade the system Python to 2.6, but it might be easier / cleaner 
 to install 2.6 alongside the current Python and {{export 
 PYSPARK_PYTHON=python2.6}} in the test runner script.






[jira] [Assigned] (SPARK-2910) Test with Python 2.6 on Jenkins

2014-08-10 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-2910:
-

Assignee: Josh Rosen

 Test with Python 2.6 on Jenkins
 ---

 Key: SPARK-2910
 URL: https://issues.apache.org/jira/browse/SPARK-2910
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra, PySpark
Reporter: Josh Rosen
Assignee: Josh Rosen

 As long as we continue to support Python 2.6 in PySpark, Jenkins should test  
 with Python 2.6.
 We could downgrade the system Python to 2.6, but it might be easier / cleaner 
 to install 2.6 alongside the current Python and {{export 
 PYSPARK_PYTHON=python2.6}} in the test runner script.
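 As an illustration, a developer could exercise the same path locally before 
 Jenkins does; this sketch assumes a python2.6 binary on the PATH and relies on 
 PySpark honouring the PYSPARK_PYTHON variable when it launches workers:
 {code}
 import os
 from pyspark import SparkConf, SparkContext

 # PySpark starts its worker processes with whatever interpreter the
 # PYSPARK_PYTHON environment variable names, so set it before the
 # SparkContext is created to exercise the Python 2.6 code paths.
 os.environ["PYSPARK_PYTHON"] = "python2.6"

 sc = SparkContext(conf=SparkConf().setMaster("local[2]").setAppName("py26-smoke"))
 print(sc.parallelize(range(100)).sum())
 sc.stop()
 {code}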






[jira] [Commented] (SPARK-2945) Allow specifying num of executors in the context configuration

2014-08-10 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092058#comment-14092058
 ] 

Shay Rojansky commented on SPARK-2945:
--

I just did a quick test on Spark 1.0.2, and spark.executor.instances does 
indeed appear to control the number of executors allocated (at least in YARN).

Should I keep this open for you guys to take a look and update the docs?

 Allow specifying num of executors in the context configuration
 --

 Key: SPARK-2945
 URL: https://issues.apache.org/jira/browse/SPARK-2945
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.0.0
 Environment: Ubuntu precise, on YARN (CDH 5.1.0)
Reporter: Shay Rojansky

 Running on YARN, the only way to specify the number of executors seems to be 
 on the command line of spark-submit, via the --num-executors switch.
 In many cases this is too early. Our Spark app receives some cmdline 
 arguments which determine the amount of work that needs to be done - and that 
 affects the number of executors it ideally requires. Ideally, the Spark 
 context configuration would support specifying this like any other config 
 param.
 Our current workaround is a wrapper script that determines how much work is 
 needed, and which itself launches spark-submit with the number passed to 
 --num-executors - it's a shame to have to do this.
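 A sketch of the desired usage (relying on spark.executor.instances being 
 honoured by the YARN backend, as reported in the comment above; 
 compute_needed_executors() is a hypothetical piece of application logic):
 {code}
 from pyspark import SparkConf, SparkContext

 def compute_needed_executors():
     # Hypothetical: inspect the command-line arguments / workload size and
     # decide how many executors this run should request.
     return 8

 conf = (SparkConf()
         .setAppName("sized-to-the-workload")
         .set("spark.executor.instances", str(compute_needed_executors())))
 sc = SparkContext(conf=conf)
 {code}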






[jira] [Updated] (SPARK-2947) DAGScheduler resubmit the stage into an infinite loop

2014-08-10 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2947:
---

Summary: DAGScheduler resubmit the stage into an infinite loop  (was: 
DAGScheduler resubmit the task into an infinite loop)

 DAGScheduler resubmit the stage into an infinite loop
 -

 Key: SPARK-2947
 URL: https://issues.apache.org/jira/browse/SPARK-2947
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.2
Reporter: Guoqiang Li
Priority: Blocker
 Fix For: 1.1.0, 1.0.3


 The stage is resubmitted more than 5 times.
 This seems to be caused by {{FetchFailed.bmAddress}} being null.
 I don't know how to reproduce it.
 master log:
 {noformat}
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as 
 TID 52334 on executor 82: sanshan (PROCESS_LOCAL)
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 
 3060 bytes in 0 ms
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as 
 TID 52335 on executor 78: tuan231 (PROCESS_LOCAL)
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 
 3060 bytes in 0 ms
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 
 1.189:141)
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
 failure from null
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
 failure from null
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
  -- 5 times ---
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
 1.189, whose tasks have all completed, from pool 
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 
 ms on jilin (progress: 280/280)
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 
 269)
 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
 2.1, whose tasks have all completed, from pool 
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at 
 DealCF.scala:207) finished in 129.544 s
 {noformat}
 worker: log
 {noformat}
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_57 not found, 
 computing it
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_191 not found, 
 computing it
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18017
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18017
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18151
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18151
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, 
 computing it
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, 
 computing it
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18285
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18285
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18419
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18419
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO 

[jira] [Updated] (SPARK-2947) DAGScheduler resubmit the task into an infinite loop

2014-08-10 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2947:
---

Summary: DAGScheduler resubmit the task into an infinite loop  (was: 
DAGScheduler scheduling infinite loop)

 DAGScheduler resubmit the task into an infinite loop
 

 Key: SPARK-2947
 URL: https://issues.apache.org/jira/browse/SPARK-2947
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.2
Reporter: Guoqiang Li
Priority: Blocker
 Fix For: 1.1.0, 1.0.3


 The stage is resubmitted more than 5 times.
 This seems to be caused by {{FetchFailed.bmAddress}} being null.
 I don't know how to reproduce it.
 master log:
 {noformat}
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as 
 TID 52334 on executor 82: sanshan (PROCESS_LOCAL)
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 
 3060 bytes in 0 ms
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as 
 TID 52335 on executor 78: tuan231 (PROCESS_LOCAL)
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 
 3060 bytes in 0 ms
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 
 1.189:141)
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
 failure from null
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
 failure from null
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
  -- 5 times ---
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
 1.189, whose tasks have all completed, from pool 
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 
 ms on jilin (progress: 280/280)
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 
 269)
 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
 2.1, whose tasks have all completed, from pool 
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at 
 DealCF.scala:207) finished in 129.544 s
 {noformat}
 worker: log
 {noformat}
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_57 not found, 
 computing it
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_191 not found, 
 computing it
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18017
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18017
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18151
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18151
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, 
 computing it
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, 
 computing it
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18285
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18285
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18419
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18419
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO 

[jira] [Commented] (SPARK-2947) DAGScheduler resubmit the stage into an infinite loop

2014-08-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092100#comment-14092100
 ] 

Apache Spark commented on SPARK-2947:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/1877

 DAGScheduler resubmit the stage into an infinite loop
 -

 Key: SPARK-2947
 URL: https://issues.apache.org/jira/browse/SPARK-2947
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.2
Reporter: Guoqiang Li
Priority: Blocker
 Fix For: 1.1.0, 1.0.3


 The stage is resubmitted more than 5 times.
 This seems to be caused by {{FetchFailed.bmAddress}} being null.
 I don't know how to reproduce it.
 master log:
 {noformat}
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as 
 TID 52334 on executor 82: sanshan (PROCESS_LOCAL)
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 
 3060 bytes in 0 ms
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as 
 TID 52335 on executor 78: tuan231 (PROCESS_LOCAL)
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 
 3060 bytes in 0 ms
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 
 1.189:141)
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
 failure from null
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
 failure from null
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
  -- 5 times ---
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
 DealCF.scala:215) for resubmision due to a fetch failure
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
 Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
 1.189, whose tasks have all completed, from pool 
 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 
 ms on jilin (progress: 280/280)
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 
 269)
 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
 2.1, whose tasks have all completed, from pool 
 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at 
 DealCF.scala:207) finished in 129.544 s
 {noformat}
 worker: log
 {noformat}
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_57 not found, 
 computing it
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_191 not found, 
 computing it
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18017
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18017
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18151
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18151
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, 
 computing it
 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, 
 computing it
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18285
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18285
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 18419
 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18419
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 

[jira] [Updated] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-10 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-1297:
--

Attachment: spark-1297-v2.txt

Tentative patch adds hbase-hadoop2 profile.

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Priority: Minor
 Attachments: spark-1297-v2.txt


 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0






[jira] [Updated] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-10 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-1297:
--

Attachment: (was: spark-1297-v2.txt)

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Priority: Minor

 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0






[jira] [Updated] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-10 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-1297:
--

Attachment: spark-1297-v2.txt

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Priority: Minor
 Attachments: spark-1297-v2.txt


 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0






[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092115#comment-14092115
 ] 

Sean Owen commented on SPARK-1297:
--

This doesn't work with Hadoop 1 though. It also requires turning on an HBase 
profile for every build. See my comments above; I think this can be made 
friendlier with more work in the profiles. I think it requires a hadoop1 
profile to really solve this kind of problem for every component, not just 
HBase.

 Upgrade HBase dependency to 0.98.0
 --

 Key: SPARK-1297
 URL: https://issues.apache.org/jira/browse/SPARK-1297
 Project: Spark
  Issue Type: Task
Reporter: Ted Yu
Priority: Minor
 Attachments: spark-1297-v2.txt


 HBase 0.94.6 was released 11 months ago.
 Upgrade HBase dependency to 0.98.0






[jira] [Commented] (SPARK-2944) sc.makeRDD doesn't distribute partitions evenly

2014-08-10 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092135#comment-14092135
 ] 

Xiangrui Meng commented on SPARK-2944:
--

Found that this behavior is not deterministic, so it is hard to tell which 
commit introduced it. It seems to happen when tasks are very small: some workers 
may get far more assignments than others because they finish their tasks very 
quickly and TaskSetManager always picks the first available one. (There is no 
randomization in `TaskSetManager`.)

 sc.makeRDD doesn't distribute partitions evenly
 ---

 Key: SPARK-2944
 URL: https://issues.apache.org/jira/browse/SPARK-2944
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical

 16 nodes EC2 cluster:
 {code}
 val rdd = sc.makeRDD(0 until 1e9.toInt, 1000).cache()
 rdd.count()
 {code}
 Saw 156 partitions on one node while only 8 partitions on another.
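 One rough way to see the imbalance from PySpark (an approximation only: it 
 reports the hosts where follow-up tasks run, which prefer the cached blocks, 
 rather than reading block locations directly):
 {code}
 from collections import Counter
 from pyspark import SparkContext

 def host_of_partition(_):
     import socket
     return [socket.gethostname()]

 sc = SparkContext()
 rdd = sc.parallelize(range(int(1e7)), 1000).cache()
 rdd.count()
 # Count how many of the 1000 cached partitions end up on each host.
 print(Counter(rdd.mapPartitions(host_of_partition).collect()))
 {code}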






[jira] [Updated] (SPARK-2950) Add GC time and Shuffle Write time to JobLogger output

2014-08-10 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-2950:
-

Fix Version/s: 1.2.0

 Add GC time and Shuffle Write time to JobLogger output
 --

 Key: SPARK-2950
 URL: https://issues.apache.org/jira/browse/SPARK-2950
 Project: Spark
  Issue Type: Improvement
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor
 Fix For: 1.2.0


 The JobLogger is very useful for performing offline performance profiling of 
 Spark jobs. GC Time and Shuffle Write time are available in TaskMetrics but 
 are currently missing from the JobLogger output. This change adds these two 
 fields.






[jira] [Resolved] (SPARK-2950) Add GC time and Shuffle Write time to JobLogger output

2014-08-10 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-2950.
--

Resolution: Fixed

 Add GC time and Shuffle Write time to JobLogger output
 --

 Key: SPARK-2950
 URL: https://issues.apache.org/jira/browse/SPARK-2950
 Project: Spark
  Issue Type: Improvement
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor
 Fix For: 1.2.0


 The JobLogger is very useful for performing offline performance profiling of 
 Spark jobs. GC Time and Shuffle Write time are available in TaskMetrics but 
 are currently missing from the JobLogger output. This change adds these two 
 fields.






[jira] [Resolved] (SPARK-2898) Failed to connect to daemon

2014-08-10 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2898.
---

   Resolution: Fixed
Fix Version/s: 1.1.0

 Failed to connect to daemon
 ---

 Key: SPARK-2898
 URL: https://issues.apache.org/jira/browse/SPARK-2898
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.1.0


 There is a deadlock in handle_sigchld() because of logging.
 
 Java options: -Dspark.storage.memoryFraction=0.66 
 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer 
 -Dspark.executor.memory=3g -Dspark.locality.wait=6000
 Options: SchedulerThroughputTest --num-tasks=1 --num-trials=4 
 --inter-trial-wait=1
 
 14/08/06 22:09:41 WARN JettyUtils: Failed to create UI on port 4040. Trying 
 again on port 4041. - Failure(java.net.BindException: Address already in use)
 worker 50114 crashed abruptly with exit status 1
 14/08/06 22:10:37 ERROR Executor: Exception in task 1476.0 in stage 1.0 (TID 
 11476)
 org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:150)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:392)
   at 
 org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:101)
   ... 10 more
 14/08/06 22:10:37 WARN PythonWorkerFactory: Failed to open socket to Python 
 daemon:
 java.net.ConnectException: Connection refused
   at java.net.PlainSocketImpl.socketConnect(Native Method)
   at 
 java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
   at 
 java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
   at 
 java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
   at java.net.Socket.connect(Socket.java:579)
   at java.net.Socket.connect(Socket.java:528)
   at java.net.Socket.<init>(Socket.java:425)
   at java.net.Socket.<init>(Socket.java:241)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:68)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55)
   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/08/06 22:10:37 ERROR Executor: Exception in task 1478.0 in stage 1.0 (TID 
 11478)
 java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:392)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:69)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55)
   at 

[jira] [Created] (SPARK-2955) Test code fails to compile with mvn compile without install

2014-08-10 Thread Sean Owen (JIRA)
Sean Owen created SPARK-2955:


 Summary: Test code fails to compile with mvn compile without 
install 
 Key: SPARK-2955
 URL: https://issues.apache.org/jira/browse/SPARK-2955
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Sean Owen
Priority: Minor


(This is the corrected follow-up to 
https://issues.apache.org/jira/browse/SPARK-2903 )

Right now, mvn compile test-compile fails to compile Spark. (Don't worry; 
mvn package works, so this is not major.) The issue stems from test code in 
some modules depending on test code in other modules. That is perfectly fine 
and supported by Maven.

It takes extra work to get this to work with scalatest, and this has been 
attempted: https://github.com/apache/spark/blob/master/sql/catalyst/pom.xml#L86

This formulation is not quite enough, since the SQL Core module's tests fail to 
compile for lack of finding test classes in SQL Catalyst, and likewise for most 
Streaming integration modules depending on core Streaming test code. Example:

{code}
[error] 
/Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala:23:
 not found: type PlanTest
[error] class QueryTest extends PlanTest {
[error] ^
[error] 
/Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:28:
 package org.apache.spark.sql.test is not a value
[error]   test("SPARK-1669: cacheTable should be idempotent") {
[error]   ^
...
{code}

The issue I believe is that generation of a test-jar is bound here to the 
compile phase, but the test classes are not being compiled in this phase. It 
should bind to the test-compile phase.

It works when executing mvn package or mvn install, since test-jar artifacts 
are actually generated and made available through normal Maven mechanisms as each 
module is built. They are then found normally, regardless of scalatest configuration.

It would be nice for a simple mvn compile test-compile to work since the test 
code is perfectly compilable given the Maven declarations.

On the plus side, this change is low-risk as it only affects tests.
[~yhuai] made the original scalatest change and has glanced at this and thinks 
it makes sense.






[jira] [Commented] (SPARK-2955) Test code fails to compile with mvn compile without install

2014-08-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092180#comment-14092180
 ] 

Apache Spark commented on SPARK-2955:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1879

 Test code fails to compile with mvn compile without install 
 

 Key: SPARK-2955
 URL: https://issues.apache.org/jira/browse/SPARK-2955
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Sean Owen
Priority: Minor
  Labels: build, compile, scalatest, test, test-compile

 (This is the corrected follow-up to 
 https://issues.apache.org/jira/browse/SPARK-2903 )
 Right now, mvn compile test-compile fails to compile Spark. (Don't worry; 
 mvn package works, so this is not major.) The issue stems from test code in 
 some modules depending on test code in other modules. That is perfectly fine 
 and supported by Maven.
 It takes extra work to get this to work with scalatest, and this has been 
 attempted: 
 https://github.com/apache/spark/blob/master/sql/catalyst/pom.xml#L86
 This formulation is not quite enough, since the SQL Core module's tests fail 
 to compile for lack of finding test classes in SQL Catalyst, and likewise for 
 most Streaming integration modules depending on core Streaming test code. 
 Example:
 {code}
 [error] /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala:23: not found: type PlanTest
 [error] class QueryTest extends PlanTest {
 [error] ^
 [error] /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:28: package org.apache.spark.sql.test is not a value
 [error]   test("SPARK-1669: cacheTable should be idempotent") {
 [error]   ^
 ...
 {code}
 The issue I believe is that generation of a test-jar is bound here to the 
 compile phase, but the test classes are not being compiled in this phase. It 
 should bind to the test-compile phase.
 It works when executing mvn package or mvn install since test-jar 
 artifacts are actually generated and made available through normal Maven mechanisms as 
 each module is built. They are then found normally, regardless of scalatest 
 configuration.
 It would be nice for a simple mvn compile test-compile to work since the 
 test code is perfectly compilable given the Maven declarations.
 On the plus side, this change is low-risk as it only affects tests.
 [~yhuai] made the original scalatest change and has glanced at this and 
 thinks it makes sense.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2650) Caching tables larger than memory causes OOMs

2014-08-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2650:


Summary: Caching tables larger than memory causes OOMs  (was: Wrong initial 
sizes for in-memory column buffers)

 Caching tables larger than memory causes OOMs
 -

 Key: SPARK-2650
 URL: https://issues.apache.org/jira/browse/SPARK-2650
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0, 1.0.1
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Critical

 The logic for setting up the initial column buffers is different for Spark 
 SQL compared to Shark, and I'm seeing OOMs when caching tables that are larger 
 than available memory (where Shark was okay).
 Two suspicious things: the initialSize is always set to 0, so we always go with 
 the default.  The default looks like it was copied from code like 10 * 1024 * 
 1024... but in Spark SQL it's 10 * 102 * 1024.
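
For concreteness, a quick check of the two constants mentioned above; the variable names here are illustrative, not the actual Spark SQL field names:

{code}
// Illustrative arithmetic only; names are hypothetical.
val presumablyIntended = 10 * 1024 * 1024 // 10,485,760 bytes, i.e. 10 MB
val currentDefault     = 10 * 102 * 1024  //  1,044,480 bytes, roughly 1 MB
// The current default is roughly a factor of 10 smaller than the 10 MB
// value it appears to have been copied from.
{code}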



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2650) Caching tables larger than memory causes OOMs

2014-08-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092209#comment-14092209
 ] 

Apache Spark commented on SPARK-2650:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/1880

 Caching tables larger than memory causes OOMs
 -

 Key: SPARK-2650
 URL: https://issues.apache.org/jira/browse/SPARK-2650
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0, 1.0.1
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Critical

 The logic for setting up the initial column buffers is different for Spark 
 SQL compared to Shark and I'm seeing OOMs when caching tables that are larger 
 than available memory (where shark was okay).
 Two suspicious things: the initialSize is always set to 0, so we always go with 
 the default.  The default looks like it was copied from code like 10 * 1024 * 
 1024... but in Spark SQL it's 10 * 102 * 1024.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2937) Separate out sampleByKeyExact in PairRDDFunctions as its own API

2014-08-10 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2937.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1866
[https://github.com/apache/spark/pull/1866]

 Separate out sampleByKeyExact in PairRDDFunctions as its own API
 

 Key: SPARK-2937
 URL: https://issues.apache.org/jira/browse/SPARK-2937
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Doris Xin
Assignee: Doris Xin
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2956) Support transferring large blocks in Netty network module

2014-08-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2956:
---

Summary: Support transferring large blocks in Netty network module  (was: 
Support transferring blocks larger than MTU size)

 Support transferring large blocks in Netty network module
 -

 Key: SPARK-2956
 URL: https://issues.apache.org/jira/browse/SPARK-2956
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical

 The existing Netty shuffle implementation does not support large blocks. 
 The culprit is in FileClientHandler.channelRead0().



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2956) Support transferring blocks larger than MTU size

2014-08-10 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2956:
--

 Summary: Support transferring blocks larger than MTU size
 Key: SPARK-2956
 URL: https://issues.apache.org/jira/browse/SPARK-2956
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical


The existing Netty shuffle implementation does not support large blocks. 

The culprit is in FileClientHandler.channelRead0().



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2957) Leverage Hadoop native io's fadvise and read-ahead in Netty transferTo

2014-08-10 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092290#comment-14092290
 ] 

Reynold Xin commented on SPARK-2957:


cc [~tlipcon] [~t...@lipcon.org] will probably bug you when we work on this. 

 Leverage Hadoop native io's fadvise and read-ahead in Netty transferTo
 --

 Key: SPARK-2957
 URL: https://issues.apache.org/jira/browse/SPARK-2957
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Reporter: Reynold Xin





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2468) Netty-based shuffle network module

2014-08-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2468:
---

Summary: Netty-based shuffle network module  (was: Netty based network 
communication)

 Netty-based shuffle network module
 --

 Key: SPARK-2468
 URL: https://issues.apache.org/jira/browse/SPARK-2468
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical

 Right now shuffle send goes through the block manager. This is inefficient 
 because it requires loading a block from disk into a kernel buffer, then into 
 a user space buffer, and then back to a kernel send buffer before it reaches 
 the NIC. It does multiple copies of the data and context switching between 
 kernel/user. It also creates unnecessary buffers in the JVM, which increases GC pressure.
 Instead, we should use FileChannel.transferTo, which handles this in the 
 kernel space with zero-copy. See 
 http://www.ibm.com/developerworks/library/j-zerocopy/
 One potential solution is to use Netty.  Spark already has a Netty based 
 network module implemented (org.apache.spark.network.netty). However, it 
 lacks some functionality and is turned off by default. 
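
As a rough illustration of the zero-copy path the description refers to, a minimal sketch using FileChannel.transferTo might look like the following; the method and parameter names are hypothetical, not Spark's:

{code}
import java.io.{File, FileInputStream}
import java.nio.channels.WritableByteChannel

// Sketch only: hand the file to the target channel inside the kernel via
// transferTo, avoiding the extra user-space copies described above.
def sendFileZeroCopy(file: File, target: WritableByteChannel): Long = {
  val fileChannel = new FileInputStream(file).getChannel
  try {
    var position = 0L
    val size = fileChannel.size()
    while (position < size) {
      // transferTo may transfer fewer bytes than requested, so loop until done.
      position += fileChannel.transferTo(position, size - position, target)
    }
    position
  } finally {
    fileChannel.close()
  }
}
{code}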



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2468) Netty based network communication

2014-08-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2468:
---

Summary: Netty based network communication  (was: zero-copy shuffle network 
communication)

 Netty based network communication
 -

 Key: SPARK-2468
 URL: https://issues.apache.org/jira/browse/SPARK-2468
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical

 Right now shuffle send goes through the block manager. This is inefficient 
 because it requires loading a block from disk into a kernel buffer, then into 
 a user space buffer, and then back to a kernel send buffer before it reaches 
 the NIC. It does multiple copies of the data and context switching between 
 kernel/user. It also creates unnecessary buffers in the JVM, which increases GC pressure.
 Instead, we should use FileChannel.transferTo, which handles this in the 
 kernel space with zero-copy. See 
 http://www.ibm.com/developerworks/library/j-zerocopy/
 One potential solution is to use Netty.  Spark already has a Netty based 
 network module implemented (org.apache.spark.network.netty). However, it 
 lacks some functionality and is turned off by default. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2957) Leverage Hadoop native io's fadvise and read-ahead in Netty transferTo

2014-08-10 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2957:
--

 Summary: Leverage Hadoop native io's fadvise and read-ahead in 
Netty transferTo
 Key: SPARK-2957
 URL: https://issues.apache.org/jira/browse/SPARK-2957
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2956) Support transferring large blocks in Netty network module

2014-08-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2956:
---

Description: 
The existing Netty shuffle implementation does not support large blocks. 

The culprit is in FileClientHandler.channelRead0().

We should add a LengthFieldBasedFrameDecoder to the pipeline.

  was:
The existing Netty shuffle implementation does not support large blocks. 

The culprit is in FileClientHandler.channelRead0().


 Support transferring large blocks in Netty network module
 -

 Key: SPARK-2956
 URL: https://issues.apache.org/jira/browse/SPARK-2956
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical

 The existing Netty shuffle implementation does not support large blocks. 
 The culprit is in FileClientHandler.channelRead0().
 We should add a LengthFieldBasedFrameDecoder to the pipeline.
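
A hedged sketch of what adding a LengthFieldBasedFrameDecoder to the pipeline could look like; the class and handler names are illustrative, not the actual FileClient code:

{code}
import io.netty.buffer.ByteBuf
import io.netty.channel.{ChannelHandlerContext, ChannelInitializer, SimpleChannelInboundHandler}
import io.netty.channel.socket.SocketChannel
import io.netty.handler.codec.LengthFieldBasedFrameDecoder

// Sketch only: reassemble frames using an 8-byte length prefix (stripped before
// delivery) so the handler always sees one complete block, however it was split
// on the wire.
class BlockFrameInitializer(maxFrameBytes: Int) extends ChannelInitializer[SocketChannel] {
  override def initChannel(ch: SocketChannel): Unit = {
    ch.pipeline()
      .addLast(new LengthFieldBasedFrameDecoder(maxFrameBytes, 0, 8, 0, 8))
      .addLast(new SimpleChannelInboundHandler[ByteBuf]() {
        override def channelRead0(ctx: ChannelHandlerContext, frame: ByteBuf): Unit = {
          // `frame` is now a full block rather than a partial read.
        }
      })
  }
}
{code}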



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2957) Leverage Hadoop native io's fadvise and read-ahead in Netty transferTo

2014-08-10 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092294#comment-14092294
 ] 

Todd Lipcon commented on SPARK-2957:


Sure, happy to help

 Leverage Hadoop native io's fadvise and read-ahead in Netty transferTo
 --

 Key: SPARK-2957
 URL: https://issues.apache.org/jira/browse/SPARK-2957
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Reporter: Reynold Xin





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2958) FileClientHandler should not be shared in the pipeline

2014-08-10 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2958:
--

 Summary: FileClientHandler should not be shared in the pipeline
 Key: SPARK-2958
 URL: https://issues.apache.org/jira/browse/SPARK-2958
 Project: Spark
  Issue Type: Bug
Reporter: Reynold Xin


The Netty module creates a single FileClientHandler and shares it across all threads. 
We should create a new one for each pipeline thread.
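
A minimal sketch of the direction described above, assuming a stateful handler that must not be shared; names are hypothetical, not the actual FileClientHandler:

{code}
import io.netty.buffer.ByteBuf
import io.netty.channel.{ChannelHandlerContext, ChannelInitializer, SimpleChannelInboundHandler}
import io.netty.channel.socket.SocketChannel

// Sketch only: a handler with per-connection state should be instantiated inside
// initChannel so each channel gets its own copy instead of one shared instance.
class StatefulBlockHandler extends SimpleChannelInboundHandler[ByteBuf] {
  private var bytesReceived = 0L // per-connection state; unsafe to share across channels
  override def channelRead0(ctx: ChannelHandlerContext, msg: ByteBuf): Unit = {
    bytesReceived += msg.readableBytes()
  }
}

class PerChannelInitializer extends ChannelInitializer[SocketChannel] {
  override def initChannel(ch: SocketChannel): Unit = {
    ch.pipeline().addLast(new StatefulBlockHandler) // a fresh handler for every channel
  }
}
{code}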



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2959) Use a single FileClient and Netty client thread pool

2014-08-10 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2959:
--

 Summary: Use a single FileClient and Netty client thread pool
 Key: SPARK-2959
 URL: https://issues.apache.org/jira/browse/SPARK-2959
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin


The current implementation creates a new Netty bootstrap for fetching each 
block. This is pretty crazy! 

We should reuse the bootstrap in FileClient.
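
A hedged sketch of the reuse the description suggests: build one Bootstrap (and one event loop group) up front and only open a new connection per fetch. All names here are illustrative.

{code}
import io.netty.bootstrap.Bootstrap
import io.netty.channel.ChannelInitializer
import io.netty.channel.nio.NioEventLoopGroup
import io.netty.channel.socket.SocketChannel
import io.netty.channel.socket.nio.NioSocketChannel

// Sketch only: one shared Bootstrap and thread pool for all block fetches,
// instead of constructing a new Bootstrap per block.
class ReusableBlockClient {
  private val group = new NioEventLoopGroup()
  private val bootstrap = new Bootstrap()
    .group(group)
    .channel(classOf[NioSocketChannel])
    .handler(new ChannelInitializer[SocketChannel] {
      override def initChannel(ch: SocketChannel): Unit = {
        // pipeline setup (frame decoding, block handler) elided
      }
    })

  def fetch(host: String, port: Int): Unit = {
    // Each fetch reuses the same bootstrap; only the connection is new.
    val channel = bootstrap.connect(host, port).sync().channel()
    try {
      // request/response handling elided
    } finally {
      channel.close()
    }
  }

  def shutdown(): Unit = group.shutdownGracefully()
}
{code}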




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2677) BasicBlockFetchIterator#next can wait forever

2014-08-10 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092355#comment-14092355
 ] 

Kousuke Saruta commented on SPARK-2677:
---

SPARK-2538 was resolved, but this issue still remains.
I tried to resolve it in https://github.com/apache/spark/pull/1632

 BasicBlockFetchIterator#next can wait forever
 -

 Key: SPARK-2677
 URL: https://issues.apache.org/jira/browse/SPARK-2677
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.2, 1.0.0, 1.0.1
Reporter: Kousuke Saruta
Assignee: Josh Rosen
Priority: Blocker

 In BasicBlockFetchIterator#next, it waits for the fetch result on results.take().
 {code}
 override def next(): (BlockId, Option[Iterator[Any]]) = {
   resultsGotten += 1
   val startFetchWait = System.currentTimeMillis()
   val result = results.take()
   val stopFetchWait = System.currentTimeMillis()
   _fetchWaitTime += (stopFetchWait - startFetchWait)
   if (!result.failed) bytesInFlight -= result.size
   while (!fetchRequests.isEmpty &&
     (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
     sendRequest(fetchRequests.dequeue())
   }
   (result.blockId, if (result.failed) None else Some(result.deserialize()))
 }
 {code}
 But results is implemented as a LinkedBlockingQueue, so if the remote executor 
 hangs up, the fetching executor waits forever.
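
One way to avoid the indefinite wait, sketched here under the assumption of a simple timeout policy; the FetchResult type and the timeout value are illustrative, not the actual fix in the linked pull request:

{code}
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit, TimeoutException}

// Sketch only: replace the blocking take() with a bounded poll() so a hung
// remote executor cannot stall the fetching executor forever.
case class FetchResult(blockId: String, failed: Boolean)

def takeWithTimeout(results: LinkedBlockingQueue[FetchResult],
                    timeoutSeconds: Long = 60L): FetchResult = {
  val result = results.poll(timeoutSeconds, TimeUnit.SECONDS)
  if (result == null) {
    throw new TimeoutException(s"No fetch result received within $timeoutSeconds seconds")
  }
  result
}
{code}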



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2960) Spark executables fail to start via symlinks

2014-08-10 Thread Shay Rojansky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shay Rojansky updated SPARK-2960:
-

Priority: Minor  (was: Major)

 Spark executables fail to start via symlinks
 

 Key: SPARK-2960
 URL: https://issues.apache.org/jira/browse/SPARK-2960
 Project: Spark
  Issue Type: Bug
Reporter: Shay Rojansky
Priority: Minor
 Fix For: 1.0.2


 The current scripts (e.g. pyspark) fail to run when they are executed via 
 symlinks. A common Linux scenario would be to have Spark installed somewhere 
 (e.g. /opt) and have a symlink to it in /usr/bin.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2960) Spark executables fail to start via symlinks

2014-08-10 Thread Shay Rojansky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shay Rojansky updated SPARK-2960:
-

Summary: Spark executables fail to start via symlinks  (was: Spark 
executables failed to start via symlinks)

 Spark executables fail to start via symlinks
 

 Key: SPARK-2960
 URL: https://issues.apache.org/jira/browse/SPARK-2960
 Project: Spark
  Issue Type: Bug
Reporter: Shay Rojansky
 Fix For: 1.0.2


 The current scripts (e.g. pyspark) fail to run when they are executed via 
 symlinks. A common Linux scenario would be to have Spark installed somewhere 
 (e.g. /opt) and have a symlink to it in /usr/bin.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2960) Spark executables failed to start via symlinks

2014-08-10 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-2960:


 Summary: Spark executables failed to start via symlinks
 Key: SPARK-2960
 URL: https://issues.apache.org/jira/browse/SPARK-2960
 Project: Spark
  Issue Type: Bug
Reporter: Shay Rojansky
 Fix For: 1.0.2


The current scripts (e.g. pyspark) fail to run when they are executed via 
symlinks. A common Linux scenario would be to have Spark installed somewhere 
(e.g. /opt) and have a symlink to it in /usr/bin.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2961) Use statistics to skip partitions when reading from in-memory columnar data

2014-08-10 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2961:
---

 Summary: Use statistics to skip partitions when reading from 
in-memory columnar data
 Key: SPARK-2961
 URL: https://issues.apache.org/jira/browse/SPARK-2961
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2961) Use statistics to skip partitions when reading from in-memory columnar data

2014-08-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2961:


Target Version/s: 1.1.0

 Use statistics to skip partitions when reading from in-memory columnar data
 ---

 Key: SPARK-2961
 URL: https://issues.apache.org/jira/browse/SPARK-2961
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2961) Use statistics to skip partitions when reading from in-memory columnar data

2014-08-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092373#comment-14092373
 ] 

Apache Spark commented on SPARK-2961:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/1883

 Use statistics to skip partitions when reading from in-memory columnar data
 ---

 Key: SPARK-2961
 URL: https://issues.apache.org/jira/browse/SPARK-2961
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2062) VertexRDD.apply does not use the mergeFunc

2014-08-10 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092381#comment-14092381
 ] 

Larry Xiao commented on SPARK-2062:
---

Is anyone working on this? I'd like to take it.
My plan is to add a pass to do the merge; is that OK? [~ankurd]

 VertexRDD.apply does not use the mergeFunc
 --

 Key: SPARK-2062
 URL: https://issues.apache.org/jira/browse/SPARK-2062
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave

 Here: 
 https://github.com/apache/spark/blob/b1feb60209174433262de2a26d39616ba00edcc8/graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala#L410
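
A rough sketch of what "a pass to do the merge" could mean conceptually: collapse duplicate vertex IDs with the caller's mergeFunc before the VertexRDD is built, approximated here with a plain reduceByKey. This is only an illustration, not the actual VertexRDD.apply implementation.

{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext._
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

// Sketch only: apply the user-supplied mergeFunc to duplicate vertex IDs.
def mergeDuplicateVertices[VD: ClassTag](
    vertices: RDD[(VertexId, VD)],
    mergeFunc: (VD, VD) => VD): RDD[(VertexId, VD)] = {
  vertices.reduceByKey(mergeFunc)
}
{code}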



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2936) Migrate Netty network module from Java to Scala

2014-08-10 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-2936.
---

Resolution: Fixed

 Migrate Netty network module from Java to Scala
 ---

 Key: SPARK-2936
 URL: https://issues.apache.org/jira/browse/SPARK-2936
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle, Spark Core
Affects Versions: 1.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin

 The netty network module was originally written when Scala 2.9.x had a bug 
 that prevented a pure Scala implementation, so a subset of the files were 
 written in Java. We have since upgraded to Scala 2.10 and can now migrate all 
 of the Java files to Scala.
 https://github.com/netty/netty/issues/781
 https://github.com/mesos/spark/pull/522



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2962) Suboptimal scheduling in spark

2014-08-10 Thread Mridul Muralidharan (JIRA)
Mridul Muralidharan created SPARK-2962:
--

 Summary: Suboptimal scheduling in spark
 Key: SPARK-2962
 URL: https://issues.apache.org/jira/browse/SPARK-2962
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: All
Reporter: Mridul Muralidharan



In findTask, irrespective of the 'locality' specified, pendingTasksWithNoPrefs are 
always scheduled with PROCESS_LOCAL.

pendingTasksWithNoPrefs contains tasks which currently do not have any alive 
locations, but which could come up 'later': this is particularly relevant when the 
Spark app is just coming up and containers are still being added.

This causes a large number of non-node-local tasks to be scheduled, incurring 
significant network transfers in the cluster when running with non-trivial 
datasets.

The comment "// Look for no-pref tasks after rack-local tasks since they can 
run anywhere." is misleading in the method code: locality levels start from 
process_local and go down to any, so no-pref tasks get scheduled much before rack.


Also note that currentLocalityIndex is reset to the taskLocality returned by 
this method, so returning PROCESS_LOCAL as the level will trigger wait times 
again. (This was relevant before a recent change to the scheduler, and might be 
again depending on how this issue is resolved.)


Found as part of writing a test for SPARK-2931.
 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2962) Suboptimal scheduling in spark

2014-08-10 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092418#comment-14092418
 ] 

Matei Zaharia commented on SPARK-2962:
--

I thought this was fixed in https://github.com/apache/spark/pull/1313. Is that 
not the case?

 Suboptimal scheduling in spark
 --

 Key: SPARK-2962
 URL: https://issues.apache.org/jira/browse/SPARK-2962
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: All
Reporter: Mridul Muralidharan

 In findTask, irrespective of 'locality' specified, pendingTasksWithNoPrefs 
 are always scheduled with PROCESS_LOCAL
 pendingTasksWithNoPrefs contains tasks which currently do not have any alive 
 locations - but which could come in 'later' : particularly relevant when 
 spark app is just coming up and containers are still being added.
 This causes a large number of non node local tasks to be scheduled incurring 
 significant network transfers in the cluster when running with non trivial 
 datasets.
 The comment // Look for no-pref tasks after rack-local tasks since they can 
 run anywhere. is misleading in the method code : locality levels start from 
 process_local down to any, and so no prefs get scheduled much before rack.
 Also note that, currentLocalityIndex is reset to the taskLocality returned by 
 this method - so returning PROCESS_LOCAL as the level will trigger wait times 
 again. (Was relevant before recent change to scheduler, and might be again 
 based on resolution of this issue).
 Found as part of writing test for SPARK-2931
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2962) Suboptimal scheduling in spark

2014-08-10 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092427#comment-14092427
 ] 

Mridul Muralidharan commented on SPARK-2962:


To give more context:

a) Our jobs start by loading data from DFS, so this is the first stage that 
gets executed.

b) We sleep for 1 minute before starting the jobs (in case the cluster is 
busy, etc.) - unfortunately, this is not sufficient, and IIRC there is no 
programmatic way to wait more deterministically for X% of nodes (was something 
added to alleviate this? I did see some discussion).

c) This becomes more of a problem because Spark does not honour preferred 
locations anymore while running on YARN. See SPARK-208 - due to 1.0 interface 
changes.
[ Practically, if we use a large enough number of nodes (with replication 
of 3 or higher), we usually do end up with quite a lot of data-local tasks 
eventually - so (c) is not an immediate concern for our current jobs assuming 
(b) is not an issue, though it is suboptimal in the general case. ]



 Suboptimal scheduling in spark
 --

 Key: SPARK-2962
 URL: https://issues.apache.org/jira/browse/SPARK-2962
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: All
Reporter: Mridul Muralidharan

 In findTask, irrespective of 'locality' specified, pendingTasksWithNoPrefs 
 are always scheduled with PROCESS_LOCAL
 pendingTasksWithNoPrefs contains tasks which currently do not have any alive 
 locations - but which could come in 'later' : particularly relevant when 
 spark app is just coming up and containers are still being added.
 This causes a large number of non node local tasks to be scheduled incurring 
 significant network transfers in the cluster when running with non trivial 
 datasets.
 The comment // Look for no-pref tasks after rack-local tasks since they can 
 run anywhere. is misleading in the method code : locality levels start from 
 process_local down to any, and so no prefs get scheduled much before rack.
 Also note that, currentLocalityIndex is reset to the taskLocality returned by 
 this method - so returning PROCESS_LOCAL as the level will trigger wait times 
 again. (Was relevant before recent change to scheduler, and might be again 
 based on resolution of this issue).
 Found as part of writing test for SPARK-2931
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2962) Suboptimal scheduling in spark

2014-08-10 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092430#comment-14092430
 ] 

Mridul Muralidharan commented on SPARK-2962:


Hi [~matei],

  I am referencing the latest code (as of yesterday night).

pendingTasksWithNoPrefs currently contains both tasks which truly have no 
preference and tasks whose preferred locations are unavailable - and the 
latter is what is triggering this, since that can change during the execution 
of the stage.
Hope I am not missing something?

Thanks,
Mridul

 Suboptimal scheduling in spark
 --

 Key: SPARK-2962
 URL: https://issues.apache.org/jira/browse/SPARK-2962
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: All
Reporter: Mridul Muralidharan

 In findTask, irrespective of 'locality' specified, pendingTasksWithNoPrefs 
 are always scheduled with PROCESS_LOCAL
 pendingTasksWithNoPrefs contains tasks which currently do not have any alive 
 locations - but which could come in 'later' : particularly relevant when 
 spark app is just coming up and containers are still being added.
 This causes a large number of non node local tasks to be scheduled incurring 
 significant network transfers in the cluster when running with non trivial 
 datasets.
 The comment // Look for no-pref tasks after rack-local tasks since they can 
 run anywhere. is misleading in the method code : locality levels start from 
 process_local down to any, and so no prefs get scheduled much before rack.
 Also note that, currentLocalityIndex is reset to the taskLocality returned by 
 this method - so returning PROCESS_LOCAL as the level will trigger wait times 
 again. (Was relevant before recent change to scheduler, and might be again 
 based on resolution of this issue).
 Found as part of writing test for SPARK-2931
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2962) Suboptimal scheduling in spark

2014-08-10 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092431#comment-14092431
 ] 

Mridul Muralidharan commented on SPARK-2962:


Note, I don't think this is a regression in 1.1; it probably existed much 
earlier too.
Other issues (like SPARK-2089) are making us notice this - we moved to 1.1 from 
0.9 recently.

 Suboptimal scheduling in spark
 --

 Key: SPARK-2962
 URL: https://issues.apache.org/jira/browse/SPARK-2962
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: All
Reporter: Mridul Muralidharan

 In findTask, irrespective of 'locality' specified, pendingTasksWithNoPrefs 
 are always scheduled with PROCESS_LOCAL
 pendingTasksWithNoPrefs contains tasks which currently do not have any alive 
 locations - but which could come in 'later' : particularly relevant when 
 spark app is just coming up and containers are still being added.
 This causes a large number of non node local tasks to be scheduled incurring 
 significant network transfers in the cluster when running with non trivial 
 datasets.
 The comment // Look for no-pref tasks after rack-local tasks since they can 
 run anywhere. is misleading in the method code : locality levels start from 
 process_local down to any, and so no prefs get scheduled much before rack.
 Also note that, currentLocalityIndex is reset to the taskLocality returned by 
 this method - so returning PROCESS_LOCAL as the level will trigger wait times 
 again. (Was relevant before recent change to scheduler, and might be again 
 based on resolution of this issue).
 Found as part of writing test for SPARK-2931
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2962) Suboptimal scheduling in spark

2014-08-10 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092427#comment-14092427
 ] 

Mridul Muralidharan edited comment on SPARK-2962 at 8/11/14 4:35 AM:
-

To give more context; 

a) Our jobs start with load data from dfs as starting point : and so this is 
the first stage that gets executed.

b) We are sleeping for 1 minute before starting the jobs (in case cluster is 
busy, etc) - unfortunately, this is not sufficient and iirc there is no 
programmatic way to wait more deterministically for X% of node (was something 
added to alleviate this ? I did see some discussion)

c) This becomes more of a problem because spark does not honour preferred 
location anymore while running in yarn. See SPARK-2089 - due to 1.0 interface 
changes.
[ Practically, if we are using large enough number of nodes (with replication 
of 3 or higher), usually we do end up with quite of lot of data local tasks 
eventually - so (c) is not an immediate concern for our current jobs assuming 
(b) is not an issue, though it is suboptimal in general case ]




was (Author: mridulm80):
To give more context; 

a) Our jobs start with load data from dfs as starting point : and so this is 
the first stage that gets executed.

b) We are sleeping for 1 minute before starting the jobs (in case cluster is 
busy, etc) - unfortunately, this is not sufficient and iirc there is no 
programmatic way to wait more deterministically for X% of node (was something 
added to alleviate this ? I did see some discussion)

c) This becomes more of a problem because spark does not honour preferred 
location anymore while running in yarn. See SPARK-208 - due to 1.0 interface 
changes.
[ Practically, if we are using large enough number of nodes (with replication 
of 3 or higher), usually we do end up with quite of lot of data local tasks 
eventually - so (c) is not an immediate concern for our current jobs assuming 
(b) is not an issue, though it is suboptimal in general case ]



 Suboptimal scheduling in spark
 --

 Key: SPARK-2962
 URL: https://issues.apache.org/jira/browse/SPARK-2962
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: All
Reporter: Mridul Muralidharan

 In findTask, irrespective of 'locality' specified, pendingTasksWithNoPrefs 
 are always scheduled with PROCESS_LOCAL
 pendingTasksWithNoPrefs contains tasks which currently do not have any alive 
 locations - but which could come in 'later' : particularly relevant when 
 spark app is just coming up and containers are still being added.
 This causes a large number of non node local tasks to be scheduled incurring 
 significant network transfers in the cluster when running with non trivial 
 datasets.
 The comment // Look for no-pref tasks after rack-local tasks since they can 
 run anywhere. is misleading in the method code : locality levels start from 
 process_local down to any, and so no prefs get scheduled much before rack.
 Also note that, currentLocalityIndex is reset to the taskLocality returned by 
 this method - so returning PROCESS_LOCAL as the level will trigger wait times 
 again. (Was relevant before recent change to scheduler, and might be again 
 based on resolution of this issue).
 Found as part of writing test for SPARK-2931
  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2912) Jenkins should include the commit hash in his messages

2014-08-10 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092434#comment-14092434
 ] 

Michael Yannakopoulos commented on SPARK-2912:
--

Hi Nicholas,

I can work on this issue!

Thanks,
Michael

 Jenkins should include the commit hash in his messages
 --

 Key: SPARK-2912
 URL: https://issues.apache.org/jira/browse/SPARK-2912
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Reporter: Nicholas Chammas

 When there are multiple test cycles within a PR, it is not obvious what cycle 
 applies to what set of changes. This makes it more likely for committers to 
 merge a PR that has had new commits added since the last PR.
 Requirements:
 * Add the commit hash to Jenkins's messages so it's clear what the test cycle 
 corresponds to.
 * While you're at it, polish the formatting a bit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2912) Jenkins should include the commit hash in his messages

2014-08-10 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092435#comment-14092435
 ] 

Patrick Wendell commented on SPARK-2912:


Hey Michael - I believe [~nchammas] is already working on it actually, so I 
assigned him. 

 Jenkins should include the commit hash in his messages
 --

 Key: SPARK-2912
 URL: https://issues.apache.org/jira/browse/SPARK-2912
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas

 When there are multiple test cycles within a PR, it is not obvious what cycle 
 applies to what set of changes. This makes it more likely for committers to 
 merge a PR that has had new commits added since the last PR.
 Requirements:
 * Add the commit hash to Jenkins's messages so it's clear what the test cycle 
 corresponds to.
 * While you're at it, polish the formatting a bit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2912) Jenkins should include the commit hash in his messages

2014-08-10 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092437#comment-14092437
 ] 

Michael Yannakopoulos commented on SPARK-2912:
--

Thanks for the quick reply, Patrick! I will try to find another open issue 
to work on instead.

 Jenkins should include the commit hash in his messages
 --

 Key: SPARK-2912
 URL: https://issues.apache.org/jira/browse/SPARK-2912
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas

 When there are multiple test cycles within a PR, it is not obvious what cycle 
 applies to what set of changes. This makes it more likely for committers to 
 merge a PR that has had new commits added since the last PR.
 Requirements:
 * Add the commit hash to Jenkins's messages so it's clear what the test cycle 
 corresponds to.
 * While you're at it, polish the formatting a bit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2963) There is no documentation about building SparkSQL

2014-08-10 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-2963:
-

 Summary: There is no documentation about building SparkSQL
 Key: SPARK-2963
 URL: https://issues.apache.org/jira/browse/SPARK-2963
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Kousuke Saruta


Currently, if we'd like to use SparkSQL, we need to use the -Phive-thriftserver 
option when building, but this requirement is only implicit.
I think we need to describe how to build it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org