[jira] [Updated] (SPARK-5648) suppot alter view/table tableName unset tblproperties(k)

2015-02-06 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5648:
--
Description: 
Make HiveContext support unset tblproperties, for example:
alter view viewName unset tblproperties(k)
alter table tableName unset tblproperties(k)






  was:
Make HiveContext support unset tblproperties, for example:







 suppot alter view/table tableName unset tblproperties(k) 
 -

 Key: SPARK-5648
 URL: https://issues.apache.org/jira/browse/SPARK-5648
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: DoingDone9

 Make HiveContext support unset tblproperties, for example:
 alter view viewName unset tblproperties(k)
 alter table tableName unset tblproperties(k)
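
For illustration only, a minimal sketch (not part of the issue) of how the proposed statements would be issued through HiveContext once supported, assuming an existing SparkContext `sc` and placeholder table, view and property names:

{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`

// The statements this issue asks the SQL parser to accept:
hiveContext.sql("ALTER TABLE tableName UNSET TBLPROPERTIES ('k')")
hiveContext.sql("ALTER VIEW viewName UNSET TBLPROPERTIES ('k')")
{code}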






[jira] [Updated] (SPARK-5648) suppot alter ... unset tblproperties(key)

2015-02-06 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5648:
--
Summary: suppot alter ... unset tblproperties(key)   (was: suppot 
alter ... unset tblproperties(k) )

 suppot alter ... unset tblproperties(key) 
 --

 Key: SPARK-5648
 URL: https://issues.apache.org/jira/browse/SPARK-5648
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: DoingDone9

 Make HiveContext support unset tblproperties, for example:
 alter view viewName unset tblproperties(k)
 alter table tableName unset tblproperties(k)






[jira] [Updated] (SPARK-5648) suppot alter view/table tableName unset tblproperties(k)

2015-02-06 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5648:
--
Description: 
Make HiveContext support unset tblproperties, for example:






 suppot alter view/table tableName unset tblproperties(k) 
 -

 Key: SPARK-5648
 URL: https://issues.apache.org/jira/browse/SPARK-5648
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: DoingDone9

 Make HiveContext support unset tblproperties, for example:






[jira] [Updated] (SPARK-5648) suppot alter ... unset tblproperties(k)

2015-02-06 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5648:
--
Summary: suppot alter ... unset tblproperties(k)   (was: suppot alter 
view/table tableName unset tblproperties(k) )

 suppot alter ... unset tblproperties(k) 
 

 Key: SPARK-5648
 URL: https://issues.apache.org/jira/browse/SPARK-5648
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: DoingDone9

 Make HiveContext support unset tblproperties, for example:
 alter view viewName unset tblproperties(k)
 alter table tableName unset tblproperties(k)






[jira] [Updated] (SPARK-5648) support alter ... unset tblproperties(key)

2015-02-06 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5648:
--
Summary: support alter ... unset tblproperties(key)   (was: suppot 
alter ... unset tblproperties(key) )

 support alter ... unset tblproperties(key) 
 ---

 Key: SPARK-5648
 URL: https://issues.apache.org/jira/browse/SPARK-5648
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: DoingDone9

 Make HiveContext support unset tblproperties(key), for example:
 alter view viewName unset tblproperties(k)
 alter table tableName unset tblproperties(k)






[jira] [Created] (SPARK-5648) suppot alter view/table tableName unset tblproperties(k)

2015-02-06 Thread DoingDone9 (JIRA)
DoingDone9 created SPARK-5648:
-

 Summary: suppot alter view/table tableName unset 
tblproperties(k) 
 Key: SPARK-5648
 URL: https://issues.apache.org/jira/browse/SPARK-5648
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: DoingDone9









[jira] [Updated] (SPARK-5648) suppot alter ... unset tblproperties(key)

2015-02-06 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5648:
--
Description: 
Make HiveContext support unset tblproperties(key), for example:
alter view viewName unset tblproperties(k)
alter table tableName unset tblproperties(k)






  was:
Make HiveContext support unset tblproperties, for example:
alter view viewName unset tblproperties(k)
alter table tableName unset tblproperties(k)







 suppot alter ... unset tblproperties(key) 
 --

 Key: SPARK-5648
 URL: https://issues.apache.org/jira/browse/SPARK-5648
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: DoingDone9

 Make HiveContext support unset tblproperties(key), for example:
 alter view viewName unset tblproperties(k)
 alter table tableName unset tblproperties(k)






[jira] [Commented] (SPARK-2789) Apply names to RDD to becoming SchemaRDD

2015-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308825#comment-14308825
 ] 

Apache Spark commented on SPARK-2789:
-

User 'dwmclary' has created a pull request for this issue:
https://github.com/apache/spark/pull/4421

 Apply names to RDD to becoming SchemaRDD
 

 Key: SPARK-2789
 URL: https://issues.apache.org/jira/browse/SPARK-2789
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Davies Liu

 In order to simplify applying a schema, we could add an API called applyNames(), 
 which will infer the types in the RDD and create a schema with names, then 
 apply this schema to it to become a SchemaRDD. The names could be provided 
 as a String with the names separated by spaces.
 For example:
 rdd = sc.parallelize([("Alice", 10)])
 srdd = sqlCtx.applyNames(rdd, "name age")
 Users don't need to create a case class or StructType to have all the power of 
 Spark SQL.
 The string representation of the schema could also support nested structures 
 (MapType, ArrayType and StructType), for example:
 name age address(city zip) likes[title stars] props{[value type]}
 It is equivalent to the unnamed schema:
 root
 |--name
 |--age
 |--address
 |--|--city
 |--|--zip
 |--likes
 |--|--element
 |--|--|--title
 |--|--|--stars
 |--props
 |--|--key:
 |--|--value:
 |--|--|--element
 |--|--|--|--value
 |--|--|--|--type
 All field names are separated by spaces; the structure of a field (if it is 
 a nested type) follows the name without a space, and should start with ( 
 (StructType), [ (ArrayType) or { (MapType).






[jira] [Commented] (SPARK-5648) support alter ... unset tblproperties(key)

2015-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308835#comment-14308835
 ] 

Apache Spark commented on SPARK-5648:
-

User 'DoingDone9' has created a pull request for this issue:
https://github.com/apache/spark/pull/4423

 support alter ... unset tblproperties(key) 
 ---

 Key: SPARK-5648
 URL: https://issues.apache.org/jira/browse/SPARK-5648
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: DoingDone9

 Make HiveContext support unset tblproperties(key), for example:
 alter view viewName unset tblproperties(k)
 alter table tableName unset tblproperties(k)






[jira] [Commented] (SPARK-5598) Model import/export for ALS

2015-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308836#comment-14308836
 ] 

Apache Spark commented on SPARK-5598:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4422

 Model import/export for ALS
 ---

 Key: SPARK-5598
 URL: https://issues.apache.org/jira/browse/SPARK-5598
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Xiangrui Meng

 Please see parent JIRA for details on model import/export plans.






[jira] [Created] (SPARK-5655) YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode

2015-02-06 Thread Andrew Rowson (JIRA)
Andrew Rowson created SPARK-5655:


 Summary: YARN Auxiliary Shuffle service can't access shuffle files 
on Hadoop cluster configured in secure mode
 Key: SPARK-5655
 URL: https://issues.apache.org/jira/browse/SPARK-5655
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
 Environment: Both CDH5.3.0 and CDH5.1.3, latest build on branch-1.2
Reporter: Andrew Rowson


When running a Spark job on a YARN cluster which doesn't run containers under 
the same user as the nodemanager, and also when using the YARN auxiliary 
shuffle service, jobs fail with something similar to:

java.io.FileNotFoundException: 
/data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index
 (Permission denied)

The root cause of this is here: 
https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287

Spark will attempt to chmod 700 any application directories it creates during 
the job, which includes files created in the nodemanager's usercache directory. 
The owner of these files is the container UID, which on a secure cluster is the 
name of the user creating the job, and on a nonsecure cluster but with the 
yarn.nodemanager.container-executor.class configured is the value of 
yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.

The problem with this is that the auxiliary shuffle manager runs as part of the 
nodemanager, which is typically running as the user 'yarn'. This can't access 
these files that are only owner-readable.

YARN already attempts to secure files created under appcache but keep them 
readable by the nodemanager, by setting the group of the appcache directory to 
'yarn' and also setting the setgid flag. This means that files and directories 
created under this should also have the 'yarn' group. Normally this means that 
the nodemanager should also be able to read these files, but Spark setting 
chmod700 wipes this out.

I'm not sure what the right approach is here. Commenting out the chmod700 
functionality makes this work on YARN, and still makes the application files 
only readable by the owner and the group:

/data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c
 # ls -lah
total 206M
drwxr-s---  2 nobody yarn 4.0K Feb  6 18:30 .
drwxr-s--- 12 nobody yarn 4.0K Feb  6 18:30 ..
-rw-r-----  1 nobody yarn 206M Feb  6 18:30 shuffle_0_0_0.data

But this may not be the right approach on non-YARN. Perhaps an additional step 
to see if this chmod700 step is necessary (i.e. non-YARN) is required. Sadly, I 
don't have a non-YARN environment to test, otherwise I'd be able to suggest a 
patch.

I believe this is a related issue in the MapReduce framework: 
https://issues.apache.org/jira/browse/MAPREDUCE-3728
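
For illustration only, one possible shape of the extra check suggested above — skip the restrictive permissions when the directories live under YARN's usercache — sketched with an assumed `runningOnYarn` flag; this is not the actual fix, just a way to visualise it:

{code}
import java.io.File

// Hedged sketch: leave group access intact on YARN so the nodemanager-hosted shuffle
// service (running as 'yarn') can still read the files; chmod 700 only elsewhere.
def restrictPermissions(dir: File, runningOnYarn: Boolean): Unit = {
  if (!runningOnYarn) {
    // equivalent of chmod 700: owner-only read/write/execute
    dir.setReadable(false, false);   dir.setReadable(true, true)
    dir.setWritable(false, false);   dir.setWritable(true, true)
    dir.setExecutable(false, false); dir.setExecutable(true, true)
  }
}
{code}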






[jira] [Closed] (SPARK-5593) Replace BlockManager listener with Executor listener in ExecutorAllocationListener

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5593.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Assignee: Lianhui Wang
Target Version/s: 1.3.0

 Replace BlockManager listener with Executor listener in 
 ExecutorAllocationListener
 --

 Key: SPARK-5593
 URL: https://issues.apache.org/jira/browse/SPARK-5593
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Lianhui Wang
Assignee: Lianhui Wang
 Fix For: 1.3.0


 More strictly, in ExecutorAllocationListener we need to replace 
 onBlockManagerAdded and onBlockManagerRemoved with onExecutorAdded and 
 onExecutorRemoved, because the executor events express the intended meaning 
 more accurately. For example, in SPARK-5529 the BlockManager had been removed 
 but the executor still existed.
 [~andrewor14] [~sandyr] 
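
For illustration only, a minimal sketch (assuming the executor listener events available in this era of the listener API) of tracking executors from executor events rather than BlockManager events, so a removed BlockManager with a still-live executor is not miscounted:

{code}
import org.apache.spark.scheduler._

import scala.collection.mutable

class ExecutorTrackingListener extends SparkListener {
  private val executorIds = mutable.HashSet[String]()

  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit =
    executorIds += event.executorId

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit =
    executorIds -= event.executorId
}
{code}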






[jira] [Updated] (SPARK-5653) in ApplicationMaster rename isDriver to isClusterMode

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5653:
-
Affects Version/s: 1.2.0

 in ApplicationMaster rename isDriver to isClusterMode
 -

 Key: SPARK-5653
 URL: https://issues.apache.org/jira/browse/SPARK-5653
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Lianhui Wang
Assignee: Lianhui Wang
 Fix For: 1.3.0


 In ApplicationMaster, rename isDriver to isClusterMode. Client already uses 
 isClusterMode, so ApplicationMaster should be consistent with it, and 
 isClusterMode is easier to understand.
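
For illustration only, a hedged sketch of the rename (the argument type and initializer below are hypothetical stand-ins, not the real ApplicationMaster code); the flag answers "are we running in cluster mode?", so the new name states that directly:

{code}
// Hypothetical stand-in for ApplicationMaster's arguments, to illustrate the rename only.
case class AMArguments(userClass: String)

class ApplicationMasterSketch(args: AMArguments) {
  // Before: private val isDriver = args.userClass != null
  // After: the same flag, named for what it means (matching Client's isClusterMode).
  private val isClusterMode = args.userClass != null
}
{code}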






[jira] [Updated] (SPARK-5470) use defaultClassLoader of Serializer to load classes of classesToRegister in KryoSerializer

2015-02-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5470:
--
Fix Version/s: 1.3.0

 use defaultClassLoader of Serializer to load classes of classesToRegister in 
 KryoSerializer
 ---

 Key: SPARK-5470
 URL: https://issues.apache.org/jira/browse/SPARK-5470
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Lianhui Wang
Assignee: Lianhui Wang
 Fix For: 1.3.0, 1.4.0


 Currently KryoSerializer loads the classes listed in classesToRegister at the 
 time of its initialization. When we set spark.kryo.classesToRegister=class1, it 
 throws SparkException("Failed to load class to register with Kryo"), because at 
 KryoSerializer initialization time the class loader does not yet include the 
 classes from the user's jars.
 We need to use the Serializer's defaultClassLoader in newKryo(), because the 
 executor resets the Serializer's defaultClassLoader after the Serializer is 
 initialized.
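
For illustration only, a hedged sketch (method and field names follow the description above, not necessarily the real KryoSerializer internals) of resolving the registered classes inside newKryo() with the serializer's defaultClassLoader:

{code}
import com.esotericsoftware.kryo.Kryo

// Sketch: resolve spark.kryo.classesToRegister lazily in newKryo(), after the executor has
// set defaultClassLoader, so classes shipped in the user's jars can be found.
def registerUserClasses(kryo: Kryo,
                        classesToRegister: Seq[String],
                        defaultClassLoader: Option[ClassLoader]): Unit = {
  val loader = defaultClassLoader.getOrElse(Thread.currentThread.getContextClassLoader)
  classesToRegister.foreach { name =>
    kryo.register(Class.forName(name, true, loader))
  }
}
{code}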






[jira] [Updated] (SPARK-5593) Replace BlockManager listener with Executor listener in ExecutorAllocationListener

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5593:
-
Affects Version/s: 1.2.0

 Replace BlockManager listener with Executor listener in 
 ExecutorAllocationListener
 --

 Key: SPARK-5593
 URL: https://issues.apache.org/jira/browse/SPARK-5593
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Lianhui Wang

 More strictly, in ExecutorAllocationListener we need to replace 
 onBlockManagerAdded and onBlockManagerRemoved with onExecutorAdded and 
 onExecutorRemoved, because the executor events express the intended meaning 
 more accurately. For example, in SPARK-5529 the BlockManager had been removed 
 but the executor still existed.
 [~andrewor14] [~sandyr] 






[jira] [Updated] (SPARK-5653) in ApplicationMaster rename isDriver to isClusterMode

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5653:
-
Affects Version/s: (was: 1.2.0)
   1.0.0

 in ApplicationMaster rename isDriver to isClusterMode
 -

 Key: SPARK-5653
 URL: https://issues.apache.org/jira/browse/SPARK-5653
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Lianhui Wang
Assignee: Lianhui Wang
 Fix For: 1.3.0


 In ApplicationMaster, rename isDriver to isClusterMode. Client already uses 
 isClusterMode, so ApplicationMaster should be consistent with it, and 
 isClusterMode is easier to understand.






[jira] [Closed] (SPARK-5653) in ApplicationMaster rename isDriver to isClusterMode

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5653.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Assignee: Lianhui Wang
Target Version/s: 1.3.0

 in ApplicationMaster rename isDriver to isClusterMode
 -

 Key: SPARK-5653
 URL: https://issues.apache.org/jira/browse/SPARK-5653
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Lianhui Wang
Assignee: Lianhui Wang
 Fix For: 1.3.0


 In ApplicationMaster, rename isDriver to isClusterMode. Client already uses 
 isClusterMode, so ApplicationMaster should be consistent with it, and 
 isClusterMode is easier to understand.






[jira] [Closed] (SPARK-5396) Syntax error in spark scripts on windows.

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5396.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Assignee: Masayoshi TSUZUKI
Target Version/s: 1.3.0  (was: 1.2.0)

 Syntax error in spark scripts on windows.
 -

 Key: SPARK-5396
 URL: https://issues.apache.org/jira/browse/SPARK-5396
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0
 Environment: Window 7 and Window 8.1.
Reporter: Vladimir Protsenko
Assignee: Masayoshi TSUZUKI
Priority: Critical
 Fix For: 1.3.0

 Attachments: windows7.png, windows8.1.png


 I took the following steps: 
 1. downloaded and installed Scala 2.11.5 
 2. downloaded spark 1.2.0 via git clone git://github.com/apache/spark.git 
 3. ran dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean 
 package (in git bash) 
 After the build I tried to run spark-shell.cmd in a cmd shell and it reports a 
 syntax error in the file. The same happens with spark-shell2.cmd, 
 spark-submit.cmd and spark-submit2.cmd.
 !windows7.png!






[jira] [Commented] (SPARK-5656) NegativeArraySizeException in EigenValueDecomposition.symmetricEigs for large n and/or large k

2015-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309684#comment-14309684
 ] 

Apache Spark commented on SPARK-5656:
-

User 'mbittmann' has created a pull request for this issue:
https://github.com/apache/spark/pull/4433

 NegativeArraySizeException in EigenValueDecomposition.symmetricEigs for large 
 n and/or large k
 --

 Key: SPARK-5656
 URL: https://issues.apache.org/jira/browse/SPARK-5656
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Mark Bittmann
Priority: Minor

 Large values of n or k in EigenValueDecomposition.symmetricEigs will fail 
 with a NegativeArraySizeException. Specifically, this occurs when 2*n*k > 
 Integer.MAX_VALUE. These values are currently unchecked and allow for the 
 array to be initialized to a value greater than Integer.MAX_VALUE. I have 
 written the below 'require' to fail this condition gracefully. I will submit 
 a pull request. 
 require(ncv * n.toLong < Integer.MAX_VALUE, "Product of 2*k*n must be smaller than " +
   s"Integer.MAX_VALUE. Found required eigenvalues k = $k and matrix dimension n = $n")
 Here is the exception that occurs from computeSVD with large k and/or n: 
 Exception in thread "main" java.lang.NegativeArraySizeException
   at 
 org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:85)
   at 
 org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:258)
   at 
 org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:190)






[jira] [Updated] (SPARK-540) Add API to customize in-memory representation of RDDs

2015-02-06 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-540:
---
Component/s: Spark Core

 Add API to customize in-memory representation of RDDs
 -

 Key: SPARK-540
 URL: https://issues.apache.org/jira/browse/SPARK-540
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Matei Zaharia

 Right now the choice between serialized caching and just Java objects in dev 
 is fine, but it might be cool to also support structures such as 
 column-oriented storage through arrays of primitives without forcing it 
 through the serialization interface.






[jira] [Updated] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4705:
-
Target Version/s: 1.4.0

 Driver retries in yarn-cluster mode always fail if event logging is enabled
 ---

 Key: SPARK-4705
 URL: https://issues.apache.org/jira/browse/SPARK-4705
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin

 yarn-cluster mode will retry running the driver in certain failure modes. If 
 event logging is enabled, the retry will most probably fail, because:
 {noformat}
 Exception in thread "Driver" java.io.IOException: Log directory 
 hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
  already exists!
 at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
 at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
 at 
 org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:353)
 {noformat}
 The event log path should be more unique. Or perhaps retries of the same app 
 should clean up the old logs first.
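
For illustration only, a hedged sketch of the "more unique path" idea — append the application attempt to the event-log directory name so each retry writes to its own directory (the base path and naming scheme here are assumptions, not the actual change):

{code}
// Sketch: one event-log directory per attempt instead of one per application.
def eventLogDir(baseDir: String, appId: String, attemptId: Int): String =
  s"$baseDir/${appId}_attempt-$attemptId"

// e.g. eventLogDir("hdfs:///user/spark/applicationHistory",
//                  "application_1417554558066_0003", 2)
//   == "hdfs:///user/spark/applicationHistory/application_1417554558066_0003_attempt-2"
{code}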






[jira] [Resolved] (SPARK-5619) Support 'show roles' in HiveContext

2015-02-06 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5619.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4397
[https://github.com/apache/spark/pull/4397]

 Support 'show roles' in HiveContext
 ---

 Key: SPARK-5619
 URL: https://issues.apache.org/jira/browse/SPARK-5619
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yadong Qi
 Fix For: 1.3.0









[jira] [Created] (SPARK-5657) Add PySpark Avro Output Format example

2015-02-06 Thread Stanislav Los (JIRA)
Stanislav Los created SPARK-5657:


 Summary: Add PySpark Avro Output Format example
 Key: SPARK-5657
 URL: https://issues.apache.org/jira/browse/SPARK-5657
 Project: Spark
  Issue Type: Improvement
Reporter: Stanislav Los


There is an Avro Input Format example that shows how to read Avro data in 
PySpark, but nothing shows how to write from PySpark to Avro. The main 
challenge is that a Converter needs an Avro schema to build a record, but the 
current Spark API doesn't provide a way to supply extra parameters to custom 
converters. A workaround is possible and is provided.






[jira] [Resolved] (SPARK-5278) check ambiguous reference to fields in Spark SQL is incompleted

2015-02-06 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5278.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4068
[https://github.com/apache/spark/pull/4068]

 check ambiguous reference to fields in Spark SQL is incompleted
 ---

 Key: SPARK-5278
 URL: https://issues.apache.org/jira/browse/SPARK-5278
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
 Fix For: 1.3.0


 In HiveContext, for a JSON string like
 {code}{"a": {"b": 1, "B": 2}}{code}
 the SQL `SELECT a.b from t` will report an error for an ambiguous reference to 
 fields.
 But for a JSON string like
 {code}{"a": [{"b": 1, "B": 2}]}{code}
 the SQL `SELECT a[0].b from t` will pass and pick the first `b`.
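
For illustration only, a minimal sketch of reproducing the inconsistency (assumes an existing SparkContext `sc`, a HiveContext named `hiveCtx`, and the 1.2-era jsonRDD API):

{code}
val flat   = sc.parallelize(Seq("""{"a": {"b": 1, "B": 2}}"""))
val nested = sc.parallelize(Seq("""{"a": [{"b": 1, "B": 2}]}"""))

hiveCtx.jsonRDD(flat).registerTempTable("t1")
hiveCtx.jsonRDD(nested).registerTempTable("t2")

hiveCtx.sql("SELECT a.b FROM t1").collect()     // fails: ambiguous reference to fields
hiveCtx.sql("SELECT a[0].b FROM t2").collect()  // passes and silently picks the first `b`
{code}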






[jira] [Updated] (SPARK-5416) Initialize Executor.threadPool before ExecutorSource

2015-02-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5416:
--
Fix Version/s: 1.3.0

 Initialize Executor.threadPool before ExecutorSource
 

 Key: SPARK-5416
 URL: https://issues.apache.org/jira/browse/SPARK-5416
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Ryan Williams
Assignee: Ryan Williams
Priority: Minor
 Fix For: 1.3.0, 1.4.0


 I recently saw some NPEs from 
 [{{ExecutorSource:44}}|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/ExecutorSource.scala#L44]
  in the first couple seconds of my executors' being initialized.
 I think that {{ExecutorSource}} was trying to report these metrics before its 
 threadpool was initialized; there are a few LoC between the source being 
 registered 
 ([Executor.scala:82|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L82])
  and the threadpool being initialized 
 ([Executor.scala:106|https://github.com/apache/spark/blob/0497ea51ac345f8057d222a18dbbf8eae78f5b92/core/src/main/scala/org/apache/spark/executor/Executor.scala#L106]).
 We should initialize the threadpool before the ExecutorSource is registered.
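
For illustration only, a hedged sketch of the intended ordering (the class and field names below are hypothetical stand-ins, not the real Executor code): create the thread pool first, then construct and register anything that reads from it, so the source can never observe a null pool:

{code}
import java.util.concurrent.{ExecutorService, Executors, ThreadPoolExecutor}

// Hypothetical stand-in for a metrics source that reads the pool's state.
class ExecutorSourceSketch(threadPool: ExecutorService) {
  def activeTasks: Int = threadPool match {
    case p: ThreadPoolExecutor => p.getActiveCount
    case _                     => 0
  }
}

// The fix, expressed as an ordering change: the pool exists before its consumers.
val threadPool = Executors.newCachedThreadPool()
val executorSource = new ExecutorSourceSketch(threadPool)  // safe: threadPool is non-null here
{code}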






[jira] [Updated] (SPARK-5396) Syntax error in spark scripts on windows.

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5396:
-
Affects Version/s: (was: 1.2.0)
   1.3.0

 Syntax error in spark scripts on windows.
 -

 Key: SPARK-5396
 URL: https://issues.apache.org/jira/browse/SPARK-5396
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.0
 Environment: Window 7 and Window 8.1.
Reporter: Vladimir Protsenko
Assignee: Masayoshi TSUZUKI
Priority: Critical
 Fix For: 1.3.0

 Attachments: windows7.png, windows8.1.png


 I took the following steps: 
 1. downloaded and installed Scala 2.11.5 
 2. downloaded spark 1.2.0 via git clone git://github.com/apache/spark.git 
 3. ran dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean 
 package (in git bash) 
 After the build I tried to run spark-shell.cmd in a cmd shell and it reports a 
 syntax error in the file. The same happens with spark-shell2.cmd, 
 spark-submit.cmd and spark-submit2.cmd.
 !windows7.png!






[jira] [Resolved] (SPARK-5603) Preinsert casting and renaming rule is needed in the Analyzer

2015-02-06 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5603.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4373
[https://github.com/apache/spark/pull/4373]

 Preinsert casting and renaming rule is needed in the Analyzer
 -

 Key: SPARK-5603
 URL: https://issues.apache.org/jira/browse/SPARK-5603
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Priority: Blocker
 Fix For: 1.3.0


 For an INSERT INTO/OVERWRITE statement, we should add necessary Cast and 
 Alias to the output of the query.
 {code}
 CREATE TEMPORARY TABLE jsonTable (a int, b string)
 USING org.apache.spark.sql.json.DefaultSource
 OPTIONS (
   path '...'
 )
 INSERT OVERWRITE TABLE jsonTable SELECT a * 2, a * 4 FROM table
 {code}
 For a*2, we should create an Alias, so the InsertableRelation can know it is 
 the column a. For a*4, it is actually the column b in jsonTable. We should 
 first cast it to StringType and add an Alias b to it.






[jira] [Created] (SPARK-5656) NegativeArraySizeException in EigenValueDecomposition.symmetricEigs for large n and/or large k

2015-02-06 Thread Mark Bittmann (JIRA)
Mark Bittmann created SPARK-5656:


 Summary: NegativeArraySizeException in 
EigenValueDecomposition.symmetricEigs for large n and/or large k
 Key: SPARK-5656
 URL: https://issues.apache.org/jira/browse/SPARK-5656
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Mark Bittmann
Priority: Minor


Large values of n or k in EigenValueDecomposition.symmetricEigs will fail with 
a NegativeArraySizeException. Specifically, this occurs when 2*n*k > 
Integer.MAX_VALUE. These values are currently unchecked and allow for the array 
to be initialized to a value greater than Integer.MAX_VALUE. I have written the 
below 'require' to fail this condition gracefully. I will submit a pull 
request. 

require(ncv * n < Integer.MAX_VALUE, "Product of 2*k*n must be smaller than " +
  s"Integer.MAX_VALUE. Found required eigenvalues k = $k and matrix dimension n = $n")


Here is the exception that occurs from computeSVD with large k and/or n: 

Exception in thread "main" java.lang.NegativeArraySizeException
at 
org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:85)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:258)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:190)
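
For illustration only, a standalone sketch (the sizes below are made up) of why the check has to be done in Long arithmetic — 2*k*n silently wraps around when computed as an Int:

{code}
val n = 200000            // matrix dimension (example value)
val k = 10000             // requested eigenvalues (example value)
val ncv = 2 * k           // the "2*k" work-space factor referred to above

println(ncv * n)          // Int arithmetic wraps: prints a negative number
println(ncv.toLong * n)   // 4000000000, well above Integer.MAX_VALUE (2147483647)
println(ncv.toLong * n < Int.MaxValue)  // false — so the guard must compare in Long, as above
{code}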






[jira] [Updated] (SPARK-3956) Python API for Distributed Matrix

2015-02-06 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-3956:

Component/s: PySpark

 Python API for Distributed Matrix
 -

 Key: SPARK-3956
 URL: https://issues.apache.org/jira/browse/SPARK-3956
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Minor

 Python API for distributed matrix






[jira] [Updated] (SPARK-5655) YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode

2015-02-06 Thread Andrew Rowson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Rowson updated SPARK-5655:
-
Description: 
When running a Spark job on a YARN cluster which doesn't run containers under 
the same user as the nodemanager, and also when using the YARN auxiliary 
shuffle service, jobs fail with something similar to:
{code:java}
java.io.FileNotFoundException: 
/data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index
 (Permission denied)
{code}

The root cause of this is here: 
https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287

Spark will attempt to chmod 700 any application directories it creates during 
the job, which includes files created in the nodemanager's usercache directory. 
The owner of these files is the container UID, which on a secure cluster is the 
name of the user creating the job, and on a nonsecure cluster but with the 
yarn.nodemanager.container-executor.class configured is the value of 
yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.

The problem with this is that the auxiliary shuffle manager runs as part of the 
nodemanager, which is typically running as the user 'yarn'. This can't access 
these files that are only owner-readable.

YARN already attempts to secure files created under appcache but keep them 
readable by the nodemanager, by setting the group of the appcache directory to 
'yarn' and also setting the setgid flag. This means that files and directories 
created under this should also have the 'yarn' group. Normally this means that 
the nodemanager should also be able to read these files, but Spark setting 
chmod700 wipes this out.

I'm not sure what the right approach is here. Commenting out the chmod700 
functionality makes this work on YARN, and still makes the application files 
only readable by the owner and the group:

{code}
/data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c
 # ls -lah
total 206M
drwxr-s---  2 nobody yarn 4.0K Feb  6 18:30 .
drwxr-s--- 12 nobody yarn 4.0K Feb  6 18:30 ..
-rw-r-----  1 nobody yarn 206M Feb  6 18:30 shuffle_0_0_0.data
{code}
But this may not be the right approach on non-YARN. Perhaps an additional step 
to see if this chmod700 step is necessary (i.e. non-YARN) is required. Sadly, I 
don't have a non-YARN environment to test, otherwise I'd be able to suggest a 
patch.

I believe this is a related issue in the MapReduce framework: 
https://issues.apache.org/jira/browse/MAPREDUCE-3728

  was:
When running a Spark job on a YARN cluster which doesn't run containers under 
the same user as the nodemanager, and also when using the YARN auxiliary 
shuffle service, jobs fail with something similar to:

java.io.FileNotFoundException: 
/data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index
 (Permission denied)


The root cause of this is here: 
https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287

Spark will attempt to chmod 700 any application directories it creates during 
the job, which includes files created in the nodemanager's usercache directory. 
The owner of these files is the container UID, which on a secure cluster is the 
name of the user creating the job, and on a nonsecure cluster but with the 
yarn.nodemanager.container-executor.class configured is the value of 
yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.

The problem with this is that the auxiliary shuffle manager runs as part of the 
nodemanager, which is typically running as the user 'yarn'. This can't access 
these files that are only owner-readable.

YARN already attempts to secure files created under appcache but keep them 
readable by the nodemanager, by setting the group of the appcache directory to 
'yarn' and also setting the setgid flag. This means that files and directories 
created under this should also have the 'yarn' group. Normally this means that 
the nodemanager should also be able to read these files, but Spark setting 
chmod700 wipes this out.

I'm not sure what the right approach is here. Commenting out the chmod700 
functionality makes this work on YARN, and still makes the application files 
only readable by the owner and the group:

/data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c
 # ls -lah
total 206M
drwxr-s---  2 nobody yarn 4.0K Feb  6 18:30 .
drwxr-s--- 12 nobody yarn 4.0K Feb  6 18:30 ..
-rw-r-----  1 nobody yarn 206M Feb  6 18:30 shuffle_0_0_0.data

But this may not be the right approach on non-YARN. Perhaps an additional step 
to see if this chmod700 step is necessary (i.e. non-YARN) is required. 

[jira] [Updated] (SPARK-5655) YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode

2015-02-06 Thread Andrew Rowson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Rowson updated SPARK-5655:
-
Description: 
When running a Spark job on a YARN cluster which doesn't run containers under 
the same user as the nodemanager, and also when using the YARN auxiliary 
shuffle service, jobs fail with something similar to:

java.io.FileNotFoundException: 
/data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index
 (Permission denied)

The root cause of this is here: 
https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287

Spark will attempt to chmod 700 any application directories it creates during 
the job, which includes files created in the nodemanager's usercache directory. 
The owner of these files is the container UID, which on a secure cluster is the 
name of the user creating the job, and on a nonsecure cluster but with the 
yarn.nodemanager.container-executor.class configured is the value of 
yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.

The problem with this is that the auxiliary shuffle manager runs as part of the 
nodemanager, which is typically running as the user 'yarn'. This can't access 
these files that are only owner-readable.

YARN already attempts to secure files created under appcache but keep them 
readable by the nodemanager, by setting the group of the appcache directory to 
'yarn' and also setting the setgid flag. This means that files and directories 
created under this should also have the 'yarn' group. Normally this means that 
the nodemanager should also be able to read these files, but Spark setting 
chmod700 wipes this out.

I'm not sure what the right approach is here. Commenting out the chmod700 
functionality makes this work on YARN, and still makes the application files 
only readable by the owner and the group:

/data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c
 # ls -lah
total 206M
drwxr-s---  2 nobody yarn 4.0K Feb  6 18:30 .
drwxr-s--- 12 nobody yarn 4.0K Feb  6 18:30 ..
-rw-r-----  1 nobody yarn 206M Feb  6 18:30 shuffle_0_0_0.data

But this may not be the right approach on non-YARN. Perhaps an additional step 
to see if this chmod700 step is necessary (i.e. non-YARN) is required. Sadly, I 
don't have a non-YARN environment to test, otherwise I'd be able to suggest a 
patch.

I believe this is a related issue in the MapReduce framework: 
https://issues.apache.org/jira/browse/MAPREDUCE-3728

  was:
When running a Spark job on a YARN cluster which doesn't run containers under 
the same user as the nodemanager, and also when using the YARN auxiliary 
shuffle service, jobs fail with something similar to:

java.io.FileNotFoundException: 
/data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index
 (Permission denied)

The root cause of this is here: 
https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287

Spark will attempt to chmod 700 any application directories it creates during 
the job, which includes files created in the nodemanager's usercache directory. 
The owner of these files is the container UID, which on a secure cluster is the 
name of the user creating the job, and on a nonsecure cluster but with the 
yarn.nodemanager.container-executor.class configured is the value of 
yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.

The problem with this is that the auxiliary shuffle manager runs as part of the 
nodemanager, which is typically running as the user 'yarn'. This can't access 
these files that are only owner-readable.

YARN already attempts to secure files created under appcache but keep them 
readable by the nodemanager, by setting the group of the appcache directory to 
'yarn' and also setting the setgid flag. This means that files and directories 
created under this should also have the 'yarn' group. Normally this means that 
the nodemanager should also be able to read these files, but Spark setting 
chmod700 wipes this out.

I'm not sure what the right approach is here. Commenting out the chmod700 
functionality makes this work on YARN, and still makes the application files 
only readable by the owner and the group:

/data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c
 # ls -lah
total 206M
drwxr-s---  2 nobody yarn 4.0K Feb  6 18:30 .
drwxr-s--- 12 nobody yarn 4.0K Feb  6 18:30 ..
-rw-r-----  1 nobody yarn 206M Feb  6 18:30 shuffle_0_0_0.data

But this may not be the right approach on non-YARN. Perhaps an additional step 
to see if this chmod700 step is necessary (i.e. non-YARN) is required. Sadly, I 
don't have a non-YARN 

[jira] [Commented] (SPARK-1799) Add init script to the debian packaging

2015-02-06 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309753#comment-14309753
 ] 

Nicholas Chammas commented on SPARK-1799:
-

cc [~markhamstra], [~srowen], [~pwendell]

 Add init script to the debian packaging
 ---

 Key: SPARK-1799
 URL: https://issues.apache.org/jira/browse/SPARK-1799
 Project: Spark
  Issue Type: New Feature
Reporter: Nicolas Lalevée

 See https://github.com/apache/spark/pull/733






[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode

2015-02-06 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309784#comment-14309784
 ] 

Patrick Wendell commented on SPARK-5388:


On DELETE, I'll defer to you guys; I have zero strong feelings either way.

 Provide a stable application submission gateway in standalone cluster mode
 --

 Key: SPARK-5388
 URL: https://issues.apache.org/jira/browse/SPARK-5388
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker
 Attachments: stable-spark-submit-in-standalone-mode-2-4-15.pdf


 The existing submission gateway in standalone mode is not compatible across 
 Spark versions. If you have a newer version of Spark submitting to an older 
 version of the standalone Master, it is currently not guaranteed to work. The 
 goal is to provide a stable REST interface to replace this channel.
 For more detail, please see the most recent design doc attached.






[jira] [Resolved] (SPARK-5595) In memory data cache should be invalidated after insert into/overwrite

2015-02-06 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5595.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4373
[https://github.com/apache/spark/pull/4373]

 In memory data cache should be invalidated after insert into/overwrite
 --

 Key: SPARK-5595
 URL: https://issues.apache.org/jira/browse/SPARK-5595
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Priority: Blocker
 Fix For: 1.3.0









[jira] [Closed] (SPARK-4337) Add ability to cancel pending requests to YARN

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4337.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Assignee: Sandy Ryza
Target Version/s: 1.3.0

 Add ability to cancel pending requests to YARN
 --

 Key: SPARK-4337
 URL: https://issues.apache.org/jira/browse/SPARK-4337
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 1.3.0


 This will be useful for things like SPARK-4136






[jira] [Updated] (SPARK-5655) YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode

2015-02-06 Thread Andrew Rowson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Rowson updated SPARK-5655:
-
Description: 
When running a Spark job on a YARN cluster which doesn't run containers under 
the same user as the nodemanager, and also when using the YARN auxiliary 
shuffle service, jobs fail with something similar to:
{code:borderStyle=solid}
java.io.FileNotFoundException: 
/data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index
 (Permission denied)
{code}

The root cause of this is here: 
https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287

Spark will attempt to chmod 700 any application directories it creates during 
the job, which includes files created in the nodemanager's usercache directory. 
The owner of these files is the container UID, which on a secure cluster is the 
name of the user creating the job, and on a nonsecure cluster but with the 
yarn.nodemanager.container-executor.class configured is the value of 
yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.

The problem with this is that the auxiliary shuffle manager runs as part of the 
nodemanager, which is typically running as the user 'yarn'. This can't access 
these files that are only owner-readable.

YARN already attempts to secure files created under appcache but keep them 
readable by the nodemanager, by setting the group of the appcache directory to 
'yarn' and also setting the setgid flag. This means that files and directories 
created under this should also have the 'yarn' group. Normally this means that 
the nodemanager should also be able to read these files, but Spark setting 
chmod700 wipes this out.

I'm not sure what the right approach is here. Commenting out the chmod700 
functionality makes this work on YARN, and still makes the application files 
only readable by the owner and the group:

/data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c
 # ls -lah
total 206M
drwxr-s---  2 nobody yarn 4.0K Feb  6 18:30 .
drwxr-s--- 12 nobody yarn 4.0K Feb  6 18:30 ..
-rw-r-----  1 nobody yarn 206M Feb  6 18:30 shuffle_0_0_0.data

But this may not be the right approach on non-YARN. Perhaps an additional step 
to see if this chmod700 step is necessary (i.e. non-YARN) is required. Sadly, I 
don't have a non-YARN environment to test, otherwise I'd be able to suggest a 
patch.

I believe this is a related issue in the MapReduce framework: 
https://issues.apache.org/jira/browse/MAPREDUCE-3728

  was:
When running a Spark job on a YARN cluster which doesn't run containers under 
the same user as the nodemanager, and also when using the YARN auxiliary 
shuffle service, jobs fail with something similar to:

java.io.FileNotFoundException: 
/data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index
 (Permission denied)

The root cause of this is here: 
https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287

Spark will attempt to chmod 700 any application directories it creates during 
the job, which includes files created in the nodemanager's usercache directory. 
The owner of these files is the container UID, which on a secure cluster is the 
name of the user creating the job, and on a nonsecure cluster but with the 
yarn.nodemanager.container-executor.class configured is the value of 
yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.

The problem with this is that the auxiliary shuffle manager runs as part of the 
nodemanager, which is typically running as the user 'yarn'. This can't access 
these files that are only owner-readable.

YARN already attempts to secure files created under appcache but keep them 
readable by the nodemanager, by setting the group of the appcache directory to 
'yarn' and also setting the setgid flag. This means that files and directories 
created under this should also have the 'yarn' group. Normally this means that 
the nodemanager should also be able to read these files, but Spark setting 
chmod700 wipes this out.

I'm not sure what the right approach is here. Commenting out the chmod700 
functionality makes this work on YARN, and still makes the application files 
only readable by the owner and the group:

/data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c
 # ls -lah
total 206M
drwxr-s---  2 nobody yarn 4.0K Feb  6 18:30 .
drwxr-s--- 12 nobody yarn 4.0K Feb  6 18:30 ..
-rw-r-----  1 nobody yarn 206M Feb  6 18:30 shuffle_0_0_0.data

But this may not be the right approach on non-YARN. Perhaps an additional step 
to see if this chmod700 step is necessary (i.e. non-YARN) is 

[jira] [Updated] (SPARK-5656) NegativeArraySizeException in EigenValueDecomposition.symmetricEigs for large n and/or large k

2015-02-06 Thread Mark Bittmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Bittmann updated SPARK-5656:
-
Description: 
Large values of n or k in EigenValueDecomposition.symmetricEigs will fail with 
a NegativeArraySizeException. Specifically, this occurs when 2*n*k > 
Integer.MAX_VALUE. These values are currently unchecked and allow for the array 
to be initialized to a value greater than Integer.MAX_VALUE. I have written the 
below 'require' to fail this condition gracefully. I will submit a pull 
request. 

require(ncv * n.toLong < Integer.MAX_VALUE, "Product of 2*k*n must be smaller than " +
  s"Integer.MAX_VALUE. Found required eigenvalues k = $k and matrix dimension n = $n")


Here is the exception that occurs from computeSVD with large k and/or n: 

Exception in thread "main" java.lang.NegativeArraySizeException
at 
org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:85)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:258)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:190)

  was:
Large values of n or k in EigenValueDecomposition.symmetricEigs will fail with 
a NegativeArraySizeException. Specifically, this occurs when 2*n*k > 
Integer.MAX_VALUE. These values are currently unchecked and allow for the array 
to be initialized to a value greater than Integer.MAX_VALUE. I have written the 
below 'require' to fail this condition gracefully. I will submit a pull 
request. 

require(ncv * n < Integer.MAX_VALUE, "Product of 2*k*n must be smaller than " +
  s"Integer.MAX_VALUE. Found required eigenvalues k = $k and matrix dimension n = $n")


Here is the exception that occurs from computeSVD with large k and/or n: 

Exception in thread "main" java.lang.NegativeArraySizeException
at 
org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:85)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:258)
at 
org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:190)


 NegativeArraySizeException in EigenValueDecomposition.symmetricEigs for large 
 n and/or large k
 --

 Key: SPARK-5656
 URL: https://issues.apache.org/jira/browse/SPARK-5656
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Mark Bittmann
Priority: Minor

 Large values of n or k in EigenValueDecomposition.symmetricEigs will fail 
 with a NegativeArraySizeException. Specifically, this occurs when 2*n*k > 
 Integer.MAX_VALUE. These values are currently unchecked and allow for the 
 array to be initialized to a value greater than Integer.MAX_VALUE. I have 
 written the below 'require' to fail this condition gracefully. I will submit 
 a pull request. 
 require(ncv * n.toLong < Integer.MAX_VALUE, "Product of 2*k*n must be smaller than " +
   s"Integer.MAX_VALUE. Found required eigenvalues k = $k and matrix dimension n = $n")
 Here is the exception that occurs from computeSVD with large k and/or n: 
 Exception in thread main java.lang.NegativeArraySizeException
   at 
 org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:85)
   at 
 org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:258)
   at 
 org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:190)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5618) Optimise utility code.

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5618.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Assignee: Makoto Fukuhara
Target Version/s: 1.3.0

 Optimise utility code.
 --

 Key: SPARK-5618
 URL: https://issues.apache.org/jira/browse/SPARK-5618
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Makoto Fukuhara
Assignee: Makoto Fukuhara
Priority: Minor
 Fix For: 1.3.0


 I refactored the evaluation timing and removed an unnecessary Regex API call,
 because the Regex API is heavy.
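 As a rough sketch of this kind of change (illustrative names, not the actual 
 patch): hoisting the pattern into a val compiles it once instead of on every 
 call.
 {code}
 object RegexHoistingSketch {
   // Before: String.matches recompiles the pattern on every call.
   def isWordSlow(s: String): Boolean = s.matches("[A-Za-z]+")

   // After: compile once and reuse the underlying java.util.regex.Pattern.
   private val Word = "[A-Za-z]+".r
   def isWordFast(s: String): Boolean = Word.pattern.matcher(s).matches()
 }
 {code}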



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-560) Specialize RDDs / iterators

2015-02-06 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-560:
---
Component/s: Spark Core

 Specialize RDDs / iterators
 ---

 Key: SPARK-560
 URL: https://issues.apache.org/jira/browse/SPARK-560
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Matei Zaharia

 When you're working on in-memory data, the overhead of boxing / unboxing 
 starts to matter, and it looks like specializing would give a 2-4x speedup. 
 We can't just throw in @specialized though because Scala's Iterator is not 
 specialized. We probably need to make our own and also ensure that the right 
 methods get called remotely when you have a chain of RDDs (i.e. it doesn't 
 lose its specialization).
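 A minimal sketch of the idea (hypothetical names, not Spark code): a 
 specialized iterator trait whose next() returns unboxed primitives.
 {code}
 trait SpecIterator[@specialized(Int, Long, Double) T] {
   def hasNext: Boolean
   def next(): T
 }

 // Example: iterating a primitive Int array without boxing each element.
 class IntArrayIterator(xs: Array[Int]) extends SpecIterator[Int] {
   private var i = 0
   def hasNext: Boolean = i < xs.length
   def next(): Int = { val v = xs(i); i += 1; v }
 }
 {code}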



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode

2015-02-06 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309825#comment-14309825
 ] 

Patrick Wendell commented on SPARK-5388:


On the boolean and numeric values: I don't mind one way or the other how they 
are handled programmatically (since we are not exposing this). However, it does 
seem weird that the wire protocol defines these as string types. I looked at 
a few other APIs (GitHub, Twitter, etc.) and they all use proper boolean types. 
So I'd definitely recommend setting them as proper types in the JSON, and if 
that's easier to do by making them nullable Boolean and Long values, that seems 
like a good approach.
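A small sketch of that approach (class and field names are made up, not the 
actual protocol messages): modelling optional fields as nullable boxed types 
lets a JSON mapper emit real booleans and numbers, or null, rather than strings.

{code}
// Hypothetical request message; java.lang.Boolean/Long allow null for "not set".
case class SubmitRequestSketch(
    appName: String,
    supervise: java.lang.Boolean,  // rendered as true/false or null in JSON
    driverMemoryMb: java.lang.Long // rendered as a number or null in JSON
)
{code}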

 Provide a stable application submission gateway in standalone cluster mode
 --

 Key: SPARK-5388
 URL: https://issues.apache.org/jira/browse/SPARK-5388
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker
 Attachments: stable-spark-submit-in-standalone-mode-2-4-15.pdf


 The existing submission gateway in standalone mode is not compatible across 
 Spark versions. If you have a newer version of Spark submitting to an older 
 version of the standalone Master, it is currently not guaranteed to work. The 
 goal is to provide a stable REST interface to replace this channel.
 For more detail, please see the most recent design doc attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5324) Results of describe can't be queried

2015-02-06 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5324.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4249
[https://github.com/apache/spark/pull/4249]

 Results of describe can't be queried
 

 Key: SPARK-5324
 URL: https://issues.apache.org/jira/browse/SPARK-5324
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Michael Armbrust
 Fix For: 1.3.0


 {code}
 sql("DESCRIBE TABLE test").registerTempTable("describeTest")
 sql("SELECT * FROM describeTest").collect()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5628) Add option to return spark-ec2 version

2015-02-06 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5628:

Fix Version/s: 1.2.2

 Add option to return spark-ec2 version
 --

 Key: SPARK-5628
 URL: https://issues.apache.org/jira/browse/SPARK-5628
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Priority: Minor
  Labels: backport-needed
 Fix For: 1.3.0, 1.2.2, 1.4.0


 We need a {{--version}} option for {{spark-ec2}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5628) Add option to return spark-ec2 version

2015-02-06 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5628:

Labels: backport-needed  (was: )

 Add option to return spark-ec2 version
 --

 Key: SPARK-5628
 URL: https://issues.apache.org/jira/browse/SPARK-5628
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Priority: Minor
  Labels: backport-needed
 Fix For: 1.3.0, 1.2.2, 1.4.0


 We need a {{--version}} option for {{spark-ec2}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5636) Lower dynamic allocation add interval

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5636.

   Resolution: Fixed
Fix Version/s: 1.3.0

 Lower dynamic allocation add interval
 -

 Key: SPARK-5636
 URL: https://issues.apache.org/jira/browse/SPARK-5636
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
 Fix For: 1.3.0


 The current default of 1 min is a little long especially since a recent patch 
 causes the number of executors to start at 0 by default. We should ramp up 
 much more quickly in the beginning.
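 For illustration only, the ramp-up can also be tuned from user configuration, 
 assuming the 1.2-era spark.dynamicAllocation.* keys; this ticket itself is 
 about changing the built-in default.
 {code}
 val conf = new org.apache.spark.SparkConf()
   .set("spark.dynamicAllocation.enabled", "true")
   .set("spark.dynamicAllocation.minExecutors", "0")
   .set("spark.dynamicAllocation.maxExecutors", "50")
   // Wait only a few seconds of task backlog before requesting more executors.
   .set("spark.dynamicAllocation.schedulerBacklogTimeout", "5")
 {code}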



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5655) YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode

2015-02-06 Thread Andrew Rowson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Rowson updated SPARK-5655:
-
Description: 
When running a Spark job on a YARN cluster which doesn't run containers under 
the same user as the nodemanager, and also when using the YARN auxiliary 
shuffle service, jobs fail with something similar to:

java.io.FileNotFoundException: 
/data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index
 (Permission denied)


The root cause of this here: 
https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287

Spark will attempt to chmod 700 any application directories it creates during 
the job, which includes files created in the nodemanager's usercache directory. 
The owner of these files is the container UID, which on a secure cluster is the 
name of the user creating the job, and on a nonsecure cluster with 
yarn.nodemanager.container-executor.class configured is the value of 
yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.

The problem with this is that the auxiliary shuffle manager runs as part of the 
nodemanager, which is typically running as the user 'yarn'. This can't access 
these files that are only owner-readable.

YARN already attempts to secure files created under appcache but keep them 
readable by the nodemanager, by setting the group of the appcache directory to 
'yarn' and also setting the setgid flag. This means that files and directories 
created under this should also have the 'yarn' group. Normally this means that 
the nodemanager should also be able to read these files, but Spark setting 
chmod700 wipes this out.

I'm not sure what the right approach is here. Commenting out the chmod700 
functionality makes this work on YARN, and still makes the application files 
only readable by the owner and the group:

data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c
 # ls -lah
total 206M
drwxr-s---  2 nobody yarn 4.0K Feb  6 18:30 .
drwxr-s--- 12 nobody yarn 4.0K Feb  6 18:30 ..
-rw-r-----  1 nobody yarn 206M Feb  6 18:30 shuffle_0_0_0.data

But this may not be the right approach on non-YARN. Perhaps an additional step 
to see if this chmod700 step is necessary (ie non-YARN) is required. Sadly, I 
don't have a non-YARN environment to test, otherwise I'd be able to suggest a 
patch.

I believe this is a related issue in the MapReduce framework: 
https://issues.apache.org/jira/browse/MAPREDUCE-3728
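A sketch of the kind of conditional step suggested above; the runningOnYarn 
flag is a hypothetical placeholder, not Spark's actual check.

{code}
import java.io.File

def createDirSketch(dir: File, runningOnYarn: Boolean): Unit = {
  dir.mkdirs()
  if (!runningOnYarn) {
    // Equivalent of chmod 700: strip access for everyone, then restore owner-only access.
    dir.setReadable(false, false); dir.setWritable(false, false); dir.setExecutable(false, false)
    dir.setReadable(true, true); dir.setWritable(true, true); dir.setExecutable(true, true)
  }
  // On YARN, keep the group-readable/setgid permissions the nodemanager set on
  // the appcache directory, so the auxiliary shuffle service can read the files.
}
{code}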

  was:
When running a Spark job on a YARN cluster which doesn't run containers under 
the same user as the nodemanager, and also when using the YARN auxiliary 
shuffle service, jobs fail with something similar to:
{code|borderStyle=solid}
java.io.FileNotFoundException: 
/data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index
 (Permission denied)
{/code}

The root cause of this here: 
https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287

Spark will attempt to chmod 700 any application directories it creates during 
the job, which includes files created in the nodemanager's usercache directory. 
The owner of these files is the container UID, which on a secure cluster is the 
name of the user creating the job, and on a nonsecure cluster with 
yarn.nodemanager.container-executor.class configured is the value of 
yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.

The problem with this is that the auxiliary shuffle manager runs as part of the 
nodemanager, which is typically running as the user 'yarn'. This can't access 
these files that are only owner-readable.

YARN already attempts to secure files created under appcache but keep them 
readable by the nodemanager, by setting the group of the appcache directory to 
'yarn' and also setting the setgid flag. This means that files and directories 
created under this should also have the 'yarn' group. Normally this means that 
the nodemanager should also be able to read these files, but Spark setting 
chmod700 wipes this out.

I'm not sure what the right approach is here. Commenting out the chmod700 
functionality makes this work on YARN, and still makes the application files 
only readable by the owner and the group:

data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c
 # ls -lah
total 206M
drwxr-s---  2 nobody yarn 4.0K Feb  6 18:30 .
drwxr-s--- 12 nobody yarn 4.0K Feb  6 18:30 ..
-rw-r-----  1 nobody yarn 206M Feb  6 18:30 shuffle_0_0_0.data

But this may not be the right approach on non-YARN. Perhaps an additional step 
to see if this chmod700 step is necessary (i.e. non-YARN) is required. Sadly, I 
don't have a non-YARN environment to test, otherwise I'd be able to suggest a 
patch.

[jira] [Commented] (SPARK-4877) userClassPathFirst doesn't handle user classes inheriting from parent

2015-02-06 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309666#comment-14309666
 ] 

Josh Rosen commented on SPARK-4877:
---

I've gone ahead and committed this PR because it fixes a known bug and adds a 
new test case.  Both the old and new code overloaded findClass; I think the 
findClass vs. loadClass change is related to this JIRA, but kind of 
orthogonal to the fix here.  If you think that we should re-work our 
classloader to change its overriding strategy, let's do that in a separate 
followup PR.

 userClassPathFirst doesn't handle user classes inheriting from parent
 -

 Key: SPARK-4877
 URL: https://issues.apache.org/jira/browse/SPARK-4877
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Stephen Haberman
Assignee: Stephen Haberman
 Fix For: 1.3.0


 We're trying out userClassPathFirst.
 To do so, we make an uberjar that does not contain Spark or Scala classes 
 (because we want those to load from the parent classloader, otherwise we'll 
 get errors like scala.Function0 != scala.Function0 since they'd load from 
 different class loaders).
 (Tangentially, some isolation classloaders like Jetty whitelist certain 
 packages, like spark/* and scala/*, to only come from the parent classloader, 
 so that technically if the user still messes up and leaks the Scala/Spark 
 jars into their uberjar, it won't blow up; this would be a good enhancement, 
 I think.)
 Anyway, we have a custom Kryo registrar, which ships in our uberjar, but 
 since it extends spark.KryoRegistrator, which is not in our uberjar, we get 
 a ClassNotFoundException.
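 A hedged sketch of the whitelisting idea mentioned above (this is not Spark's 
 actual classloader; the class name and prefix list are assumptions): load user 
 classes child-first, but always delegate Spark/Scala packages to the parent so 
 types like KryoRegistrator resolve consistently.
 {code}
 import java.net.{URL, URLClassLoader}

 class ChildFirstSketch(urls: Array[URL], parent: ClassLoader)
   extends URLClassLoader(urls, parent) {

   private val parentFirstPrefixes = Seq("org.apache.spark.", "scala.", "java.")

   override def loadClass(name: String, resolve: Boolean): Class[_] = {
     if (parentFirstPrefixes.exists(p => name.startsWith(p))) {
       super.loadClass(name, resolve) // always delegate whitelisted packages
     } else {
       try {
         // Try the user jar first, fall back to the parent on failure.
         val c = Option(findLoadedClass(name)).getOrElse(findClass(name))
         if (resolve) resolveClass(c)
         c
       } catch {
         case _: ClassNotFoundException => super.loadClass(name, resolve)
       }
     }
   }
 }
 {code}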



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2945) Allow specifying num of executors in the context configuration

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-2945.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Assignee: WangTaoTheTonic
Target Version/s: 1.3.0

 Allow specifying num of executors in the context configuration
 --

 Key: SPARK-2945
 URL: https://issues.apache.org/jira/browse/SPARK-2945
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.0.0
 Environment: Ubuntu precise, on YARN (CDH 5.1.0)
Reporter: Shay Rojansky
Assignee: WangTaoTheTonic
 Fix For: 1.3.0


 Running on YARN, the only way to specify the number of executors seems to be 
 on the command line of spark-submit, via the --num-executors switch.
 In many cases this is too early. Our Spark app receives some cmdline 
 arguments which determine the amount of work that needs to be done - and that 
 affects the number of executors it ideally requires. Ideally, the Spark 
 context configuration would support specifying this like any other config 
 param.
 Our current workaround is a wrapper script that determines how much work is 
 needed, and which itself launches spark-submit with the number passed to 
 --num-executors - it's a shame to have to do this.
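 A sketch of the requested usage, assuming the executor count becomes an 
 ordinary config key (spark.executor.instances here); the sizing logic is made 
 up.
 {code}
 def estimateWorkUnits(): Int = 200                     // stand-in for real application logic
 val executors = math.min(50, estimateWorkUnits() / 10) // hypothetical sizing rule

 val conf = new org.apache.spark.SparkConf()
   .setAppName("sized-by-workload")
   .set("spark.executor.instances", executors.toString) // instead of --num-executors
 val sc = new org.apache.spark.SparkContext(conf)
 {code}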



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5625) Spark binaries do not include Spark Core

2015-02-06 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309935#comment-14309935
 ] 

DeepakVohra commented on SPARK-5625:


It is not clear whether the assembly jar is meant to be extracted. If it is 
added to the classpath as-is, the core classes are not found.

 Spark binaries do not include Spark Core
 ---

 Key: SPARK-5625
 URL: https://issues.apache.org/jira/browse/SPARK-5625
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.2.0
 Environment: CDH4
Reporter: DeepakVohra

 Spark binaries for CDH 4 do not include the Spark Core Jar. 
 http://spark.apache.org/downloads.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5593) Replace BlockManager listener with Executor listener in ExecutorAllocationListener

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5593:
-
Component/s: Spark Core

 Replace BlockManager listener with Executor listener in 
 ExecutorAllocationListener
 --

 Key: SPARK-5593
 URL: https://issues.apache.org/jira/browse/SPARK-5593
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Lianhui Wang
Assignee: Lianhui Wang
 Fix For: 1.3.0


 More strictly, in ExecutorAllocationListener we need to replace 
 onBlockManagerAdded and onBlockManagerRemoved with onExecutorAdded and 
 onExecutorRemoved, because in some cases the executor events express these 
 meanings more accurately. For example, in SPARK-5529 the BlockManager has been 
 removed but the executor still exists.
 [~andrewor14] [~sandyr] 
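 A sketch of the listener shape being asked for, assuming the 1.3-era executor 
 events; the real logic lives in ExecutorAllocationListener.
 {code}
 import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

 class ExecutorTrackingSketch extends SparkListener {
   private var live = Set.empty[String]

   override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit =
     live += e.executorId

   override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit =
     live -= e.executorId
 }
 {code}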



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5625) Spark binaries do not include Spark Core

2015-02-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309655#comment-14309655
 ] 

Sean Owen commented on SPARK-5625:
--

These are in the assembly. The idea is that it is one artifact containing the 
entire Spark distribution.

 Spark binaries do not include Spark Core
 ---

 Key: SPARK-5625
 URL: https://issues.apache.org/jira/browse/SPARK-5625
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.2.0
 Environment: CDH4
Reporter: DeepakVohra

 Spark binaries for CDH 4 do not include the Spark Core Jar. 
 http://spark.apache.org/downloads.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-706) Failures in block manager put leads to task hanging

2015-02-06 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-706:
---
Component/s: Block Manager

 Failures in block manager put leads to task hanging
 ---

 Key: SPARK-706
 URL: https://issues.apache.org/jira/browse/SPARK-706
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 0.6.0, 0.6.1, 0.7.0, 0.6.2
Reporter: Reynold Xin

 Reported in this thread: 
 https://groups.google.com/forum/?fromgroups=#!topic/shark-users/Q_SiIDzVtZw
 The following exception in block manager leaves the block marked as pending. 
 {code}
 13/02/26 06:14:56 ERROR executor.Executor: Exception in task ID 39
 com.esotericsoftware.kryo.SerializationException: Buffer limit exceeded 
 writing object of type: shark.ColumnarWritable
   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:492)
   at spark.KryoSerializationStream.writeObject(KryoSerializer.scala:78)
   at 
 spark.serializer.SerializationStream$class.writeAll(Serializer.scala:58)
   at spark.KryoSerializationStream.writeAll(KryoSerializer.scala:73)
   at spark.storage.DiskStore.putValues(DiskStore.scala:63)
   at spark.storage.BlockManager.dropFromMemory(BlockManager.scala:779)
   at spark.storage.MemoryStore.tryToPut(MemoryStore.scala:162)
   at spark.storage.MemoryStore.putValues(MemoryStore.scala:57)
   at spark.storage.BlockManager.put(BlockManager.scala:582)
   at spark.CacheTracker.getOrCompute(CacheTracker.scala:215)
   at spark.RDD.iterator(RDD.scala:159)
   at spark.scheduler.ResultTask.run(ResultTask.scala:18)
   at spark.executor.Executor$TaskRunner.run(Executor.scala:76)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
   at java.lang.Thread.run(Thread.java:679)
 {code}
 When the block is read, the task is stuck in BlockInfo.waitForReady().
 We should propagate the error back to the master instead of hanging the slave 
 node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3600) RDD[Double] doesn't use primitive arrays for caching

2015-02-06 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-3600:

Component/s: Spark Core

 RDD[Double] doesn't use primitive arrays for caching
 

 Key: SPARK-3600
 URL: https://issues.apache.org/jira/browse/SPARK-3600
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Xiangrui Meng

 RDD's classTag is not passed in through CacheManager. So RDD[Double] uses 
 object arrays for caching, which leads to huge overhead. However, we need to 
 send the classTag down many levels to make it work.
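 An illustration of why the missing ClassTag matters: with a ClassTag in scope 
 the array is a primitive double[], without one the fallback is an object array 
 with a boxed java.lang.Double per element.
 {code}
 import scala.reflect.ClassTag

 def buildWithTag[T: ClassTag](n: Int)(f: Int => T): Array[T] = {
   val a = new Array[T](n) // primitive array when T = Double and a ClassTag is available
   var i = 0
   while (i < n) { a(i) = f(i); i += 1 }
   a
 }

 val primitive = buildWithTag[Double](1000)(_.toDouble) // backed by double[1000], no boxing
 {code}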



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4024) Remember user preferences for metrics to show in the UI

2015-02-06 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-4024:

Component/s: Web UI

 Remember user preferences for metrics to show in the UI
 ---

 Key: SPARK-4024
 URL: https://issues.apache.org/jira/browse/SPARK-4024
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Kay Ousterhout
Priority: Minor

 We should remember the metrics a user has previously chosen to display for 
 each stage, so that the user doesn't need to reselect interesting metric each 
 time they open a stage detail page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark

2015-02-06 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309782#comment-14309782
 ] 

Matei Zaharia commented on SPARK-5654:
--

Yup, there's a tradeoff, but given that this is a language API and not an 
algorithm, input source or anything like that, I think it's important to 
support it along with the core engine. R is extremely popular for data science, 
more so than Python, and it fits well with many existing concepts in Spark.

 Integrate SparkR into Apache Spark
 --

 Key: SPARK-5654
 URL: https://issues.apache.org/jira/browse/SPARK-5654
 Project: Spark
  Issue Type: New Feature
Reporter: Shivaram Venkataraman

 The SparkR project [1] provides a light-weight frontend to launch Spark jobs 
 from R. The project was started at the AMPLab around a year ago and has been 
 incubated as its own project to make sure it can be easily merged into 
 upstream Spark, i.e. not introduce any external dependencies etc. SparkR’s 
 goals are similar to PySpark and shares a similar design pattern as described 
 in our meetup talk[2], Spark Summit presentation[3].
 Integrating SparkR into the Apache project will enable R users to use Spark 
 out of the box and given R’s large user base, it will help the Spark project 
 reach more users.  Additionally, work in progress features like providing R 
 integration with ML Pipelines and Dataframes can be better achieved by 
 development in a unified code base.
 SparkR is available under the Apache 2.0 License and does not have any 
 external dependencies other than requiring users to have R and Java installed 
 on their machines.  SparkR’s developers come from many organizations 
 including UC Berkeley, Alteryx, Intel and we will support future development, 
 maintenance after the integration.
 [1] https://github.com/amplab-extras/SparkR-pkg
 [2] http://files.meetup.com/3138542/SparkR-meetup.pdf
 [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5628) Add option to return spark-ec2 version

2015-02-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5628:
--
Assignee: Nicholas Chammas

 Add option to return spark-ec2 version
 --

 Key: SPARK-5628
 URL: https://issues.apache.org/jira/browse/SPARK-5628
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0, 1.4.0


 We need a {{--version}} option for {{spark-ec2}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5531) Spark download .tgz file does not get unpacked

2015-02-06 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309938#comment-14309938
 ] 

DeepakVohra commented on SPARK-5531:


Why two options, Direct Download and Select Apache Mirror, if both direct to 
the same HTML page?

 Spark download .tgz file does not get unpacked
 --

 Key: SPARK-5531
 URL: https://issues.apache.org/jira/browse/SPARK-5531
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: Linux
Reporter: DeepakVohra

 The spark-1.2.0-bin-cdh4.tgz file downloaded from 
 http://spark.apache.org/downloads.html does not get unpacked.
 tar xvf spark-1.2.0-bin-cdh4.tgz
 gzip: stdin: not in gzip format
 tar: Child returned status 1
 tar: Error is not recoverable: exiting now



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5371) SparkSQL Fails to parse Query with UNION ALL in subquery

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5371:
-
Component/s: SQL

 SparkSQL Fails to parse Query with UNION ALL in subquery
 

 Key: SPARK-5371
 URL: https://issues.apache.org/jira/browse/SPARK-5371
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: David Ross

 This SQL session:
 {code}
 DROP TABLE
 test1;
 DROP TABLE
 test2;
 CREATE TABLE
 test1
 (
 c11 INT,
 c12 INT,
 c13 INT,
 c14 INT
 );
 CREATE TABLE
 test2
 (
 c21 INT,
 c22 INT,
 c23 INT,
 c24 INT
 );
 SELECT
 MIN(t3.c_1),
 MIN(t3.c_2),
 MIN(t3.c_3),
 MIN(t3.c_4)
 FROM
 (
 SELECT
 SUM(t1.c11) c_1,
 NULL c_2,
 NULL c_3,
 NULL c_4
 FROM
 test1 t1
 UNION ALL
 SELECT
 NULL c_1,
 SUM(t2.c22) c_2,
 SUM(t2.c23) c_3,
 SUM(t2.c24) c_4
 FROM
 test2 t2 ) t3; 
 {code}
 Produces this error:
 {code}
 15/01/23 00:25:21 INFO thriftserver.SparkExecuteStatementOperation: Running 
 query 'SELECT
 MIN(t3.c_1),
 MIN(t3.c_2),
 MIN(t3.c_3),
 MIN(t3.c_4)
 FROM
 (
 SELECT
 SUM(t1.c11) c_1,
 NULL c_2,
 NULL c_3,
 NULL c_4
 FROM
 test1 t1
 UNION ALL
 SELECT
 NULL c_1,
 SUM(t2.c22) c_2,
 SUM(t2.c23) c_3,
 SUM(t2.c24) c_4
 FROM
 test2 t2 ) t3'
 15/01/23 00:25:21 INFO parse.ParseDriver: Parsing command: SELECT
 MIN(t3.c_1),
 MIN(t3.c_2),
 MIN(t3.c_3),
 MIN(t3.c_4)
 FROM
 (
 SELECT
 SUM(t1.c11) c_1,
 NULL c_2,
 NULL c_3,
 NULL c_4
 FROM
 test1 t1
 UNION ALL
 SELECT
 NULL c_1,
 SUM(t2.c22) c_2,
 SUM(t2.c23) c_3,
 SUM(t2.c24) c_4
 FROM
 test2 t2 ) t3
 15/01/23 00:25:21 INFO parse.ParseDriver: Parse Completed
 15/01/23 00:25:21 ERROR thriftserver.SparkExecuteStatementOperation: Error 
 executing query:
 java.util.NoSuchElementException: key not found: c_2#23488
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
   at scala.collection.MapLike$class.apply(MapLike.scala:141)
   at 
 org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
   at 
 org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$1.applyOrElse(Optimizer.scala:77)
   at 
 org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$1.applyOrElse(Optimizer.scala:76)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
   at 
 org.apache.spark.sql.catalyst.optimizer.UnionPushdown$.pushToRight(Optimizer.scala:76)
   at 
 org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1$$anonfun$applyOrElse$6.apply(Optimizer.scala:98)
   at 
 org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1$$anonfun$applyOrElse$6.apply(Optimizer.scala:98)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1.applyOrElse(Optimizer.scala:98)
   at 
 org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1.applyOrElse(Optimizer.scala:85)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 

[jira] [Updated] (SPARK-4854) Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4854:
-
Component/s: SQL

 Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI
 -

 Key: SPARK-4854
 URL: https://issues.apache.org/jira/browse/SPARK-4854
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1
Reporter: Shenghua Wan

 Hello, 
 I met a problem when using Spark sql CLI. A custom UDTF with lateral view 
 throws ClassNotFound exception. I did a couple of experiments in same 
 environment (spark version 1.1.0, 1.1.1): 
 select + same custom UDTF (Passed) 
 select + lateral view + custom UDTF (ClassNotFoundException) 
 select + lateral view + built-in UDTF (Passed) 
 I have done some googling over the past few days and found one related issue 
 ticket for Spark:
 https://issues.apache.org/jira/browse/SPARK-4811
 which is about Custom UDTFs not working in Spark SQL. 
 It should be helpful to put actual code here to reproduce the problem. 
 However,  corporate regulations might prohibit this. So sorry about this. 
 Directly using explode's source code in a jar will help anyway. 
 Here is a portion of stack print when exception, just in case: 
 java.lang.ClassNotFoundException: XXX 
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) 
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355) 
 at java.security.AccessController.doPrivileged(Native Method) 
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354) 
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425) 
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358) 
 at 
 org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:81)
  
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.createFunction(hiveUdfs.scala:247) 
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:254)
  
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:254) 
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.outputInspectors$lzycompute(hiveUdfs.scala:261)
  
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.outputInspectors(hiveUdfs.scala:260)
  
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:265)
  
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes(hiveUdfs.scala:265) 
 at 
 org.apache.spark.sql.hive.HiveGenericUdtf.makeOutput(hiveUdfs.scala:269) 
 at 
 org.apache.spark.sql.catalyst.expressions.Generator.output(generators.scala:60)
  
 at 
 org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$1.apply(basicOperators.scala:50)
  
 at 
 org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$1.apply(basicOperators.scala:50)
  
 at scala.Option.map(Option.scala:145) 
 at 
 org.apache.spark.sql.catalyst.plans.logical.Generate.generatorOutput(basicOperators.scala:50)
  
 at 
 org.apache.spark.sql.catalyst.plans.logical.Generate.output(basicOperators.scala:60)
  
 at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:79)
  
 at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:79)
  
 at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
  
 at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
  
 at scala.collection.immutable.List.foreach(List.scala:318) 
 at 
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) 
 at 
 scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) 
 the rest is omitted. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5427) Add support for floor function in Spark SQL

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5427:
-
Component/s: SQL

 Add support for floor function in Spark SQL
 ---

 Key: SPARK-5427
 URL: https://issues.apache.org/jira/browse/SPARK-5427
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Ted Yu

 floor() function is supported in Hive SQL.
 This issue is to add floor() function to Spark SQL.
 Related thread: http://search-hadoop.com/m/JW1q563fc22



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5314) java.lang.OutOfMemoryError in SparkSQL with GROUP BY

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5314:
-
Component/s: SQL

 java.lang.OutOfMemoryError in SparkSQL with GROUP BY
 

 Key: SPARK-5314
 URL: https://issues.apache.org/jira/browse/SPARK-5314
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Alex Baretta

 I am running a SparkSQL GROUP BY query on a largish Parquet table (a few 
 hundred million rows), weighing in at about 50GB. My cluster has 1.7 TB of 
 RAM, so it should have more than enough resources to cope with this query.
 WARN TaskSetManager: Lost task 279.0 in stage 22.0 (TID 1229, 
 ds-model-w-21.c.eastern-gravity-771.internal): java.lang.OutOfMemoryError: GC 
 overhead limit exceeded
 at scala.collection.SeqLike$class.distinct(SeqLike.scala:493)
 at scala.collection.AbstractSeq.distinct(Seq.scala:40)
 at 
 org.apache.spark.sql.catalyst.expressions.Coalesce.resolved$lzycompute(nullFunctions.scala:33)
 at 
 org.apache.spark.sql.catalyst.expressions.Coalesce.resolved(nullFunctions.scala:33)
 at 
 org.apache.spark.sql.catalyst.expressions.Coalesce.dataType(nullFunctions.scala:37)
 at 
 org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:100)
 at 
 org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:101)
 at 
 org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:50)
 at 
 org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:81)
 at 
 org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:571)
 at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)
 at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5129) make SqlContext support select date +/- XX DAYS from table

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5129:
-
Component/s: SQL

 make SqlContext support select date +/- XX DAYS from table  
 --

 Key: SPARK-5129
 URL: https://issues.apache.org/jira/browse/SPARK-5129
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: DoingDone9
Priority: Minor

 Example :
 create table test (date: Date)
 2014-01-01
 2014-01-02
 2014-01-03
 when running select date + 10 DAYS from test, I want to get
 2014-01-11 
 2014-01-12
 2014-01-13
 and when running select date - 10 DAYS from test, I want to get
 2013-12-22
 2013-12-23
 2013-12-24
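 For reference, a sketch of the underlying arithmetic the requested expression 
 would perform (illustration only, not the SqlContext parser change itself):
 {code}
 import java.sql.Date
 import java.util.Calendar

 def addDays(d: Date, days: Int): Date = {
   val cal = Calendar.getInstance()
   cal.setTime(d)
   cal.add(Calendar.DAY_OF_MONTH, days)
   new Date(cal.getTimeInMillis)
 }

 addDays(Date.valueOf("2014-01-01"), 10)  // 2014-01-11
 addDays(Date.valueOf("2014-01-01"), -10) // 2013-12-22
 {code}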



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5001) BlockRDD removed unreasonably in streaming

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5001:
-
Component/s: Streaming

 BlockRDD removed unreasonably in streaming
 ---

 Key: SPARK-5001
 URL: https://issues.apache.org/jira/browse/SPARK-5001
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.2, 1.1.1, 1.2.0
Reporter: hanhonggen
 Attachments: 
 fix_bug_BlockRDD_removed_not_reasonablly_in_streaming.patch


 I've counted messages using the Kafka input stream of spark-1.1.1. The test app 
 failed when a later batch job completed sooner than the previous one. In the 
 source code, BlockRDDs older than (time - rememberDuration) are removed in 
 clearMetadata after a job completes, so the previous job aborts because its 
 block is not found. The relevant logs are as follows:
 2014-12-25 
 14:07:12(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
 :Starting job streaming job 1419487632000 ms.0 from job set of time 
 1419487632000 ms
 2014-12-25 
 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
 :Starting job streaming job 1419487635000 ms.0 from job set of time 
 1419487635000 ms
 2014-12-25 
 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-15] INFO 
 :Finished job streaming job 1419487635000 ms.0 from job set of time 
 1419487635000 ms
 2014-12-25 
 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-16] INFO 
 :Removing blocks of RDD BlockRDD[3028] at createStream at TestKafka.java:144 
 of time 1419487635000 ms from DStream clearMetadata
 java.lang.Exception: Could not compute split, block input-0-1419487631400 not 
 found for 3028
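 One illustrative mitigation (not a fix for the cleanup logic itself) is to 
 extend the remember duration so blocks outlive a slow earlier batch; the 
 values here are arbitrary.
 {code}
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

 val ssc = new StreamingContext(
   new SparkConf().setMaster("local[2]").setAppName("remember-demo"), Seconds(3))
 // Keep generated RDDs (and their blocks) for 5 minutes instead of the default.
 ssc.remember(Minutes(5))
 {code}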



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2802) Improve the Cassandra sample and Add a new sample for Streaming to Cassandra

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2802:
-
Component/s: Streaming

 Improve the Cassandra sample and Add a new sample for Streaming to Cassandra
 

 Key: SPARK-2802
 URL: https://issues.apache.org/jira/browse/SPARK-2802
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Helena Edelson
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5066) Can not get all key that has same hashcode when reading key ordered from different Streaming.

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5066:
-
Component/s: Streaming

 Can not get all key that has same hashcode  when reading key ordered  from 
 different Streaming.
 ---

 Key: SPARK-5066
 URL: https://issues.apache.org/jira/browse/SPARK-5066
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: DoingDone9
Priority: Critical

 When spilling is enabled, data ordered by hashCode is spilled to disk. When 
 merging values we need to get every key with the same hashCode from the 
 different tmp files, but the code only reads the keys matching the minimum 
 hashCode within each tmp file, so we cannot read all keys.
 Example:
 If file1 has [k1, k2, k3] and file2 has [k4, k5, k1],
 and hashcode of k4 < hashcode of k5 < hashcode of k1 < hashcode of k2 < 
 hashcode of k3,
 we only read k1 from file1 and k4 from file2, and cannot read all copies of k1.
 Code:
 private val inputStreams = (Seq(sortedMap) ++ spilledMaps).map(it => it.buffered)
 inputStreams.foreach { it =>
   val kcPairs = new ArrayBuffer[(K, C)]
   readNextHashCode(it, kcPairs)
   if (kcPairs.length > 0) {
     mergeHeap.enqueue(new StreamBuffer(it, kcPairs))
   }
 }
 private def readNextHashCode(it: BufferedIterator[(K, C)], buf: ArrayBuffer[(K, C)]): Unit = {
   if (it.hasNext) {
     var kc = it.next()
     buf += kc
     val minHash = hashKey(kc)
     while (it.hasNext && it.head._1.hashCode() == minHash) {
       kc = it.next()
       buf += kc
     }
   }
 }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5615) Fix testPackage in StreamingContextSuite

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5615:
-
Component/s: Streaming

 Fix testPackage in StreamingContextSuite
 

 Key: SPARK-5615
 URL: https://issues.apache.org/jira/browse/SPARK-5615
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Liang-Chi Hsieh
Priority: Minor

 testPackage in StreamingContextSuite often throws SparkException because its 
 ssc is not shut down gracefully. This does not affect the unit test, but I 
 think we can make it graceful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4174) Streaming: Optionally provide notifications to Receivers when DStream has been generated

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4174:
-
Component/s: Streaming

 Streaming: Optionally provide notifications to Receivers when DStream has 
 been generated
 

 Key: SPARK-4174
 URL: https://issues.apache.org/jira/browse/SPARK-4174
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan

 Receivers receiving data from Message Queues, like Active MQ, Kafka etc can 
 replay messages if required. Using the HDFS WAL mechanism for such systems 
 affects efficiency as we are incurring an unnecessary HDFS write when we can 
 recover the data from the queue anyway.
 We can fix this by providing a notification to the receiver when the RDD is 
 generated from the blocks. We need to consider the case where a receiver 
 might fail before the RDD is generated and come back on a different executor 
 when the RDD is generated. Either way, this is likely to cause duplicates and 
 not data loss -- so we may be ok.
 I am thinking about something of the order of accepting a callback function 
 which gets called when the RDD is generated. We can keep the function local 
 in a map of batch id -> function, which gets called when the RDD gets 
 generated (we can inform the ReceiverSupervisorImpl via Akka when the driver 
 generates the RDD). Of course, just an early thought - I will work on a 
 design doc for this one.
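 A very rough sketch of the "batch id -> callback" idea (names are 
 hypothetical; the eventual design doc may differ):
 {code}
 import scala.collection.mutable

 class RddGeneratedCallbacks {
   private val callbacks = mutable.Map.empty[Long, () => Unit]

   // The receiver registers a callback for a batch it has buffered data for.
   def register(batchTimeMs: Long)(f: () => Unit): Unit =
     callbacks.synchronized { callbacks(batchTimeMs) = f }

   // The driver side signals that the batch's RDD exists, so the receiver can
   // ack/commit the corresponding messages back to the queue.
   def rddGenerated(batchTimeMs: Long): Unit =
     callbacks.synchronized { callbacks.remove(batchTimeMs).foreach(f => f()) }
 }
 {code}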



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4874) Report number of records read/written in a task

2015-02-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4874:
---
Component/s: Web UI
 Spark Core

 Report number of records read/written in a task
 ---

 Key: SPARK-4874
 URL: https://issues.apache.org/jira/browse/SPARK-4874
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Reporter: Kostas Sakellis
Assignee: Kostas Sakellis
 Fix For: 1.3.0


 This metric will help us find key skew using the WebUI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4874) Report number of records read/written in a task

2015-02-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4874.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Target Version/s: 1.3.0

 Report number of records read/written in a task
 ---

 Key: SPARK-4874
 URL: https://issues.apache.org/jira/browse/SPARK-4874
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Reporter: Kostas Sakellis
Assignee: Kostas Sakellis
 Fix For: 1.3.0


 This metric will help us find key skew using the WebUI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4541) Add --version to spark-submit

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4541:
-
Component/s: Spark Submit

 Add --version to spark-submit
 -

 Key: SPARK-4541
 URL: https://issues.apache.org/jira/browse/SPARK-4541
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Reporter: Arun Ahuja
Priority: Minor

 On a lot of the release testing and discussion on the JIRA the question of 
 what spark version users are using and how to verify the version comes up. 
 Can we 
 1) Add a flag to spark-submit that tells the version.
 2) Log a version/last commit in the logs as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form

2015-02-06 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310130#comment-14310130
 ] 

Sandy Ryza edited comment on SPARK-4550 at 2/6/15 11:13 PM:


I got a working prototype and benchmarked the ExternalSorter changes on my 
laptop.

Each run inserts a bunch of records, each a (Int, (10-character string, Int)) 
tuple, into an ExternalSorter and then calls writePartitionedFile.  The 
reported memory size is the sum of the shuffle bytes spilled (mem) metric and 
the remaining size of the collection after insertion has completed.  Results 
are averaged over three runs.

Keep in mind that the primary goal here is to reduce GC pressure, so any speed 
improvements are icing.

||Number of records||Storing as Serialized||Memory Size||Number of 
Spills||Insert Time (ms)||Write Time (ms)||Total Time||
|1 million|false|194923217|0|1123|3442|4566|
|1 million|true|48694072|0|1315|2652|3967|
|10 million|false|2050514159|3|26723|17418|44141|
|10 million|true|613614392|1|16501|17151|33652|
|10 million|false|10166122563|17|101831|89960|191791|
|10 million|true|3067937592|5|76801|78361|155161|


was (Author: sandyr):
I got a working prototype and benchmarked the ExternalSorter changes on my 
laptop.

Each run inserts a bunch of records, each a (Int, (10-character string, Int)) 
tuple, into an ExternalSorter and then calls writePartitionedFile.  The 
reported memory size is the sum of the shuffle bytes spilled (mem) metric and 
the remaining size of the collection after insertion has completed.  Results 
are averaged over three runs.

Keep in mind that the primary goal here is to reduce GC pressure, so any speed 
improvements are icing.

||Number of records||Storing as Serialized||Memory Size||Number of 
Spills||Insert Time(ms)||Write Time (ms)||Total Time||
|1 million|false|194923217|0|1123|3442|4566|
|1 million|true|48694072|0|1315|2652|3967|
|10 million|false|2050514159|3|26723|17418|44141|
|10 million|true|613614392|1|16501|17151|33652|
|10 million|false|10166122563|17|101831|89960|191791|
|10 million|true|3067937592|5|76801|78361|155161|

 In sort-based shuffle, store map outputs in serialized form
 ---

 Key: SPARK-4550
 URL: https://issues.apache.org/jira/browse/SPARK-4550
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical
 Attachments: SPARK-4550-design-v1.pdf


 One drawback with sort-based shuffle compared to hash-based shuffle is that 
 it ends up storing many more java objects in memory.  If Spark could store 
 map outputs in serialized form, it could
 * spill less often because the serialized form is more compact
 * reduce GC pressure
 This will only work when the serialized representations of objects are 
 independent from each other and occupy contiguous segments of memory.  E.g. 
 when Kryo reference tracking is left on, objects may contain pointers to 
 objects farther back in the stream, which means that the sort can't relocate 
 objects without corrupting them.
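 For reference, a sketch of turning off Kryo reference tracking via 
 configuration, which is the precondition described above (how it interacts 
 with the proposed serialized sorting is an assumption here):
 {code}
 val conf = new org.apache.spark.SparkConf()
   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   // With reference tracking off, serialized records contain no back-pointers,
   // so each occupies an independent, contiguous byte range and can be relocated.
   .set("spark.kryo.referenceTracking", "false")
 {code}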



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form

2015-02-06 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310130#comment-14310130
 ] 

Sandy Ryza edited comment on SPARK-4550 at 2/6/15 11:13 PM:


I got a working prototype and benchmarked the ExternalSorter changes on my 
laptop.

Each run inserts a bunch of records, each a (Int, (10-character string, Int)) 
tuple, into an ExternalSorter and then calls writePartitionedFile.  The 
reported memory size is the sum of the shuffle bytes spilled (mem) metric and 
the remaining size of the collection after insertion has completed.  Results 
are averaged over three runs.

Keep in mind that the primary goal here is to reduce GC pressure, so any speed 
improvements are icing.

||Number of Records||Storing as Serialized||Memory Size||Number of 
Spills||Insert Time (ms)||Write Time (ms)||Total Time||
|1 million|false|194923217|0|1123|3442|4566|
|1 million|true|48694072|0|1315|2652|3967|
|10 million|false|2050514159|3|26723|17418|44141|
|10 million|true|613614392|1|16501|17151|33652|
|10 million|false|10166122563|17|101831|89960|191791|
|10 million|true|3067937592|5|76801|78361|155161|


was (Author: sandyr):
I got a working prototype and benchmarked the ExternalSorter changes on my 
laptop.

Each run inserts a bunch of records, each a (Int, (10-character string, Int)) 
tuple, into an ExternalSorter and then calls writePartitionedFile.  The 
reported memory size is the sum of the shuffle bytes spilled (mem) metric and 
the remaining size of the collection after insertion has completed.  Results 
are averaged over three runs.

Keep in mind that the primary goal here is to reduce GC pressure, so any speed 
improvements are icing.

||Number of records||Storing as Serialized||Memory Size||Number of Spills||Insert Time (ms)||Write Time (ms)||Total Time||
|1 million|false|194923217|0|1123|3442|4566|
|1 million|true|48694072|0|1315|2652|3967|
|10 million|false|2050514159|3|26723|17418|44141|
|10 million|true|613614392|1|16501|17151|33652|
|10 million|false|10166122563|17|101831|89960|191791|
|10 million|true|3067937592|5|76801|78361|155161|

 In sort-based shuffle, store map outputs in serialized form
 ---

 Key: SPARK-4550
 URL: https://issues.apache.org/jira/browse/SPARK-4550
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical
 Attachments: SPARK-4550-design-v1.pdf


 One drawback of sort-based shuffle compared to hash-based shuffle is that 
 it ends up storing many more Java objects in memory.  If Spark could store 
 map outputs in serialized form, it could:
 * spill less often, because the serialized form is more compact
 * reduce GC pressure
 This will only work when the serialized representations of objects are 
 independent from each other and occupy contiguous segments of memory.  E.g. 
 when Kryo reference tracking is left on, objects may contain pointers to 
 objects farther back in the stream, which means that the sort can't relocate 
 objects without corrupting them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5658) Finalize DDL and write support APIs

2015-02-06 Thread Yin Huai (JIRA)
Yin Huai created SPARK-5658:
---

 Summary: Finalize DDL and write support APIs
 Key: SPARK-5658
 URL: https://issues.apache.org/jira/browse/SPARK-5658
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5658) Finalize DDL and write support APIs

2015-02-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310194#comment-14310194
 ] 

Apache Spark commented on SPARK-5658:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/4446

 Finalize DDL and write support APIs
 ---

 Key: SPARK-5658
 URL: https://issues.apache.org/jira/browse/SPARK-5658
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4983) Add sleep() before tagging EC2 instances to allow instance metadata to propagate

2015-02-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4983:
--
Assignee: Gen TANG

 Add sleep() before tagging EC2 instances to allow instance metadata to 
 propagate
 

 Key: SPARK-4983
 URL: https://issues.apache.org/jira/browse/SPARK-4983
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
Reporter: Nicholas Chammas
Assignee: Gen TANG
Priority: Minor
  Labels: starter
 Fix For: 1.3.0, 1.2.2


 We launch EC2 instances in {{spark-ec2}} and then immediately tag them in a 
 separate boto call. Sometimes, EC2 doesn't get enough time to propagate 
 information about the just-launched instances, so when we go to tag them we 
 get a server that doesn't know about them yet.
 This yields the following type of error:
 {code}
 Launching instances...
 Launched 1 slaves in us-east-1b, regid = r-cf780321
 Launched master in us-east-1b, regid = r-da7e0534
 Traceback (most recent call last):
    File "./ec2/spark_ec2.py", line 1284, in <module>
  main()
    File "./ec2/spark_ec2.py", line 1276, in main
  real_main()
    File "./ec2/spark_ec2.py", line 1122, in real_main
  (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
    File "./ec2/spark_ec2.py", line 646, in launch_cluster
  value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
    File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py", line 80, in add_tag
  self.add_tags({key: value}, dry_run)
    File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py", line 97, in add_tags
  dry_run=dry_run
    File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 4202, in create_tags
  return self.get_status('CreateTags', params, verb='POST')
    File ".../spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1223, in get_status
  raise self.ResponseError(response.status, response.reason, body)
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-585219a6' does not exist</Message></Error></Errors><RequestID>b9f1ad6e-59b9-47fd-a693-527be1f779eb</RequestID></Response>
 {code}
 The solution is to tag the instances in the same call that launches them, or 
 less desirably, tag the instances after some short wait.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4983) Add sleep() before tagging EC2 instances to allow instance metadata to propagate

2015-02-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4983:
--
Summary: Add sleep() before tagging EC2 instances to allow instance 
metadata to propagate  (was: Tag EC2 instances in the same call that launches 
them)

 Add sleep() before tagging EC2 instances to allow instance metadata to 
 propagate
 

 Key: SPARK-4983
 URL: https://issues.apache.org/jira/browse/SPARK-4983
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
Reporter: Nicholas Chammas
Priority: Minor
  Labels: starter
 Fix For: 1.3.0, 1.2.2


 We launch EC2 instances in {{spark-ec2}} and then immediately tag them in a 
 separate boto call. Sometimes, EC2 doesn't get enough time to propagate 
 information about the just-launched instances, so when we go to tag them we 
 get a server that doesn't know about them yet.
 This yields the following type of error:
 {code}
 Launching instances...
 Launched 1 slaves in us-east-1b, regid = r-cf780321
 Launched master in us-east-1b, regid = r-da7e0534
 Traceback (most recent call last):
    File "./ec2/spark_ec2.py", line 1284, in <module>
  main()
    File "./ec2/spark_ec2.py", line 1276, in main
  real_main()
    File "./ec2/spark_ec2.py", line 1122, in real_main
  (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
    File "./ec2/spark_ec2.py", line 646, in launch_cluster
  value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
    File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py", line 80, in add_tag
  self.add_tags({key: value}, dry_run)
    File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py", line 97, in add_tags
  dry_run=dry_run
    File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 4202, in create_tags
  return self.get_status('CreateTags', params, verb='POST')
    File ".../spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1223, in get_status
  raise self.ResponseError(response.status, response.reason, body)
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-585219a6' does not exist</Message></Error></Errors><RequestID>b9f1ad6e-59b9-47fd-a693-527be1f779eb</RequestID></Response>
 {code}
 The solution is to tag the instances in the same call that launches them, or 
 less desirably, tag the instances after some short wait.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5369) remove allocatedHostToContainersMap.synchronized in YarnAllocator

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5369:
-
Component/s: YARN

 remove allocatedHostToContainersMap.synchronized in YarnAllocator
 -

 Key: SPARK-5369
 URL: https://issues.apache.org/jira/browse/SPARK-5369
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Lianhui Wang

 As SPARK-1714 mentioned, because YarnAllocator.allocateResources is a 
 synchronized method, we can remove the allocatedHostToContainersMap.synchronized 
 block in YarnAllocator.allocateResources.
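 For illustration, a toy sketch of the pattern being discussed (not the real YarnAllocator code): when the enclosing method already synchronizes on the owning object, a nested synchronized block inside it adds no extra protection for callers that only go through that method.
 {code}
 import scala.collection.mutable

 // Toy example; names invented for the sketch.
 class Allocator {
   private val allocatedHostToContainersMap = mutable.HashMap[String, Int]()

   // The whole method body runs under this.synchronized...
   def allocateResources(): Unit = synchronized {
     // ...so an inner allocatedHostToContainersMap.synchronized { ... } block
     // here is redundant, provided every other access also goes through
     // methods synchronized on the same lock.
     allocatedHostToContainersMap("host1") = 1
   }
 }
 {code}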



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4360) task only execute on one node when spark on yarn

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4360:
-
Component/s: YARN

 task only execute on one node when spark on yarn
 

 Key: SPARK-4360
 URL: https://issues.apache.org/jira/browse/SPARK-4360
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: seekerak

 hadoop version: hadoop 2.0.3-alpha
 spark version: 1.0.2
 When I run Spark jobs on YARN, all the tasks run on only one node. My 
 cluster has 4 nodes and 3 executors, but only one executor gets tasks; the 
 others get none. My command is:
 /opt/hadoopcluster/spark-1.0.2-bin-hadoop2/bin/spark-submit --class 
 org.sr.scala.Spark_LineCount_G0 --executor-memory 2G --num-executors 12 
 --master yarn-cluster /home/Spark_G0.jar /data /output/ou_1
 Does anyone know why?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2971) Orphaned YARN ApplicationMaster lingers forever

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2971:
-
Component/s: YARN

 Orphaned YARN ApplicationMaster lingers forever
 ---

 Key: SPARK-2971
 URL: https://issues.apache.org/jira/browse/SPARK-2971
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
 Environment: Python yarn client mode, Cloudera 5.1.0 on Ubuntu precise
Reporter: Shay Rojansky

 We have cases where if CTRL-C is hit during a Spark job startup, a YARN 
 ApplicationMaster is created but cannot connect to the driver (presumably 
 because the driver has terminated). Once an AM enters this state it never 
 exits it, and has to be manually killed in YARN.
 Here's an excerpt from the AM logs:
 {noformat}
 SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in 
 [jar:file:/yarn/nm/usercache/roji/filecache/40/spark-assembly-1.0.2-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in 
 [jar:file:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
 explanation.
 SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
 14/08/11 16:29:39 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/08/11 16:29:39 INFO SecurityManager: Changing view acls to: roji
 14/08/11 16:29:39 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(roji)
 14/08/11 16:29:40 INFO Slf4jLogger: Slf4jLogger started
 14/08/11 16:29:40 INFO Remoting: Starting remoting
 14/08/11 16:29:40 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
 14/08/11 16:29:40 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
 14/08/11 16:29:40 INFO RMProxy: Connecting to ResourceManager at 
 master.grid.eaglerd.local/192.168.41.100:8030
 14/08/11 16:29:40 INFO ExecutorLauncher: ApplicationAttemptId: 
 appattempt_1407759736957_0014_01
 14/08/11 16:29:40 INFO ExecutorLauncher: Registering the ApplicationMaster
 14/08/11 16:29:40 INFO ExecutorLauncher: Waiting for Spark driver to be 
 reachable.
 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
 master.grid.eaglerd.local:44911, retrying ...
 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
 master.grid.eaglerd.local:44911, retrying ...
 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
 master.grid.eaglerd.local:44911, retrying ...
 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
 master.grid.eaglerd.local:44911, retrying ...
 14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
 master.grid.eaglerd.local:44911, retrying ...
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4346) YarnClientSchedulerBack.asyncMonitorApplication should be common with Client.monitorApplication

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4346:
-
Component/s: YARN

 YarnClientSchedulerBack.asyncMonitorApplication should be common with 
 Client.monitorApplication
 ---

 Key: SPARK-4346
 URL: https://issues.apache.org/jira/browse/SPARK-4346
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Thomas Graves

 The YarnClientSchedulerBackend.asyncMonitorApplication routine should move 
 into ClientBase and be made common with monitorApplication.  Make sure stop 
 is handled properly.
 See discussion on https://github.com/apache/spark/pull/3143



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4941) Yarn cluster mode does not upload all needed jars to driver node (Spark 1.2.0)

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4941:
-
Component/s: YARN

 Yarn cluster mode does not upload all needed jars to driver node (Spark 1.2.0)
 --

 Key: SPARK-4941
 URL: https://issues.apache.org/jira/browse/SPARK-4941
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Gurpreet Singh

 I am specifying additional jars and a config XML file with the --jars and 
 --files options to be uploaded to the driver in the following spark-submit 
 command. However, they are not getting uploaded, which results in job 
 failure. This was working with the Spark 1.0.2 build.
 Spark build being used: spark-1.2.0.tgz
 
 $SPARK_HOME/bin/spark-submit \
 --class com.ebay.inc.scala.testScalaXML \
 --driver-class-path 
 /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1--2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop/share/hadoop/common/lib/hadoop--0.1--2.jar:/apache/hive/lib/mysql-connector-java-5.0.8-bin.jar:/apache/hadoop/share/hadoop/common/lib/guava-11.0.2.jar
  \
 --master yarn \
 --deploy-mode cluster \
 --num-executors 3 \
 --driver-memory 1G  \
 --executor-memory 1G \
 /export/home/b_incdata_rw/gurpreetsingh/jar/testscalaxml_2.11-1.0.jar 
 /export/home/b_incdata_rw/gurpreetsingh/sqlFramework.xml next_gen_linking \
 --queue hdmi-spark \
 --jars 
 /export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-api-jdo-3.2.1.jar,/export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-core-3.2.2.jar,/export/home/b_incdata_rw/gurpreetsingh/jar/datanucleus-rdbms-3.2.1.jar,/apache/hive/lib/mysql-connector-java-5.0.8-bin.jar,/apache/hadoop/share/hadoop/common/lib/hadoop--0.1--2.jar,/apache/hadoop/share/hadoop/common/lib/hadoop-lzo-0.6.0.jar,/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1--2.jar\
 --files 
 /export/home/b_incdata_rw/gurpreetsingh/spark-1.0.2-bin-2.4.1/conf/hive-site.xml
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 14/12/22 23:00:17 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
 to rm2
 14/12/22 23:00:17 INFO yarn.Client: Requesting a new application from cluster 
 with 2026 NodeManagers
 14/12/22 23:00:17 INFO yarn.Client: Verifying our application has not 
 requested more than the maximum memory capability of the cluster (16384 MB 
 per container)
 14/12/22 23:00:17 INFO yarn.Client: Will allocate AM container, with 1408 MB 
 memory including 384 MB overhead
 14/12/22 23:00:17 INFO yarn.Client: Setting up container launch context for 
 our AM
 14/12/22 23:00:17 INFO yarn.Client: Preparing resources for our AM container
 14/12/22 23:00:18 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 14/12/22 23:00:18 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
 feature cannot be used because libhadoop cannot be loaded.
 14/12/22 23:00:21 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 
 6623380 for b_incdata_rw on 10.115.201.75:8020
 14/12/22 23:00:21 INFO yarn.Client: 
 Uploading resource 
 file:/home/b_incdata_rw/gurpreetsingh/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar
  - 
 hdfs://-nn.vip.xxx.com:8020/user/b_incdata_rw/.sparkStaging/application_1419242629195_8432/spark-assembly-1.2.0-hadoop2.4.0.jar
 14/12/22 23:00:24 INFO yarn.Client: Uploading resource 
 file:/export/home/b_incdata_rw/gurpreetsingh/jar/firstsparkcode_2.11-1.0.jar 
 - 
 hdfs://-nn.vip.xxx.com:8020:8020/user/b_incdata_rw/.sparkStaging/application_1419242629195_8432/firstsparkcode_2.11-1.0.jar
 14/12/22 23:00:25 INFO yarn.Client: Setting up the launch environment for our 
 AM container



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4492) Exception when following SimpleApp tutorial java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4492.
--
Resolution: Not a Problem

 Exception when following SimpleApp tutorial java.lang.ClassNotFoundException: 
 org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
 --

 Key: SPARK-4492
 URL: https://issues.apache.org/jira/browse/SPARK-4492
 Project: Spark
  Issue Type: Bug
Reporter: sam

 When I follow the example here 
 https://spark.apache.org/docs/1.0.2/quick-start.html and run with java -cp 
 my.jar my.main.Class with master set to yarn-client I get the below 
 exception.
 Exception in thread "main" java.lang.ExceptionInInitializerError
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
   at com.barclays.SimpleApp$.main(SimpleApp.scala:11)
   at com.barclays.SimpleApp.main(SimpleApp.scala)
 Caused by: org.apache.spark.SparkException: Unable to load YARN support
   at 
 org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:106)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:101)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
   ... 3 more
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
   at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:169)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:102)
   ... 5 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5259) Fix endless retry stage by add task equal() and hashcode() to avoid stage.pendingTasks not empty while stage map output is available

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5259:
-
Fix Version/s: (was: 1.2.0)

 Fix endless retry stage by add task equal() and hashcode() to avoid 
 stage.pendingTasks not empty while stage map output is available 
 -

 Key: SPARK-5259
 URL: https://issues.apache.org/jira/browse/SPARK-5259
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.1, 1.2.0
Reporter: SuYan

 1. While a shuffle stage is being retried, there may be 2 TaskSets running; 
 call them taskSet0.0 and taskSet0.1. taskSet0.1 re-runs taskSet0.0's 
 uncompleted tasks. If taskSet0.0 ends up running all the tasks that 
 taskSet0.1 has not yet completed, but which cover all the partitions, 
 then the stage's isAvailable is true.
 {code}
   def isAvailable: Boolean = {
 if (!isShuffleMap) {
   true
 } else {
   numAvailableOutputs == numPartitions
 }
   } 
 {code}
 But stage.pendingTasks is not empty, which blocks registering the mapStatus 
 in mapOutputTracker. When a task completes successfully, pendingTasks removes 
 the Task by reference (pendingTasks -= task), because Task does not override 
 hashCode() and equals(), whereas numAvailableOutputs is tracked by partition 
 ID.
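 A tiny standalone illustration of that reference-equality point (class names invented for the sketch):
 {code}
 import scala.collection.mutable

 // Without equals()/hashCode(), removal only works for the exact same object.
 class TaskByRef(val stageId: Int, val partitionId: Int)
 val pending = mutable.HashSet(new TaskByRef(1, 0))
 pending -= new TaskByRef(1, 0)      // no-op: different reference
 println(pending.size)               // 1

 // With equals()/hashCode() (e.g. a case class), removal works by value.
 case class TaskByValue(stageId: Int, partitionId: Int)
 val pending2 = mutable.HashSet(TaskByValue(1, 0))
 pending2 -= TaskByValue(1, 0)
 println(pending2.size)              // 0
 {code}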
 here is the testcase to prove:
 {code}
   test("Make sure mapStage.pendingtasks is set() " +
 "while MapStage.isAvailable is true while stage was retry ") {
 val firstRDD = new MyRDD(sc, 6, Nil)
 val firstShuffleDep = new ShuffleDependency(firstRDD, null)
 val firstShuyffleId = firstShuffleDep.shuffleId
 val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep))
 val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
 val shuffleId = shuffleDep.shuffleId
 val reduceRdd = new MyRDD(sc, 2, List(shuffleDep))
 submit(reduceRdd, Array(0, 1))
 complete(taskSets(0), Seq(
   (Success, makeMapStatus("hostB", 1)),
   (Success, makeMapStatus("hostB", 2)),
   (Success, makeMapStatus("hostC", 3)),
   (Success, makeMapStatus("hostB", 4)),
   (Success, makeMapStatus("hostB", 5)),
   (Success, makeMapStatus("hostC", 6))
 ))
 complete(taskSets(1), Seq(
   (Success, makeMapStatus("hostA", 1)),
   (Success, makeMapStatus("hostB", 2)),
   (Success, makeMapStatus("hostA", 1)),
   (Success, makeMapStatus("hostB", 2)),
   (Success, makeMapStatus("hostA", 1))
 ))
 runEvent(ExecutorLost("exec-hostA"))
 runEvent(CompletionEvent(taskSets(1).tasks(0), Resubmitted, null, null, 
 null, null))
 runEvent(CompletionEvent(taskSets(1).tasks(2), Resubmitted, null, null, 
 null, null))
 runEvent(CompletionEvent(taskSets(1).tasks(0),
   FetchFailed(null, firstShuyffleId, -1, 0, "Fetch Mata data failed"),
   null, null, null, null))
 scheduler.resubmitFailedStages()
 runEvent(CompletionEvent(taskSets(1).tasks(0), Success,
   makeMapStatus("hostC", 1), null, null, null))
 runEvent(CompletionEvent(taskSets(1).tasks(2), Success,
   makeMapStatus("hostC", 1), null, null, null))
 runEvent(CompletionEvent(taskSets(1).tasks(4), Success,
   makeMapStatus("hostC", 1), null, null, null))
 runEvent(CompletionEvent(taskSets(1).tasks(5), Success,
   makeMapStatus("hostB", 2), null, null, null))
 val stage = scheduler.stageIdToStage(taskSets(1).stageId)
 assert(stage.attemptId == 2)
 assert(stage.isAvailable)
 assert(stage.pendingTasks.size == 0)
   }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4900) MLlib SingularValueDecomposition ARPACK IllegalStateException

2015-02-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309981#comment-14309981
 ] 

Sean Owen commented on SPARK-4900:
--

Do you have any more info, like how to reproduce this? what were you computing?

 MLlib SingularValueDecomposition ARPACK IllegalStateException 
 --

 Key: SPARK-4900
 URL: https://issues.apache.org/jira/browse/SPARK-4900
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.1, 1.2.0
 Environment: Ubuntu 14.10, Java HotSpot(TM) 64-Bit Server VM (build 
 25.25-b02, mixed mode)
 spark local mode
Reporter: Mike Beyer

 java.lang.reflect.InvocationTargetException
 ...
 Caused by: java.lang.IllegalStateException: ARPACK returns non-zero info = 3 
 Please refer ARPACK user guide for error message.
 at 
 org.apache.spark.mllib.linalg.EigenValueDecomposition$.symmetricEigs(EigenValueDecomposition.scala:120)
 at 
 org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:235)
 at 
 org.apache.spark.mllib.linalg.distributed.RowMatrix.computeSVD(RowMatrix.scala:171)
   ...
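 For context, the call pattern that reaches this code path looks roughly like the following, assuming a SparkContext named sc (e.g. in spark-shell); the data and k here are placeholders, not the reporter's actual workload:
 {code}
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.linalg.distributed.RowMatrix

 // Placeholder data; computeSVD drives ARPACK internally for the truncated SVD.
 val rows = sc.parallelize(Seq(
   Vectors.dense(1.0, 2.0, 3.0),
   Vectors.dense(4.0, 5.0, 6.0),
   Vectors.dense(7.0, 8.0, 9.0)))
 val mat = new RowMatrix(rows)
 val svd = mat.computeSVD(2, computeU = true)
 {code}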



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5525) [SPARK][SQL]

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5525.
--
Resolution: Invalid

 [SPARK][SQL]
 

 Key: SPARK-5525
 URL: https://issues.apache.org/jira/browse/SPARK-5525
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: xukun





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2722) Mechanism for escaping spark configs is not consistent

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2722:
-
Component/s: Spark Core

 Mechanism for escaping spark configs is not consistent
 --

 Key: SPARK-2722
 URL: https://issues.apache.org/jira/browse/SPARK-2722
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Andrew Or

 Currently, you can specify a spark config in spark-defaults.conf as follows:
 {code}
 spark.magic "Mr. Johnson"
 {code}
 and this will preserve the double quotes as part of the string. Naturally, if 
 you want to do the equivalent in spark.*.extraJavaOptions, you would use the 
 following:
 {code}
 spark.executor.extraJavaOptions -Dmagic=\"Mr. Johnson\"
 {code}
 However, this fails because the backslashes go away and it tries to interpret 
 Johnson as the main class argument. Instead, you have to do the following:
 {code}
 spark.executor.extraJavaOptions -Dmagic=\\\"Mr. Johnson\\\"
 {code}
 which is not super intuitive.
 Note that this only applies to standalone mode. In YARN it's not even 
 possible to use quoted strings in config values (SPARK-2718).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5601) Make streaming algorithms Java-friendly

2015-02-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5601.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4432
[https://github.com/apache/spark/pull/4432]

 Make streaming algorithms Java-friendly
 ---

 Key: SPARK-5601
 URL: https://issues.apache.org/jira/browse/SPARK-5601
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Streaming
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.3.0


 Streaming algorithms take DStream. We should also support JavaDStream and 
 JavaPairDStream for Java users.
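 A sketch of the usual pattern for this kind of change (the class here is a stand-in, not the actual MLlib code): add an overload that accepts the Java wrapper and delegates to the existing DStream-based method.
 {code}
 import org.apache.spark.streaming.api.java.JavaDStream
 import org.apache.spark.streaming.dstream.DStream

 class StreamingAlgoStub[T] {
   def trainOn(data: DStream[T]): Unit = { /* existing Scala-facing API */ }

   // Java-friendly overload: unwrap and delegate.
   def trainOn(data: JavaDStream[T]): Unit = trainOn(data.dstream)
 }
 {code}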



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5586) Automatically provide sqlContext in Spark shell

2015-02-06 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5586.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4387
[https://github.com/apache/spark/pull/4387]

 Automatically provide sqlContext in Spark shell
 ---

 Key: SPARK-5586
 URL: https://issues.apache.org/jira/browse/SPARK-5586
 Project: Spark
  Issue Type: Improvement
  Components: Spark Shell, SQL
Reporter: Patrick Wendell
Assignee: shengli
Priority: Blocker
 Fix For: 1.3.0


 A simple patch, but we should create a sqlContext (and, if supported by the 
 build, a Hive context) in the Spark shell when it's created, and import the 
 DSL. We can just call it sqlContext. This would save us so much time writing 
 code examples :P
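 A minimal sketch of what the shell could evaluate at startup (an assumption about the shape of the change, not the merged implementation):
 {code}
 // Build the SQL context from the shell's existing SparkContext and import
 // its members so the DSL / implicit conversions are available to examples.
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext._
 {code}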



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4983) Tag EC2 instances in the same call that launches them

2015-02-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4983.
---
   Resolution: Fixed
Fix Version/s: 1.2.2
   1.3.0

Issue resolved by pull request 3986
[https://github.com/apache/spark/pull/3986]

 Tag EC2 instances in the same call that launches them
 -

 Key: SPARK-4983
 URL: https://issues.apache.org/jira/browse/SPARK-4983
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
Reporter: Nicholas Chammas
Priority: Minor
  Labels: starter
 Fix For: 1.3.0, 1.2.2


 We launch EC2 instances in {{spark-ec2}} and then immediately tag them in a 
 separate boto call. Sometimes, EC2 doesn't get enough time to propagate 
 information about the just-launched instances, so when we go to tag them we 
 get a server that doesn't know about them yet.
 This yields the following type of error:
 {code}
 Launching instances...
 Launched 1 slaves in us-east-1b, regid = r-cf780321
 Launched master in us-east-1b, regid = r-da7e0534
 Traceback (most recent call last):
    File "./ec2/spark_ec2.py", line 1284, in <module>
  main()
    File "./ec2/spark_ec2.py", line 1276, in main
  real_main()
    File "./ec2/spark_ec2.py", line 1122, in real_main
  (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
    File "./ec2/spark_ec2.py", line 646, in launch_cluster
  value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
    File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py", line 80, in add_tag
  self.add_tags({key: value}, dry_run)
    File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py", line 97, in add_tags
  dry_run=dry_run
    File ".../spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 4202, in create_tags
  return self.get_status('CreateTags', params, verb='POST')
    File ".../spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1223, in get_status
  raise self.ResponseError(response.status, response.reason, body)
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The instance ID 'i-585219a6' does not exist</Message></Error></Errors><RequestID>b9f1ad6e-59b9-47fd-a693-527be1f779eb</RequestID></Response>
 {code}
 The solution is to tag the instances in the same call that launches them, or 
 less desirably, tag the instances after some short wait.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-706) Failures in block manager put leads to task hanging

2015-02-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-706.
-
Resolution: Cannot Reproduce

 Failures in block manager put leads to task hanging
 ---

 Key: SPARK-706
 URL: https://issues.apache.org/jira/browse/SPARK-706
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 0.6.0, 0.6.1, 0.7.0, 0.6.2
Reporter: Reynold Xin

 Reported in this thread: 
 https://groups.google.com/forum/?fromgroups=#!topic/shark-users/Q_SiIDzVtZw
 The following exception in block manager leaves the block marked as pending. 
 {code}
 13/02/26 06:14:56 ERROR executor.Executor: Exception in task ID 39
 com.esotericsoftware.kryo.SerializationException: Buffer limit exceeded 
 writing object of type: shark.ColumnarWritable
   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:492)
   at spark.KryoSerializationStream.writeObject(KryoSerializer.scala:78)
   at 
 spark.serializer.SerializationStream$class.writeAll(Serializer.scala:58)
   at spark.KryoSerializationStream.writeAll(KryoSerializer.scala:73)
   at spark.storage.DiskStore.putValues(DiskStore.scala:63)
   at spark.storage.BlockManager.dropFromMemory(BlockManager.scala:779)
   at spark.storage.MemoryStore.tryToPut(MemoryStore.scala:162)
   at spark.storage.MemoryStore.putValues(MemoryStore.scala:57)
   at spark.storage.BlockManager.put(BlockManager.scala:582)
   at spark.CacheTracker.getOrCompute(CacheTracker.scala:215)
   at spark.RDD.iterator(RDD.scala:159)
   at spark.scheduler.ResultTask.run(ResultTask.scala:18)
   at spark.executor.Executor$TaskRunner.run(Executor.scala:76)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
   at java.lang.Thread.run(Thread.java:679)
 {code}
 When the block is read, the task is stuck in BlockInfo.waitForReady().
 We should propagate the error back to the master instead of hanging the slave 
 node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5600) Sort order of unfinished apps can be wrong in History Server

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5600.

  Resolution: Fixed
Assignee: Marcelo Vanzin
Target Version/s: 1.3.0

 Sort order of unfinished apps can be wrong in History Server
 

 Key: SPARK-5600
 URL: https://issues.apache.org/jira/browse/SPARK-5600
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Minor
 Fix For: 1.3.0


 The code that merges new logs with old logs sorts applications by their end 
 time only. Unfinished apps all have the same end time (-1), so the sort order 
 ends up being undefined.
 This was uncovered by the attempt to fix SPARK-5345 
 (https://github.com/apache/spark/pull/4133).
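 An illustrative sketch of the problem and one possible fix, using plain Scala collections rather than the actual HistoryServer code (names are invented):
 {code}
 case class AppInfo(id: String, startTime: Long, endTime: Long)

 val apps = Seq(AppInfo("a", 100L, -1L), AppInfo("b", 300L, -1L), AppInfo("c", 50L, 400L))

 // Sorting by endTime alone leaves the two unfinished apps (endTime == -1)
 // in an arbitrary relative order; adding startTime as a tiebreaker makes
 // the ordering deterministic.
 val sorted = apps.sortBy(a => (a.endTime, a.startTime))(Ordering[(Long, Long)].reverse)
 {code}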



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4994) Cleanup removed executors' ShuffleInfo in yarn shuffle service

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4994.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Assignee: Lianhui Wang
Target Version/s: 1.3.0

 Cleanup removed executors' ShuffleInfo in yarn shuffle service
 --

 Key: SPARK-4994
 URL: https://issues.apache.org/jira/browse/SPARK-4994
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Lianhui Wang
Assignee: Lianhui Wang
 Fix For: 1.3.0


 When the application is completed, YARN's NodeManager can remove the 
 application's local dirs, but the metadata of the completed application's 
 executors has not been removed. This causes the YARN shuffle service to use 
 more and more memory to store executors' ShuffleInfo, so this metadata needs 
 to be removed.
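 A minimal sketch of the idea (class and method names are assumptions, not the real shuffle service API): drop every executor's shuffle info for an application once YARN reports the application as completed.
 {code}
 import java.util.concurrent.ConcurrentHashMap
 import scala.collection.JavaConverters._

 case class ExecShuffleInfo(localDirs: Seq[String], shuffleManager: String)

 // Keyed by "appId_execId", as a stand-in for the service's metadata map.
 val executors = new ConcurrentHashMap[String, ExecShuffleInfo]()

 def applicationRemoved(appId: String): Unit = {
   val prefix = appId + "_"
   executors.keySet().asScala.toList
     .filter(_.startsWith(prefix))
     .foreach(key => executors.remove(key))
 }
 {code}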



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5531) Spark download .tgz file does not get unpacked

2015-02-06 Thread DeepakVohra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310147#comment-14310147
 ] 

DeepakVohra commented on SPARK-5531:


Thanks for updating the download links.

 Spark download .tgz file does not get unpacked
 --

 Key: SPARK-5531
 URL: https://issues.apache.org/jira/browse/SPARK-5531
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: Linux
Reporter: DeepakVohra

 The spark-1.2.0-bin-cdh4.tgz file downloaded from 
 http://spark.apache.org/downloads.html does not get unpacked.
 tar xvf spark-1.2.0-bin-cdh4.tgz
 gzip: stdin: not in gzip format
 tar: Child returned status 1
 tar: Error is not recoverable: exiting now



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2996) Standalone and Yarn have different settings for adding the user classpath first

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2996:
-
Priority: Major  (was: Minor)

 Standalone and Yarn have different settings for adding the user classpath 
 first
 ---

 Key: SPARK-2996
 URL: https://issues.apache.org/jira/browse/SPARK-2996
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.0.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin

 Standalone uses spark.files.userClassPathFirst while Yarn uses 
 spark.yarn.user.classpath.first. Adding support for the former in Yarn 
 should be pretty trivial.
 Don't know if Mesos has anything similar.
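 For reference, the two settings being compared, set programmatically (illustrative only; the property names are the ones quoted above):
 {code}
 import org.apache.spark.SparkConf

 val conf = new SparkConf()
   .set("spark.files.userClassPathFirst", "true")   // honored by standalone mode
   .set("spark.yarn.user.classpath.first", "true")  // the YARN-specific setting
 {code}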



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5531) Spark download .tgz file does not get unpacked

2015-02-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14310161#comment-14310161
 ] 

Sean Owen commented on SPARK-5531:
--

What do you mean? Nobody changed the site. I'll recap:

- Apache provides mirrors for all its projects' distributions. 
http://www.apache.org/dyn/closer.cgi/spark/spark-1.2.0/spark-1.2.0-bin-cdh4.tgz 
is a link to the mirror redirector from Apache.
- Spark also provides direct downloads from S3 (Cloudfront). That's the 
http://d3kbcqa49mib13.cloudfront.net/spark-1.2.0-bin-cdh4.tgz link you get
- Note that choosing Direct Download or Mirror changes the hyperlink with 
JavaScript; you don't see any change or go to a new page.
- The archive in question appears correct in both places

 Spark download .tgz file does not get unpacked
 --

 Key: SPARK-5531
 URL: https://issues.apache.org/jira/browse/SPARK-5531
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: Linux
Reporter: DeepakVohra

 The spark-1.2.0-bin-cdh4.tgz file downloaded from 
 http://spark.apache.org/downloads.html does not get unpacked.
 tar xvf spark-1.2.0-bin-cdh4.tgz
 gzip: stdin: not in gzip format
 tar: Child returned status 1
 tar: Error is not recoverable: exiting now



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5531) Spark download .tgz file does not get unpacked

2015-02-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14309948#comment-14309948
 ] 

Sean Owen commented on SPARK-5531:
--

They don't. Look at the page again.

 Spark download .tgz file does not get unpacked
 --

 Key: SPARK-5531
 URL: https://issues.apache.org/jira/browse/SPARK-5531
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: Linux
Reporter: DeepakVohra

 The spark-1.2.0-bin-cdh4.tgz file downloaded from 
 http://spark.apache.org/downloads.html does not get unpacked.
 tar xvf spark-1.2.0-bin-cdh4.tgz
 gzip: stdin: not in gzip format
 tar: Child returned status 1
 tar: Error is not recoverable: exiting now



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3753) Spark hive join results in empty with shared hive context

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3753:
-
Component/s: SQL

 Spark hive join results in empty with shared hive context
 -

 Key: SPARK-3753
 URL: https://issues.apache.org/jira/browse/SPARK-3753
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Hector Yee
Priority: Minor

 When I have two Hive tables and do a join with the same HiveContext, I get 
 the empty set, e.g.:
 val hc = new HiveContext(sc)
 val table1 = hc.sql("SELECT * from t1")
 val table2 = hc.sql("SELECT * from t2")
 val intersect = table1.join(table2).take(10)
 // empty set
 but this works if I do:
 val hc1 = new HiveContext(sc)
 val table1 = hc1.sql("SELECT * from t1")
 val hc2 = new HiveContext(sc)
 val table2 = hc2.sql("SELECT * from t2")
 val intersect = table1.join(table2).take(10)
 I am not sure if take is propagating the take up to table1 and table2 and 
 then doing the intersect (in the case of large tables that means no results), 
 or if it is some other problem with the Hive context.
 Doing the join in one SQL query also seems to result in the empty set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5109) Loading multiple parquet files into a single SchemaRDD

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5109:
-
Component/s: SQL

 Loading multiple parquet files into a single SchemaRDD
 --

 Key: SPARK-5109
 URL: https://issues.apache.org/jira/browse/SPARK-5109
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Sam Steingold

 {{[SQLContext.parquetFile(String)|http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/sql/SQLContext.html#parquetFile%28java.lang.String%29]}}
  accepts a comma-separated list of files to load.
 This feature prevents loading files with commas in their names (a rare use 
 case, admittedly), and it is also an _extremely_ unusual feature.
 This feature should be deprecated and new methods
 {code}
 SQLContext.parquetFile(String[])
 SQLContext.parquetFile(List<String>)
 {code} 
 should be added instead.
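 In the meantime, a possible workaround sketch (the paths and the sqlContext value are placeholders): load each file separately and union the resulting SchemaRDDs.
 {code}
 val files = Seq("/data/part-1.parquet", "/data/part-2.parquet")
 val combined = files.map(sqlContext.parquetFile).reduce(_ unionAll _)
 {code}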



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5444) 'spark.blockManager.port' conflict in netty service

2015-02-06 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5444.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Assignee: SaintBacchus
Target Version/s: 1.3.0

 'spark.blockManager.port' conflict in netty service
 ---

 Key: SPARK-5444
 URL: https://issues.apache.org/jira/browse/SPARK-5444
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 1.2.0
Reporter: SaintBacchus
Assignee: SaintBacchus
 Fix For: 1.3.0


 If 'spark.blockManager.port' is set to 4040 in spark-defaults.conf, it will 
 throw a port-conflict exception and exit directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5041) hive-exec jar should be generated with JDK 6

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5041:
-
Component/s: SQL

 hive-exec jar should be generated with JDK 6
 

 Key: SPARK-5041
 URL: https://issues.apache.org/jira/browse/SPARK-5041
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Ted Yu
  Labels: jdk1.7, maven

 Shixiong Zhu first reported the issue where hive-exec-0.12.0-protobuf-2.5.jar 
 cannot be used by a Spark program running on JDK 6.
 See http://search-hadoop.com/m/JW1q5YLCNN
 hive-exec-0.12.0-protobuf-2.5.jar was generated with JDK 7. It should be 
 generated with JDK 6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4985) Parquet support for date type

2015-02-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4985:
-
Component/s: SQL

 Parquet support for date type
 -

 Key: SPARK-4985
 URL: https://issues.apache.org/jira/browse/SPARK-4985
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Adrian Wang

 This is currently blocked by SPARK-4508



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


