[jira] [Resolved] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2014-10-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-2630.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

 Input data size of CoalescedRDD is incorrect
 

 Key: SPARK-2630
 URL: https://issues.apache.org/jira/browse/SPARK-2630
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.0.0, 1.0.1
Reporter: Davies Liu
Assignee: Andrew Ash
Priority: Blocker
 Fix For: 1.2.0

 Attachments: overflow.tiff


 Given one big file, such as text.4.3G, put it into a single task:
 {code}
 sc.textFile("text.4.3.G").coalesce(1).count()
 {code}
 In the Spark Web UI, you will see that the input size is 5.4M. 






[jira] [Reopened] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2014-10-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reopened SPARK-2630:
---

not merged yet, sorry.

 Input data size of CoalescedRDD is incorrect
 

 Key: SPARK-2630
 URL: https://issues.apache.org/jira/browse/SPARK-2630
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.0.0, 1.0.1
Reporter: Davies Liu
Assignee: Andrew Ash
Priority: Blocker
 Fix For: 1.2.0

 Attachments: overflow.tiff


 Given one big file, such as text.4.3G, put it into a single task:
 {code}
 sc.textFile("text.4.3.G").coalesce(1).count()
 {code}
 In the Spark Web UI, you will see that the input size is 5.4M. 






[jira] [Commented] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...

2014-10-03 Thread Ángel Álvarez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157765#comment-14157765
 ] 

Ángel Álvarez commented on SPARK-2256:
--

It seems the problem has been solved in Spark 1.1.0 !!! 


 pyspark: RDD.take doesn't work ... sometimes ...
 --

 Key: SPARK-2256
 URL: https://issues.apache.org/jira/browse/SPARK-2256
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
 Environment: local file/remote HDFS
Reporter: Ángel Álvarez
  Labels: RDD, pyspark, take, windows
 Attachments: A_test.zip


 If I try to take some lines from a file, sometimes it doesn't work.
 Code: 
 myfile = sc.textFile("A_ko")
 print myfile.take(10)
 Stacktrace:
 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19
 Traceback (most recent call last):
   File "mytest.py", line 19, in <module>
     print myfile.take(10)
   File "spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py", line 868, in take
     iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator()
   File "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py", 
 line 537, in __call__
   File "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py", 
 line 300, in get_return_value
 Test data:
 START TEST DATA
 A
 A
 A
 
 
 
 
 
 
 
 
 
 

[jira] [Created] (SPARK-3775) Not suitable error message in spark-shell.cmd

2014-10-03 Thread Masayoshi TSUZUKI (JIRA)
Masayoshi TSUZUKI created SPARK-3775:


 Summary: Not suitable error message in spark-shell.cmd
 Key: SPARK-3775
 URL: https://issues.apache.org/jira/browse/SPARK-3775
 Project: Spark
  Issue Type: Improvement
Reporter: Masayoshi TSUZUKI
Priority: Trivial


In a Windows environment, when we execute bin\spark-shell.cmd before building 
Spark, we get an error message like this:

{quote}
Failed to find Spark assembly JAR.
You need to build Spark with sbt\sbt assembly before running this program.
{quote}

But this message is not suitable because:
* Maven can also be used to build Spark, and it now works on Windows without 
cygwin ([SPARK-3061]).
* The equivalent error message of the Linux version (bin/spark-shell) doesn't 
mention how to build:
bq. You need to build Spark before running this program.
* sbt\sbt can't be executed on Windows without cygwin because it's a bash script.

So this message should be changed to match the Linux version.







[jira] [Commented] (SPARK-3775) Not suitable error message in spark-shell.cmd

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157818#comment-14157818
 ] 

Apache Spark commented on SPARK-3775:
-

User 'tsudukim' has created a pull request for this issue:
https://github.com/apache/spark/pull/2640

 Not suitable error message in spark-shell.cmd
 -

 Key: SPARK-3775
 URL: https://issues.apache.org/jira/browse/SPARK-3775
 Project: Spark
  Issue Type: Improvement
Reporter: Masayoshi TSUZUKI
Priority: Trivial

 In a Windows environment, when we execute bin\spark-shell.cmd before building 
 Spark, we get an error message like this:
 {quote}
 Failed to find Spark assembly JAR.
 You need to build Spark with sbt\sbt assembly before running this program.
 {quote}
 But this message is not suitable because:
 * Maven can also be used to build Spark, and it now works on Windows without 
 cygwin ([SPARK-3061]).
 * The equivalent error message of the Linux version (bin/spark-shell) doesn't 
 mention how to build:
 bq. You need to build Spark before running this program.
 * sbt\sbt can't be executed on Windows without cygwin because it's a bash 
 script.
 So this message should be changed to match the Linux version.






[jira] [Resolved] (SPARK-3366) Compute best splits distributively in decision tree

2014-10-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3366.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2595
[https://github.com/apache/spark/pull/2595]

 Compute best splits distributively in decision tree
 ---

 Key: SPARK-3366
 URL: https://issues.apache.org/jira/browse/SPARK-3366
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Qiping Li
 Fix For: 1.2.0


 The current implementation computes all best splits locally on the driver, 
 which makes the driver a bottleneck for both communication and computation. 
 It would be nice if we could compute the best splits distributively.






[jira] [Created] (SPARK-3776) Wrong conversion to Catalyst for Option[Product]

2014-10-03 Thread Renat Yusupov (JIRA)
Renat Yusupov created SPARK-3776:


 Summary: Wrong conversion to Catalyst for Option[Product]
 Key: SPARK-3776
 URL: https://issues.apache.org/jira/browse/SPARK-3776
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Renat Yusupov
 Fix For: 1.2.0


Method ScalaReflection.convertToCatalyst makes a wrong conversion for 
Option[Product] data.
For example:
{code}
case class A(intValue: Int)
case class B(optionA: Option[A])
val b = B(Some(A(5)))
{code}
ScalaReflection.convertToCatalyst(b) returns Seq(A(5)) instead of Seq(Seq(5)).
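
The expected behaviour can be illustrated with a simplified, stand-alone 
converter (a hedged sketch only, not Spark's actual implementation) that 
recurses into Option and Product values:

{code}
// Minimal sketch of the expected recursive conversion: descend into Options
// and Products so that B(Some(A(5))) becomes Seq(Seq(5)) rather than Seq(A(5)).
def toCatalyst(a: Any): Any = a match {
  case opt: Option[_] => opt.map(toCatalyst).orNull   // unwrap the Option
  case p: Product     => p.productIterator.map(toCatalyst).toSeq
  case other          => other                        // primitives pass through
}

toCatalyst(B(Some(A(5))))   // Seq(Seq(5)), as expected above
{code}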






[jira] [Commented] (SPARK-3776) Wrong conversion to Catalyst for Option[Product]

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157913#comment-14157913
 ] 

Apache Spark commented on SPARK-3776:
-

User 'r3natko' has created a pull request for this issue:
https://github.com/apache/spark/pull/2641

 Wrong conversion to Catalyst for Option[Product]
 

 Key: SPARK-3776
 URL: https://issues.apache.org/jira/browse/SPARK-3776
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Renat Yusupov
 Fix For: 1.2.0


 Method ScalaReflection.convertToCatalyst makes a wrong conversion for 
 Option[Product] data.
 For example:
 {code}
 case class A(intValue: Int)
 case class B(optionA: Option[A])
 val b = B(Some(A(5)))
 {code}
 ScalaReflection.convertToCatalyst(b) returns Seq(A(5)) instead of Seq(Seq(5)).






[jira] [Commented] (SPARK-2421) Spark should treat writable as serializable for keys

2014-10-03 Thread Brian Husted (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157936#comment-14157936
 ] 

Brian Husted commented on SPARK-2421:
-

To work around the problem, one must map the Writable to a String 
(org.apache.hadoop.io.Text in the case below). This is an issue when sorting 
large amounts of data, since Spark will attempt to write out the entire dataset 
(spill) to perform the data conversion. On a 500GB file this fills up more 
than 100GB of space on each node in our 12-node cluster, which is very 
inefficient. We are currently using Spark 1.0.2. Any thoughts here are 
appreciated.

Our code that attempts to mimic a map/reduce sort in Spark:

{code}
// read in the hadoop sequence file to sort
val file = sc.sequenceFile(input, classOf[Text], classOf[Text])

// this is the code we would like to avoid: it maps the Hadoop Text input to
// Strings so that sortByKey will run
val converted = file.map { case (k, v) => (k.toString(), v.toString()) }

// perform the sort on the converted data
val sortedOutput = converted.sortByKey(true, 1)

// write out the results as a sequence file
sortedOutput.saveAsSequenceFile(output, Some(classOf[DefaultCodec]))
{code}

 Spark should treat writable as serializable for keys
 

 Key: SPARK-2421
 URL: https://issues.apache.org/jira/browse/SPARK-2421
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, Java API
Affects Versions: 1.0.0
Reporter: Xuefu Zhang

 It seems that Spark requires the key to be serializable (i.e. the class must 
 implement the Serializable interface). In the Hadoop world, the Writable 
 interface is used for the same purpose. A lot of existing classes, while 
 Writable, are not considered Serializable by Spark. It would be nice if Spark 
 could treat Writable as serializable and automatically serialize and 
 de-serialize these classes using the Writable interface.
 This was identified in HIVE-7279, but its benefits are global.
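
The idea described above could look roughly like the following sketch (hedged; 
an illustration of the general approach, not Spark's implementation): wrap a 
Writable in a Serializable holder that delegates to the Writable's own 
write/readFields methods.

{code}
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.io.Writable

// Serializable wrapper around any Writable: Java serialization calls
// writeObject/readObject, which delegate to the Writable interface.
class WritableHolder[T <: Writable](@transient var value: T) extends Serializable {
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.writeObject(value.getClass)   // remember the concrete Writable class
    value.write(out)                  // ObjectOutputStream is a DataOutput
  }
  private def readObject(in: ObjectInputStream): Unit = {
    val cls = in.readObject().asInstanceOf[Class[T]]
    value = cls.newInstance()         // requires a no-arg constructor
    value.readFields(in)              // ObjectInputStream is a DataInput
  }
}
{code}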






[jira] [Created] (SPARK-3777) Display Executor ID for Tasks in Stage page

2014-10-03 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-3777:
---

 Summary: Display Executor ID for Tasks in Stage page
 Key: SPARK-3777
 URL: https://issues.apache.org/jira/browse/SPARK-3777
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0, 1.0.2, 1.0.0
Reporter: Shixiong Zhu
Priority: Minor


Now the Stage page only displays Executor(host) for tasks. However, there may 
be more than one executor running on the same host. Currently, when a task 
hangs, I only know the host of the faulty executor, so I have to check all 
executors on that host.

Adding the Executor ID would help locate the faulty executor. 






[jira] [Commented] (SPARK-3777) Display Executor ID for Tasks in Stage page

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157944#comment-14157944
 ] 

Apache Spark commented on SPARK-3777:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/2642

 Display Executor ID for Tasks in Stage page
 -

 Key: SPARK-3777
 URL: https://issues.apache.org/jira/browse/SPARK-3777
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.0.0, 1.0.2, 1.1.0
Reporter: Shixiong Zhu
Priority: Minor
  Labels: easy

 Now the Stage page only displays Executor(host) for tasks. However, there may 
 be more than one executor running on the same host. Currently, when a task 
 hangs, I only know the host of the faulty executor, so I have to check all 
 executors on that host.
 Adding the Executor ID would help locate the faulty executor. 






[jira] [Resolved] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.

2014-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3764.
--
Resolution: Not a Problem

 Invalid dependencies of artifacts in Maven Central Repository.
 --

 Key: SPARK-3764
 URL: https://issues.apache.org/jira/browse/SPARK-3764
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.1.0
Reporter: Takuya Ueshin

 While testing my spark applications locally using spark artifacts downloaded 
 from Maven Central, the following exception was thrown:
 {quote}
 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
 Thread[Executor task launch worker-2,5,main]
 java.lang.IncompatibleClassChangeError: Found class 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
   at 
 org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 {quote}
 This is because the hadoop class {{TaskAttemptContext}} is incompatible 
 between hadoop-1 and hadoop-2.
 I guess the spark artifacts in Maven Central were built against hadoop-2 with 
 Maven, but the hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
 hadoop version mismatch happens.
 FYI:
 sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
 resolved correctly.
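
Until the published pom is fixed, one possible user-side workaround (a hedged 
sketch; artifact names and versions are only illustrative and should match your 
cluster) is to declare a matching hadoop-client dependency explicitly in the 
application build, so the transitive hadoop 1.0.4 dependency does not win 
resolution. In an sbt build definition:

{code}
// build.sbt (sketch): pin hadoop-client to the Hadoop 2.x version actually
// used by the cluster so it overrides the 1.0.4 pulled in transitively.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"     % "1.1.0",
  "org.apache.hadoop" %  "hadoop-client" % "2.4.0"
)
{code}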






[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path

2014-10-03 Thread Tom Weber (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157946#comment-14157946
 ] 

Tom Weber commented on SPARK-3769:
--

I believe I originally called it on the driver side, but the addFile call makes 
a local copy, so when you call it there, you get the local copy's path, which 
isn't the same path as where the file ends up on the remote worker nodes.
I'm fine with stripping the path off and only passing the file name itself to 
the get call.
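
For reference, a minimal sketch of that workaround (hedged; the path is just 
the example from the report below):

{code}
import org.apache.spark.SparkFiles

val path = "/opt/tom/SparkFiles.sas"
sc.addFile(path)                                // driver side: distribute the file

// inside a task on a worker: pass only the file name, not the full driver path
val fileName = new java.io.File(path).getName   // "SparkFiles.sas"
val localPath = SparkFiles.get(fileName)
{code}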

 SparkFiles.get gives me the wrong fully qualified path
 --

 Key: SPARK-3769
 URL: https://issues.apache.org/jira/browse/SPARK-3769
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.0.2, 1.1.0
 Environment: linux host, and linux grid.
Reporter: Tom Weber
Priority: Minor

 My Spark program runs on my host (submitting work to my grid):
 JavaSparkContext sc = new JavaSparkContext(conf);
 final String path = args[1];
 sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */
 The log shows:
 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to 
 /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas
 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at 
 http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986
 those are paths on my host machine. The location that this file gets on grid 
 nodes is:
 /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas
 While the call to get the path in my code that runs in my mapPartitions 
 function on the grid nodes is:
 String pgm = SparkFiles.get(path);
 And this returns the following string:
 /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas
 So, am I expected to take the qualified path that was given to me and parse 
 it to get only the file name at the end, and then concatenate that to what I 
 get from the SparkFiles.getRootDirectory() call in order to get this to work?
 Or pass only the parsed file name to the SparkFiles.get method? Seems as 
 though I should be able to pass the same file specification to both 
 sc.addFile() and SparkFiles.get() and get the correct location of the file.
 Thanks,
 Tom






[jira] [Created] (SPARK-3778) newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn

2014-10-03 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3778:


 Summary: newAPIHadoopRDD doesn't properly pass credentials for 
secure hdfs on yarn
 Key: SPARK-3778
 URL: https://issues.apache.org/jira/browse/SPARK-3778
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Thomas Graves


The newAPIHadoopRDD routine doesn't properly add the credentials to the conf to 
be able to access secure hdfs.

Note that newAPIHadoopFile does handle this, because 
org.apache.hadoop.mapreduce.Job automatically adds the credentials for you.
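
A possible user-side workaround suggested by that note (a hedged sketch; the 
input format, key/value classes and path below are only illustrative) is to 
wrap the Configuration in a Job before handing it to newAPIHadoopRDD, so the 
credentials get attached:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val hadoopConf = new Configuration()
hadoopConf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///secure/path")

// Creating a Job attaches the current user's credentials to its configuration.
val job = Job.getInstance(hadoopConf)

val rdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])
{code}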






[jira] [Closed] (SPARK-2256) pyspark: RDD.take doesn't work ... sometimes ...

2014-10-03 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-2256.

   Resolution: Fixed
Fix Version/s: 1.1.0

 pyspark: RDD.take doesn't work ... sometimes ...
 --

 Key: SPARK-2256
 URL: https://issues.apache.org/jira/browse/SPARK-2256
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
 Environment: local file/remote HDFS
Reporter: Ángel Álvarez
  Labels: RDD, pyspark, take, windows
 Fix For: 1.1.0

 Attachments: A_test.zip


 If I try to take some lines from a file, sometimes it doesn't work.
 Code: 
 myfile = sc.textFile("A_ko")
 print myfile.take(10)
 Stacktrace:
 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19
 Traceback (most recent call last):
   File "mytest.py", line 19, in <module>
     print myfile.take(10)
   File "spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py", line 868, in take
     iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator()
   File "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py", 
 line 537, in __call__
   File "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py", 
 line 300, in get_return_value
 Test data:
 START TEST DATA
 A
 Test data:
 START TEST DATA
 A
 A
 A
 
 
 
 
 
 
 
 
 
 

[jira] [Created] (SPARK-3779) yarn spark.yarn.applicationMaster.waitTries config should be changed to a time period

2014-10-03 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3779:


 Summary: yarn spark.yarn.applicationMaster.waitTries config should 
be changed to a time period
 Key: SPARK-3779
 URL: https://issues.apache.org/jira/browse/SPARK-3779
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves


In PR https://github.com/apache/spark/pull/2577 I added support for using 
spark.yarn.applicationMaster.waitTries in client mode. But the time it waits 
between loops is different, so it could be confusing to the user. We also don't 
document how long each loop is, so this config really isn't clear.

We should just change this config to be time-based (ms or seconds). 






[jira] [Created] (SPARK-3780) YarnAllocator should look at the container completed diagnostic message

2014-10-03 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3780:


 Summary: YarnAllocator should look at the container completed 
diagnostic message
 Key: SPARK-3780
 URL: https://issues.apache.org/jira/browse/SPARK-3780
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves


Yarn will give us a diagnostic message along with a container-complete 
notification. We should print that diagnostic message for the Spark user.

For instance, I believe that if the container gets killed for being over its 
memory limit, Yarn would give us a useful diagnostic saying so. This would be 
really useful for the user to be able to see.






[jira] [Created] (SPARK-3781) code style format

2014-10-03 Thread sjk (JIRA)
sjk created SPARK-3781:
--

 Summary: code style format
 Key: SPARK-3781
 URL: https://issues.apache.org/jira/browse/SPARK-3781
 Project: Spark
  Issue Type: Improvement
Reporter: sjk









[jira] [Commented] (SPARK-3781) code style format

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158074#comment-14158074
 ] 

Apache Spark commented on SPARK-3781:
-

User 'shijinkui' has created a pull request for this issue:
https://github.com/apache/spark/pull/2644

 code style format
 -

 Key: SPARK-3781
 URL: https://issues.apache.org/jira/browse/SPARK-3781
 Project: Spark
  Issue Type: Improvement
Reporter: sjk








[jira] [Commented] (SPARK-3783) The type parameters for SparkContext.accumulable are inconsistent with Accumulable itself

2014-10-03 Thread Nathan Kronenfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158176#comment-14158176
 ] 

Nathan Kronenfeld commented on SPARK-3783:
--

https://github.com/apache/spark/pull/2637

 The type parameters for SparkContext.accumulable are inconsistent with 
 Accumulable itself
 

 Key: SPARK-3783
 URL: https://issues.apache.org/jira/browse/SPARK-3783
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Nathan Kronenfeld
Priority: Minor
   Original Estimate: 10m
  Remaining Estimate: 10m

 SparkContext.accumulable takes type parameters [T, R] and passes them to 
 Accumulable, in that order.
 Accumulable takes type parameters [R, T].
 So T for SparkContext.accumulable corresponds with R for Accumulable and vice 
 versa.
 Minor, but very confusing.






[jira] [Created] (SPARK-3785) Support off-loading computations to a GPU

2014-10-03 Thread Thomas Darimont (JIRA)
Thomas Darimont created SPARK-3785:
--

 Summary: Support off-loading computations to a GPU
 Key: SPARK-3785
 URL: https://issues.apache.org/jira/browse/SPARK-3785
 Project: Spark
  Issue Type: Brainstorming
  Components: MLlib
Reporter: Thomas Darimont
Priority: Minor


Are there any plans to add support for off-loading computations to the GPU, 
e.g. via an OpenCL binding? 

http://www.jocl.org/
https://code.google.com/p/javacl/
http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL






[jira] [Closed] (SPARK-2058) SPARK_CONF_DIR should override all present configs

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-2058.

   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1

 SPARK_CONF_DIR should override all present configs
 --

 Key: SPARK-2058
 URL: https://issues.apache.org/jira/browse/SPARK-2058
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.0, 1.0.1, 1.1.0
Reporter: Eugen Cepoi
Assignee: Eugen Cepoi
Priority: Critical
 Fix For: 1.1.1, 1.2.0


 When the user defines SPARK_CONF_DIR, I think Spark should use all the configs 
 available there, not only spark-env.
 This involves changing SparkSubmitArguments to first read from 
 SPARK_CONF_DIR, and updating the scripts to add SPARK_CONF_DIR to the 
 computed classpath for configs such as log4j, metrics, etc.
 I have already prepared a PR for this. 






[jira] [Updated] (SPARK-2058) SPARK_CONF_DIR should override all present configs

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2058:
-
Assignee: Eugen Cepoi

 SPARK_CONF_DIR should override all present configs
 --

 Key: SPARK-2058
 URL: https://issues.apache.org/jira/browse/SPARK-2058
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.0, 1.0.1, 1.1.0
Reporter: Eugen Cepoi
Assignee: Eugen Cepoi
Priority: Critical
 Fix For: 1.1.1, 1.2.0


 When the user defines SPARK_CONF_DIR, I think Spark should use all the configs 
 available there, not only spark-env.
 This involves changing SparkSubmitArguments to first read from 
 SPARK_CONF_DIR, and updating the scripts to add SPARK_CONF_DIR to the 
 computed classpath for configs such as log4j, metrics, etc.
 I have already prepared a PR for this. 






[jira] [Commented] (SPARK-3706) Cannot run IPython REPL with IPYTHON set to 1 and PYSPARK_PYTHON unset

2014-10-03 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158200#comment-14158200
 ] 

Josh Rosen commented on SPARK-3706:
---

This introduced a problem, since after this patch we now use IPython on the 
workers; see SPARK-3772 for more details.

 Cannot run IPython REPL with IPYTHON set to 1 and PYSPARK_PYTHON unset
 

 Key: SPARK-3706
 URL: https://issues.apache.org/jira/browse/SPARK-3706
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0
Reporter: cocoatomo
  Labels: pyspark
 Fix For: 1.2.0


 h3. Problem
 The section Using the Shell in the Spark Programming Guide 
 (https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell) 
 says that we can run the pyspark REPL through IPython.
 But the following command does not run IPython but the default Python executable.
 {quote}
 $ IPYTHON=1 ./bin/pyspark
 Python 2.7.8 (default, Jul  2 2014, 10:14:46) 
 ...
 {quote}
 The spark/bin/pyspark script at commit 
 b235e013638685758885842dc3268e9800af3678 decides which executable and options 
 it uses in the following way:
 # if PYSPARK_PYTHON unset
 #* → defaults to python
 # if IPYTHON_OPTS set
 #* → set IPYTHON to 1
 # some python script passed to ./bin/pyspark → run it with ./bin/spark-submit
 #* out of this issue's scope
 # if IPYTHON set as 1
 #* → execute $PYSPARK_PYTHON (default: ipython) with arguments $IPYTHON_OPTS
 #* otherwise execute $PYSPARK_PYTHON
 Therefore, when PYSPARK_PYTHON is unset, python is executed even though 
 IPYTHON is 1.
 In other words, when PYSPARK_PYTHON is unset, IPYTHON_OPTS and IPYTHON have no 
 effect on deciding which command to use.
 ||PYSPARK_PYTHON||IPYTHON_OPTS||IPYTHON||resulting command||expected command||
 |(unset → defaults to python)|(unset)|(unset)|python|(same)|
 |(unset → defaults to python)|(unset)|1|python|ipython|
 |(unset → defaults to python)|an_option|(unset → set to 1)|python 
 an_option|ipython an_option|
 |(unset → defaults to python)|an_option|1|python an_option|ipython an_option|
 |ipython|(unset)|(unset)|ipython|(same)|
 |ipython|(unset)|1|ipython|(same)|
 |ipython|an_option|(unset → set to 1)|ipython an_option|(same)|
 |ipython|an_option|1|ipython an_option|(same)|
 h3. Suggestion
 The pyspark script should first determine whether the user wants to run 
 IPython or another executable.
 # if IPYTHON_OPTS set
 #* set IPYTHON 1
 # if IPYTHON has a value 1
 #* PYSPARK_PYTHON defaults to ipython if not set
 # PYSPARK_PYTHON defaults to python if not set
 See the pull request for more detailed modification.






[jira] [Commented] (SPARK-2058) SPARK_CONF_DIR should override all present configs

2014-10-03 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158197#comment-14158197
 ] 

Andrew Or commented on SPARK-2058:
--

To give a quick update, this change has not made it to any releases yet. It 
will be in the future releases 1.1.1 and 1.2.0, however.

 SPARK_CONF_DIR should override all present configs
 --

 Key: SPARK-2058
 URL: https://issues.apache.org/jira/browse/SPARK-2058
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.0, 1.0.1, 1.1.0
Reporter: Eugen Cepoi
Assignee: Eugen Cepoi
Priority: Critical
 Fix For: 1.1.1, 1.2.0


 When the user defines SPARK_CONF_DIR, I think Spark should use all the configs 
 available there, not only spark-env.
 This involves changing SparkSubmitArguments to first read from 
 SPARK_CONF_DIR, and updating the scripts to add SPARK_CONF_DIR to the 
 computed classpath for configs such as log4j, metrics, etc.
 I have already prepared a PR for this. 






[jira] [Created] (SPARK-3786) Speedup tests of PySpark

2014-10-03 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3786:
-

 Summary: Speedup tests of PySpark
 Key: SPARK-3786
 URL: https://issues.apache.org/jira/browse/SPARK-3786
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu


It takes about 20 minutes (about 25% of the total test time) to run all the 
tests of PySpark.

The slowest ones are tests.py and streaming/tests.py: they create a new JVM and 
SparkContext for each test case. It would be faster to reuse the SparkContext 
for most of the cases.








[jira] [Updated] (SPARK-3696) Do not override user-defined conf_dir in spark-config.sh

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3696:
-
Assignee: WangTaoTheTonic

 Do not override user-defined conf_dir in spark-config.sh
 

 Key: SPARK-3696
 URL: https://issues.apache.org/jira/browse/SPARK-3696
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: WangTaoTheTonic
Assignee: WangTaoTheTonic
Priority: Minor
 Fix For: 1.1.1, 1.2.0


 Now many scripts use spark-config.sh, in which SPARK_CONF_DIR is directly 
 assigned SPARK_HOME/conf. This is inconvenient for those who define 
 SPARK_CONF_DIR in their environment.






[jira] [Closed] (SPARK-3696) Do not override user-defined conf_dir in spark-config.sh

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3696.

  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Target Version/s: 1.1.1, 1.2.0

 Do not override user-defined conf_dir in spark-config.sh
 

 Key: SPARK-3696
 URL: https://issues.apache.org/jira/browse/SPARK-3696
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: WangTaoTheTonic
Assignee: WangTaoTheTonic
Priority: Minor
 Fix For: 1.1.1, 1.2.0


 Now many scripts use spark-config.sh, in which SPARK_CONF_DIR is directly 
 assigned SPARK_HOME/conf. This is inconvenient for those who define 
 SPARK_CONF_DIR in their environment.






[jira] [Updated] (SPARK-1655) In naive Bayes, store conditional probabilities distributively.

2014-10-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1655:
-
Assignee: Aaron Staple

 In naive Bayes, store conditional probabilities distributively.
 ---

 Key: SPARK-1655
 URL: https://issues.apache.org/jira/browse/SPARK-1655
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 In the current implementation, we collect all conditional probabilities to 
 the driver node. When there are many labels and many features, this puts a 
 heavy load on the driver. For scalability, we should provide a way to store 
 conditional probabilities distributively.






[jira] [Closed] (SPARK-2778) Add unit tests for Yarn integration

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-2778.

  Resolution: Fixed
Target Version/s: 1.2.0

 Add unit tests for Yarn integration
 ---

 Key: SPARK-2778
 URL: https://issues.apache.org/jira/browse/SPARK-2778
 Project: Spark
  Issue Type: Test
  Components: YARN
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Fix For: 1.2.0

 Attachments: yarn-logs.txt


 It would be nice to add some Yarn integration tests to the unit tests in 
 Spark; Yarn provides a MiniYARNCluster class that can be used to spawn a 
 cluster locally.
 UPDATE: These tests are causing exceptions in our nightly build:
 {code}
 sbt.ForkMain$ForkError: Call From test04.amplab/10.123.1.2 to test04.amplab:0 
 failed on connection exception: java.net.ConnectException: Connection 
 refused; For more details see:  
 http://wiki.apache.org/hadoop/ConnectionRefused
   at sun.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
   at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
   at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
   at 
 org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
   at 
 org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
   at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:35)
   at 
 org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:68)
   at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:61)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:310)
   at 
 org.apache.spark.deploy.yarn.YarnClusterDriver$.main(YarnClusterSuite.scala:140)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply$mcV$sp(YarnClusterSuite.scala:91)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
   at 

[jira] [Closed] (SPARK-3710) YARN integration test is flaky

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3710.

   Resolution: Fixed
Fix Version/s: 1.2.0

 YARN integration test is flaky
 --

 Key: SPARK-3710
 URL: https://issues.apache.org/jira/browse/SPARK-3710
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Patrick Wendell
Assignee: Marcelo Vanzin
Priority: Blocker
 Fix For: 1.2.0


 This has been regularly failing the master build:
 Example failure: 
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/738/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink
 One thing to look at is whether the YARN mini cluster makes assumptions about 
 being able to bind to specific ports.
 {code}
 sbt.ForkMain$ForkError: Call From test04.amplab/10.123.1.2 to test04.amplab:0 
 failed on connection exception: java.net.ConnectException: Connection 
 refused; For more details see:  
 http://wiki.apache.org/hadoop/ConnectionRefused
   at sun.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
   at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
   at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
   at 
 org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
   at 
 org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
   at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:35)
   at 
 org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:68)
   at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:61)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:310)
   at 
 org.apache.spark.deploy.yarn.YarnClusterDriver$.main(YarnClusterSuite.scala:140)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply$mcV$sp(YarnClusterSuite.scala:91)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
 

[jira] [Commented] (SPARK-3710) YARN integration test is flaky

2014-10-03 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158326#comment-14158326
 ] 

Andrew Or commented on SPARK-3710:
--

https://github.com/apache/spark/pull/2605

 YARN integration test is flaky
 --

 Key: SPARK-3710
 URL: https://issues.apache.org/jira/browse/SPARK-3710
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Patrick Wendell
Assignee: Marcelo Vanzin
Priority: Blocker
 Fix For: 1.2.0


 This has been regularly failing the master build:
 Example failure: 
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/738/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink
 One thing to look at is whether the YARN mini cluster makes assumptions about 
 being able to bind to specific ports.
 {code}
 sbt.ForkMain$ForkError: Call From test04.amplab/10.123.1.2 to test04.amplab:0 
 failed on connection exception: java.net.ConnectException: Connection 
 refused; For more details see:  
 http://wiki.apache.org/hadoop/ConnectionRefused
   at sun.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
   at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
   at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
   at 
 org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
   at 
 org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
   at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:35)
   at 
 org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:68)
   at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:61)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:310)
   at 
 org.apache.spark.deploy.yarn.YarnClusterDriver$.main(YarnClusterSuite.scala:140)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply$mcV$sp(YarnClusterSuite.scala:91)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
   at 
 

[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU

2014-10-03 Thread Reza Farivar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158363#comment-14158363
 ] 

Reza Farivar commented on SPARK-3785:
-

Olivier Chafik, who wrote JavaCL (which you mentioned in your description), also 
has a beta-stage ScalaCL package on GitHub:
https://github.com/ochafik/ScalaCL

There was also another project trying to get OpenCL into Java: Aparapi. The neat 
thing about Aparapi is that it doesn't require you to write OpenCL kernels in 
C; it translates Java loops into OpenCL code at runtime. It seems the ScalaCL 
project has similar goals for Scala. 

 Support off-loading computations to a GPU
 -

 Key: SPARK-3785
 URL: https://issues.apache.org/jira/browse/SPARK-3785
 Project: Spark
  Issue Type: Brainstorming
  Components: MLlib
Reporter: Thomas Darimont
Priority: Minor

 Are there any plans to add support for off-loading computations to the 
 GPU, e.g. via an OpenCL binding? 
 http://www.jocl.org/
 https://code.google.com/p/javacl/
 http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL






[jira] [Commented] (SPARK-3710) YARN integration test is flaky

2014-10-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158364#comment-14158364
 ] 

Marcelo Vanzin commented on SPARK-3710:
---

Hmm. For some reason the e-mail for this bug ended up in my spam box. Anyway, 
the fix was also tracked in SPARK-2778.

 YARN integration test is flaky
 --

 Key: SPARK-3710
 URL: https://issues.apache.org/jira/browse/SPARK-3710
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Patrick Wendell
Assignee: Marcelo Vanzin
Priority: Blocker
 Fix For: 1.2.0


 This has been regularly failing the master build:
 Example failure: 
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/738/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink
 One thing to look at is whether the YARN mini cluster makes assumptions about 
 being able to bind to specific ports.
 {code}
 sbt.ForkMain$ForkError: Call From test04.amplab/10.123.1.2 to test04.amplab:0 
 failed on connection exception: java.net.ConnectException: Connection 
 refused; For more details see:  
 http://wiki.apache.org/hadoop/ConnectionRefused
   at sun.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
   at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
   at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
   at 
 org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
   at 
 org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
   at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:35)
   at 
 org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:68)
   at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:61)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:310)
   at 
 org.apache.spark.deploy.yarn.YarnClusterDriver$.main(YarnClusterSuite.scala:140)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply$mcV$sp(YarnClusterSuite.scala:91)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
   at 

[jira] [Commented] (SPARK-3710) YARN integration test is flaky

2014-10-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158376#comment-14158376
 ] 

Marcelo Vanzin commented on SPARK-3710:
---

I filed a Yarn bug (YARN-2642), although we can't get rid of the workaround 
since we need to support existing versions of Yarn.

 YARN integration test is flaky
 --

 Key: SPARK-3710
 URL: https://issues.apache.org/jira/browse/SPARK-3710
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Patrick Wendell
Assignee: Marcelo Vanzin
Priority: Blocker
 Fix For: 1.2.0


 This has been regularly failing the master build:
 Example failure: 
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/738/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink
 One thing to look at is whether the YARN mini cluster makes assumptions about 
 being able to bind to specific ports.
 {code}
 sbt.ForkMain$ForkError: Call From test04.amplab/10.123.1.2 to test04.amplab:0 
 failed on connection exception: java.net.ConnectException: Connection 
 refused; For more details see:  
 http://wiki.apache.org/hadoop/ConnectionRefused
   at sun.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
   at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
   at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
   at 
 org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
   at 
 org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
   at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:35)
   at 
 org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:68)
   at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:61)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
  at org.apache.spark.SparkContext.&lt;init&gt;(SparkContext.scala:310)
   at 
 org.apache.spark.deploy.yarn.YarnClusterDriver$.main(YarnClusterSuite.scala:140)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply$mcV$sp(YarnClusterSuite.scala:91)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
   

[jira] [Resolved] (SPARK-3007) Add Dynamic Partition support to Spark Sql hive

2014-10-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3007.

Resolution: Fixed

Okay, this was merged again:

https://github.com/apache/spark/pull/2616

 Add Dynamic Partition support  to  Spark Sql hive
 ---

 Key: SPARK-3007
 URL: https://issues.apache.org/jira/browse/SPARK-3007
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: baishuo
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3212) Improve the clarity of caching semantics

2014-10-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3212.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2501
[https://github.com/apache/spark/pull/2501]

 Improve the clarity of caching semantics
 

 Key: SPARK-3212
 URL: https://issues.apache.org/jira/browse/SPARK-3212
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.2.0


 Right now there are a bunch of different ways to cache tables in Spark SQL. 
 For example:
  - tweets.cache()
  - sql("SELECT * FROM tweets").cache()
  - table("tweets").cache()
  - tweets.cache().registerTempTable("tweets")
  - sql("CACHE TABLE tweets")
  - cacheTable("tweets")
 Each of the above commands has subtly different semantics, leading to a very 
 confusing user experience.  Ideally, we would stop doing caching based on 
 simple table names and instead have a phase of optimization that does 
 intelligent matching of query plans with available cached data.
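 For concreteness, a minimal Scala sketch of the variants listed above, assuming a SQLContext named sqlContext and a registered tweets table (the source file and its path are assumptions for the example):
{code}
// Assumed setup for the example: a SQLContext and a registered "tweets" table.
val tweets = sqlContext.jsonFile("tweets.json")   // the path is an assumption
tweets.registerTempTable("tweets")

// The caching variants listed above; each goes through a different code path.
tweets.cache()                                    // cache the SchemaRDD directly
sqlContext.sql("SELECT * FROM tweets").cache()    // cache a query result
sqlContext.table("tweets").cache()                // cache via a catalog lookup
tweets.cache().registerTempTable("tweets")        // cache, then (re)register the name
sqlContext.sql("CACHE TABLE tweets")              // the SQL command form
sqlContext.cacheTable("tweets")                   // cache by table name
{code}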



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1379) Calling .cache() on a SchemaRDD should do something more efficient than caching the individual row objects.

2014-10-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-1379.
-
Resolution: Fixed

 Calling .cache() on a SchemaRDD should do something more efficient than 
 caching the individual row objects.
 ---

 Key: SPARK-1379
 URL: https://issues.apache.org/jira/browse/SPARK-1379
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust

 Since rows aren't black boxes we could use InMemoryColumnarTableScan.  This 
 would significantly reduce GC pressure on the workers.
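 As a rough illustration (not from the ticket itself) of the contrast being described, assuming a sqlContext and a registered tweets table:
{code}
// Illustrative only: two ways of caching the same data in this era of the API.
val tweets = sqlContext.sql("SELECT * FROM tweets")
tweets.cache()                    // stores the individual Row objects on the JVM heap
sqlContext.cacheTable("tweets")   // uses the in-memory columnar representation instead
{code}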



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3641) Correctly populate SparkPlan.currentContext

2014-10-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3641.
-
Resolution: Fixed
  Assignee: Michael Armbrust  (was: Yin Huai)

 Correctly populate SparkPlan.currentContext
 ---

 Key: SPARK-3641
 URL: https://issues.apache.org/jira/browse/SPARK-3641
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yin Huai
Assignee: Michael Armbrust
Priority: Critical

 After creating a new SQLContext, we need to populate SparkPlan.currentContext 
 before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD 
 populates SparkPlan.currentContext. SQLContext.applySchema is missing this 
 call, so we can hit an NPE as described in 
 http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table.
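 A hypothetical reproduction sketch of that code path (the schema, data, and table name are invented for the example, and sc is assumed to be an existing SparkContext, e.g. from spark-shell):
{code}
import org.apache.spark.sql._

// Hypothetical reproduction sketch: applySchema builds a SchemaRDD without
// populating SparkPlan.currentContext, unlike createSchemaRDD.
val sqlContext = new SQLContext(sc)
val schema = StructType(Seq(StructField("id", IntegerType, nullable = false)))
val rowRDD = sc.parallelize(1 to 10).map(i => Row(i))
val people = sqlContext.applySchema(rowRDD, schema)
people.registerTempTable("t")
sqlContext.cacheTable("t")

// Joining two cached tables built this way is the scenario from the linked report.
sqlContext.sql("SELECT a.id FROM t a JOIN t b ON a.id = b.id").collect()
{code}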



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1671) Cached tables should follow write-through policy

2014-10-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-1671.
-
Resolution: Fixed

I'm gonna mark this as resolved now that we do at least invalidate the cache 
when writing through.  We can create a follow up JIRA for partial invalidation 
if we want.

 Cached tables should follow write-through policy
 

 Key: SPARK-1671
 URL: https://issues.apache.org/jira/browse/SPARK-1671
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Cheng Lian
Assignee: Michael Armbrust
  Labels: cache, column

 Writing (insert / load) to a cached table causes cache inconsistency, and 
 users have to unpersist and cache the whole table again.
 The write-through policy may be implemented with {{RDD.union}}.
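 A rough sketch of the union idea as a user-level workaround today (the events table and the newRows SchemaRDD are assumptions for the example):
{code}
// Rough sketch only: approximating write-through by unioning newly written rows
// with the already-cached data and re-registering the name.
val cached = sqlContext.table("events")
cached.cache()
val appended = cached.unionAll(newRows)   // newRows: a SchemaRDD with the same schema
appended.registerTempTable("events")
appended.cache()                          // the old cached data still has to be unpersisted
{code}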



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2973) Add a way to show tables without executing a job

2014-10-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2973:

Assignee: Cheng Lian  (was: Michael Armbrust)

 Add a way to show tables without executing a job
 

 Key: SPARK-2973
 URL: https://issues.apache.org/jira/browse/SPARK-2973
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Aaron Davidson
Assignee: Cheng Lian
Priority: Critical
 Fix For: 1.2.0


 Right now, sql("show tables").collect() will start a Spark job which shows up 
 in the UI. There should be a way to get these without this step.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3535.

   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1

 Spark on Mesos not correctly setting heap overhead
 --

 Key: SPARK-3535
 URL: https://issues.apache.org/jira/browse/SPARK-3535
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews
Assignee: Brenden Matthews
 Fix For: 1.1.1, 1.2.0


 Spark on Mesos does not account for any memory overhead.  The result is that 
 tasks are OOM killed nearly 95% of the time.
 Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
 executor memory for JVM overhead.
 For example, see: 
 https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63
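 A rough illustration of the 15-25% rule of thumb above; the 15% figure, the example executor size, and the 384 MB floor (similar to Spark's YARN overhead default) are all assumptions for the sketch:
{code}
// Illustrative sketch only: carving out a slice of the executor memory for JVM
// overhead before sizing the heap, per the 15-25% suggestion above.
val executorMemoryMB = 8192                 // assumed executor memory for the example
val overheadFraction = 0.15                 // assumed value from the suggested 15-25% range
val overheadMB = math.max(384, (executorMemoryMB * overheadFraction).toInt)
val heapMB = executorMemoryMB - overheadMB  // what would actually back the JVM heap (-Xmx)
{code}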



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3535:
-
Assignee: Brenden Matthews

 Spark on Mesos not correctly setting heap overhead
 --

 Key: SPARK-3535
 URL: https://issues.apache.org/jira/browse/SPARK-3535
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.1.0
Reporter: Brenden Matthews
Assignee: Brenden Matthews

 Spark on Mesos does not account for any memory overhead.  The result is that 
 tasks are OOM killed nearly 95% of the time.
 Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
 executor memory for JVM overhead.
 For example, see: 
 https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3775) Not suitable error message in spark-shell.cmd

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3775:
-
Affects Version/s: 1.1.0

 Not suitable error message in spark-shell.cmd
 -

 Key: SPARK-3775
 URL: https://issues.apache.org/jira/browse/SPARK-3775
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.1.0
Reporter: Masayoshi TSUZUKI
Priority: Trivial

 In a Windows environment, when we execute bin\spark-shell.cmd before building 
 Spark, we get an error message like this.
 {quote}
 Failed to find Spark assembly JAR.
 You need to build Spark with sbt\sbt assembly before running this program.
 {quote}
 But this message is not suitable because ...
 * Maven is also available to build Spark, and it works in Windows without 
 cygwin now ([SPARK-3061]).
 * The equivalent error message of the Linux version (bin/spark-shell) doesn't 
 mention how to build.
 bq. You need to build Spark before running this program.
 * sbt\sbt can't be executed in Windows without cygwin because it's a bash 
 script.
 So this message should be modified to match the Linux version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3775) Not suitable error message in spark-shell.cmd

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3775.

  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Assignee: Masayoshi TSUZUKI
Target Version/s: 1.1.1, 1.2.0

 Not suitable error message in spark-shell.cmd
 -

 Key: SPARK-3775
 URL: https://issues.apache.org/jira/browse/SPARK-3775
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.1.0
Reporter: Masayoshi TSUZUKI
Assignee: Masayoshi TSUZUKI
Priority: Trivial
 Fix For: 1.1.1, 1.2.0


 In a Windows environment, when we execute bin\spark-shell.cmd before building 
 Spark, we get an error message like this.
 {quote}
 Failed to find Spark assembly JAR.
 You need to build Spark with sbt\sbt assembly before running this program.
 {quote}
 But this message is not suitable because ...
 * Maven is also available to build Spark, and it works in Windows without 
 cygwin now ([SPARK-3061]).
 * The equivalent error message of the Linux version (bin/spark-shell) doesn't 
 mention how to build.
 bq. You need to build Spark before running this program.
 * sbt\sbt can't be executed in Windows without cygwin because it's a bash 
 script.
 So this message should be modified to match the Linux version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3774) typo comment in bin/utils.sh

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3774.

  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Assignee: Masayoshi TSUZUKI
Target Version/s: 1.1.1, 1.2.0

 typo comment in bin/utils.sh
 

 Key: SPARK-3774
 URL: https://issues.apache.org/jira/browse/SPARK-3774
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Shell
Affects Versions: 1.1.0
Reporter: Masayoshi TSUZUKI
Assignee: Masayoshi TSUZUKI
Priority: Trivial
 Fix For: 1.1.1, 1.2.0


 typo comment in bin/utils.sh
 {code}
 # Gather all all spark-submit options into SUBMISSION_OPTS
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3606:
-
Fix Version/s: 1.2.0

 Spark-on-Yarn AmIpFilter does not work with Yarn HA.
 

 Key: SPARK-3606
 URL: https://issues.apache.org/jira/browse/SPARK-3606
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Fix For: 1.2.0


 The current IP filter only considers one of the RMs in an HA setup. If the 
 active RM is not the configured one, you get a connection refused error 
 when clicking on the Spark AM links in the RM UI.
 Similar to YARN-1811, but for Spark.
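 An illustrative sketch (not the actual patch) of the direction a fix could take: collect the web address of every RM listed for HA instead of only the single configured one, using the standard Hadoop RM HA configuration keys:
{code}
import org.apache.hadoop.conf.Configuration

// Illustrative sketch only: gather the web app address of every RM listed for HA
// so the AmIpFilter proxy host list can include all of them.
def allRmWebAppAddresses(conf: Configuration): Seq[String] = {
  val rmIds = Option(conf.get("yarn.resourcemanager.ha.rm-ids"))
    .map(_.split(",").map(_.trim).toSeq)
    .getOrElse(Seq.empty)
  if (rmIds.isEmpty) {
    Seq(conf.get("yarn.resourcemanager.webapp.address"))
  } else {
    rmIds.map(id => conf.get(s"yarn.resourcemanager.webapp.address.$id"))
  }
}
{code}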



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3606:
-
 Target Version/s: 1.1.1, 1.2.0
Affects Version/s: (was: 1.2.0)

 Spark-on-Yarn AmIpFilter does not work with Yarn HA.
 

 Key: SPARK-3606
 URL: https://issues.apache.org/jira/browse/SPARK-3606
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Fix For: 1.2.0


 The current IP filter only considers one of the RMs in an HA setup. If the 
 active RM is not the configured one, you get a connection refused error 
 when clicking on the Spark AM links in the RM UI.
 Similar to YARN-1811, but for Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3606:
-
Affects Version/s: 1.2.0

 Spark-on-Yarn AmIpFilter does not work with Yarn HA.
 

 Key: SPARK-3606
 URL: https://issues.apache.org/jira/browse/SPARK-3606
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Fix For: 1.2.0


 The current IP filter only considers one of the RMs in an HA setup. If the 
 active RM is not the configured one, you get a connection refused error 
 when clicking on the Spark AM links in the RM UI.
 Similar to YARN-1811, but for Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3763) The example of building with sbt should be sbt assembly instead of sbt compile

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3763.

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Kousuke Saruta

 The example of building with sbt should be sbt assembly instead of sbt 
 compile
 --

 Key: SPARK-3763
 URL: https://issues.apache.org/jira/browse/SPARK-3763
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Trivial
 Fix For: 1.2.0


 In building-spark.md, there are some examples for making an assembled package 
 with Maven, but the example given for building with sbt only covers compiling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-10-03 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-1860.
---
Resolution: Fixed

Fixed by mccheah in https://github.com/apache/spark/pull/2609

 Standalone Worker cleanup should not clean up running executors
 ---

 Key: SPARK-1860
 URL: https://issues.apache.org/jira/browse/SPARK-1860
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Priority: Blocker

 The default values of the standalone worker cleanup code clean up all 
 application data every 7 days. This includes jars that were added to any 
 executors that happen to be running for longer than 7 days, hitting streaming 
 jobs especially hard.
 Executor's log/data folders should not be cleaned up if they're still 
 running. Until then, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3786) Speedup tests of PySpark

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158552#comment-14158552
 ] 

Apache Spark commented on SPARK-3786:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2646

 Speedup tests of PySpark
 

 Key: SPARK-3786
 URL: https://issues.apache.org/jira/browse/SPARK-3786
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu

 It takes about 20 minutes (about 25% of all the tests) to run all the tests 
 of PySpark.
 The slowest ones are tests.py and streaming/tests.py: they create a new JVM and 
 SparkContext for each test case, and it would be faster to reuse the SparkContext 
 for most cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-10-03 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158571#comment-14158571
 ] 

Sandy Ryza commented on SPARK-3561:
---

I think there may be somewhat of a misunderstanding about the relationship 
between Spark and YARN.  YARN is not an execution environment, but a cluster 
resource manager that has the ability to start processes on behalf of execution 
engines like Spark.  Spark already supports YARN as a cluster resource manager, 
but YARN doesn't provide its own execution engine.  YARN doesn't provide a 
stateless shuffle (although execution engines built atop it like MR and Tez 
do). 

If I understand, the broader intent is to decouple the Spark API from the 
execution engine it runs on top of.  Changing the title to reflect this.  That, 
the Spark API is currently very tightly integrated with its execution engine, 
and frankly, decoupling the two so that Spark would be able to run on top of 
execution engines with similar properties seems more trouble than its worth.

 Native Hadoop/YARN integration for batch/ETL workloads
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define 4 only operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 form DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well
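 A rough Scala sketch of what such a trait could look like, derived only from the four operations listed above; every signature here is a simplified assumption mirroring the corresponding SparkContext methods, since the real ones are in the attached design doc:
{code}
import scala.reflect.ClassTag

import org.apache.hadoop.mapred.InputFormat
import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Sketch only; signatures are simplified guesses mirroring SparkContext's methods.
trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K], valueClass: Class[V], minPartitions: Int): RDD[(K, V)]

  def newAPIHadoopFile[K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K, V]](
      sc: SparkContext, path: String, fClass: Class[F],
      kClass: Class[K], vClass: Class[V]): RDD[(K, V)]

  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]

  def runJob[T, U: ClassTag](sc: SparkContext, rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U, partitions: Seq[Int],
      allowLocal: Boolean, resultHandler: (Int, U) => Unit): Unit
}
{code}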



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3787) Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version

2014-10-03 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3787:
-

 Summary: Assembly jar name is wrong when we build with sbt 
omitting -Dhadoop.version
 Key: SPARK-3787
 URL: https://issues.apache.org/jira/browse/SPARK-3787
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.2.0
Reporter: Kousuke Saruta


When we build with sbt with profile for hadoop and without property for hadoop 
version like:

{code}
sbt/sbt -Phadoop-2.2 assembly
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3787) Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version

2014-10-03 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3787:
--
Description: 
When we build with sbt with profile for hadoop and without property for hadoop 
version like:

{code}
sbt/sbt -Phadoop-2.2 assembly
{code}

jar name is always used default version (1.0.4).

When we build with maven with same condition for sbt, default version for each 
profile.
For instance, if we  build like:

{code}
mvn -Phadoop-2.2 package
{code}

jar name is used hadoop2.2.0 as a default version of hadoop-2.2.

  was:
When we build with sbt with profile for hadoop and without property for hadoop 
version like:

{code}
sbt/sbt -Phadoop-2.2 assembly
{code}


 Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version
 ---

 Key: SPARK-3787
 URL: https://issues.apache.org/jira/browse/SPARK-3787
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.2.0
Reporter: Kousuke Saruta

 When we build with sbt with profile for hadoop and without property for 
 hadoop version like:
 {code}
 sbt/sbt -Phadoop-2.2 assembly
 {code}
 jar name is always used default version (1.0.4).
 When we build with maven with same condition for sbt, default version for 
 each profile.
 For instance, if we  build like:
 {code}
 mvn -Phadoop-2.2 package
 {code}
 jar name is used hadoop2.2.0 as a default version of hadoop-2.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3787) Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158602#comment-14158602
 ] 

Apache Spark commented on SPARK-3787:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2647

 Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version
 ---

 Key: SPARK-3787
 URL: https://issues.apache.org/jira/browse/SPARK-3787
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.2.0
Reporter: Kousuke Saruta

 When we build with sbt with profile for hadoop and without property for 
 hadoop version like:
 {code}
 sbt/sbt -Phadoop-2.2 assembly
 {code}
 jar name is always used default version (1.0.4).
 When we build with maven with same condition for sbt, default version for 
 each profile.
 For instance, if we  build like:
 {code}
 mvn -Phadoop-2.2 package
 {code}
 jar name is used hadoop2.2.0 as a default version of hadoop-2.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3787) Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version

2014-10-03 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3787:
--
Description: 
When we build with sbt with profile for hadoop and without property for hadoop 
version like:

{code}
sbt/sbt -Phadoop-2.2 assembly
{code}

jar name is always used default version (1.0.4).

When we build with maven with same condition for sbt, default version for each 
profile is used.
For instance, if we  build like:

{code}
mvn -Phadoop-2.2 package
{code}

jar name is used hadoop2.2.0 as a default version of hadoop-2.2.

  was:
When we build with sbt with profile for hadoop and without property for hadoop 
version like:

{code}
sbt/sbt -Phadoop-2.2 assembly
{code}

jar name is always used default version (1.0.4).

When we build with maven with same condition for sbt, default version for each 
profile.
For instance, if we  build like:

{code}
mvn -Phadoop-2.2 package
{code}

jar name is used hadoop2.2.0 as a default version of hadoop-2.2.


 Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version
 ---

 Key: SPARK-3787
 URL: https://issues.apache.org/jira/browse/SPARK-3787
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.2.0
Reporter: Kousuke Saruta

 When we build with sbt with profile for hadoop and without property for 
 hadoop version like:
 {code}
 sbt/sbt -Phadoop-2.2 assembly
 {code}
 jar name is always used default version (1.0.4).
 When we build with maven with same condition for sbt, default version for 
 each profile is used.
 For instance, if we  build like:
 {code}
 mvn -Phadoop-2.2 package
 {code}
 jar name is used hadoop2.2.0 as a default version of hadoop-2.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158614#comment-14158614
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

[~sandyr]

Indeed YARN is a _resource manager_ that supports multiple execution 
environments by helping with resource allocation and management. On the other 
hand, Spark, Tez and many other (custom) execution environments are currently 
run on YARN. (NOTE: Custom execution environments on YARN are becoming very 
common in large enterprises). Such decoupling will ensure that Spark can 
integrate with any and all (where applicable) in a pluggable and extensible 
fashion. 

 Decouple Spark's API from its execution engine
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define 4 only operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 form DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158614#comment-14158614
 ] 

Oleg Zhurakousky edited comment on SPARK-3561 at 10/3/14 10:34 PM:
---

[~sandyr]

Indeed YARN is a _resource manager_ that supports multiple execution 
environments by facilitating resource allocation and management. On the other 
hand, Spark, Tez and many other (custom) execution environments are currently 
run on YARN. (NOTE: Custom execution environments on YARN are becoming very 
common in large enterprises). Such decoupling will ensure that Spark can 
integrate with any and all (where applicable) in a pluggable and extensible 
fashion. 


was (Author: ozhurakousky):
[~sandyr]

Indeed YARN is a _resource manager_ that supports multiple execution 
environments by helping with resource allocation and management. On the other 
hand, Spark, Tez and many other (custom) execution environments are currently 
run on YARN. (NOTE: Custom execution environments on YARN are becoming very 
common in large enterprises). Such decoupling will ensure that Spark can 
integrate with any and all (where applicable) in a pluggable and extensible 
fashion. 

 Decouple Spark's API from its execution engine
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark provides integration with external resource-managers such as 
 Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
 current architecture of Spark-on-YARN can be enhanced to provide 
 significantly better utilization of cluster resources for large scale, batch 
 and/or ETL applications when run alongside other applications (Spark and 
 others) and services in YARN. 
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define 4 only operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 form DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3561:
--
Description: 
Currently Spark's API is tightly coupled with its backend execution engine.   
It could be useful to provide a point of pluggability between the two to allow 
Spark to run on other DAG execution engines with similar distributed memory 
abstractions.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@DeveloperAPI) not exposed to end users of Spark.
The trait will define 4 only operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. JobExecutionContext implementation will be accessed by 
SparkContext via master URL as 
execution-context:foo.bar.MyJobExecutionContext with default implementation 
containing the existing code from SparkContext, thus allowing current 
(corresponding) methods of SparkContext to delegate to such implementation. An 
integrator will now have an option to provide custom implementation of 
DefaultExecutionContext by either implementing it from scratch or extending 
form DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well

  was:
Currently Spark provides integration with external resource-managers such as 
Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current 
architecture of Spark-on-YARN can be enhanced to provide significantly better 
utilization of cluster resources for large scale, batch and/or ETL applications 
when run alongside other applications (Spark and others) and services in YARN. 

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@DeveloperAPI) not exposed to end users of Spark.
The trait will define 4 only operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. JobExecutionContext implementation will be accessed by 
SparkContext via master URL as 
execution-context:foo.bar.MyJobExecutionContext with default implementation 
containing the existing code from SparkContext, thus allowing current 
(corresponding) methods of SparkContext to delegate to such implementation. An 
integrator will now have an option to provide custom implementation of 
DefaultExecutionContext by either implementing it from scratch or extending 
form DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well


 Decouple Spark's API from its execution engine
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark's API is tightly coupled with its backend execution engine.   
 It could be useful to provide a point of pluggability between the two to 
 allow Spark to run on other DAG execution engines with similar distributed 
 memory abstractions.
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - a gateway and a delegate to Hadoop execution environment - as a non-public 
 api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define 4 only operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 form DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3561:
--
Description: 
Currently Spark's API is tightly coupled with its backend execution engine.   
It could be useful to provide a point of pluggability between the two to allow 
Spark to run on other DAG execution engines with similar distributed memory 
abstractions.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
The trait will define 4 only operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. JobExecutionContext implementation will be accessed by 
SparkContext via master URL as 
execution-context:foo.bar.MyJobExecutionContext with default implementation 
containing the existing code from SparkContext, thus allowing current 
(corresponding) methods of SparkContext to delegate to such implementation. An 
integrator will now have an option to provide custom implementation of 
DefaultExecutionContext by either implementing it from scratch or extending 
form DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well

  was:
Currently Spark's API is tightly coupled with its backend execution engine.   
It could be useful to provide a point of pluggability between the two to allow 
Spark to run on other DAG execution engines with similar distributed memory 
abstractions.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and a delegate to Hadoop execution environment - as a non-public api 
(@DeveloperAPI) not exposed to end users of Spark.
The trait will define 4 only operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding methods in current version of 
SparkContext. JobExecutionContext implementation will be accessed by 
SparkContext via master URL as 
execution-context:foo.bar.MyJobExecutionContext with default implementation 
containing the existing code from SparkContext, thus allowing current 
(corresponding) methods of SparkContext to delegate to such implementation. An 
integrator will now have an option to provide custom implementation of 
DefaultExecutionContext by either implementing it from scratch or extending 
form DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well


 Decouple Spark's API from its execution engine
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark's API is tightly coupled with its backend execution engine.   
 It could be useful to provide a point of pluggability between the two to 
 allow Spark to run on other DAG execution engines with similar distributed 
 memory abstractions.
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define 4 only operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 form DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158571#comment-14158571
 ] 

Sandy Ryza edited comment on SPARK-3561 at 10/3/14 11:00 PM:
-

I think there may be somewhat of a misunderstanding about the relationship 
between Spark and YARN.  YARN is not an execution environment, but a cluster 
resource manager that has the ability to start processes on behalf of execution 
engines like Spark.  Spark already supports YARN as a cluster resource manager, 
but YARN doesn't provide its own execution engine.  YARN doesn't provide a 
stateless shuffle (although execution engines built atop it like MR and Tez 
do). 

If I understand, the broader intent is to decouple the Spark API from the 
execution engine it runs on top of.  Changing the title to reflect this.  That 
said, the Spark API is currently very tightly integrated with its execution 
engine, and frankly, decoupling the two so that Spark would be able to run on 
top of execution engines with similar properties seems more trouble than its 
worth.


was (Author: sandyr):
I think there may be somewhat of a misunderstanding about the relationship 
between Spark and YARN.  YARN is not an execution environment, but a cluster 
resource manager that has the ability to start processes on behalf of execution 
engines like Spark.  Spark already supports YARN as a cluster resource manager, 
but YARN doesn't provide its own execution engine.  YARN doesn't provide a 
stateless shuffle (although execution engines built atop it like MR and Tez 
do). 

If I understand, the broader intent is to decouple the Spark API from the 
execution engine it runs on top of.  Changing the title to reflect this.  That, 
the Spark API is currently very tightly integrated with its execution engine, 
and frankly, decoupling the two so that Spark would be able to run on top of 
execution engines with similar properties seems more trouble than its worth.

 Decouple Spark's API from its execution engine
 --

 Key: SPARK-3561
 URL: https://issues.apache.org/jira/browse/SPARK-3561
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Oleg Zhurakousky
  Labels: features
 Fix For: 1.2.0

 Attachments: SPARK-3561.pdf


 Currently Spark's user-facing API is tightly coupled with its backend 
 execution engine.   It could be useful to provide a point of pluggability 
 between the two to allow Spark to run on other DAG execution engines with 
 similar distributed memory abstractions.
 Proposal:
 The proposed approach would introduce a pluggable JobExecutionContext (trait) 
 - as a non-public api (@DeveloperAPI) not exposed to end users of Spark.
 The trait will define 4 only operations:
 * hadoopFile
 * newAPIHadoopFile
 * broadcast
 * runJob
 Each method directly maps to the corresponding methods in current version of 
 SparkContext. JobExecutionContext implementation will be accessed by 
 SparkContext via master URL as 
 execution-context:foo.bar.MyJobExecutionContext with default implementation 
 containing the existing code from SparkContext, thus allowing current 
 (corresponding) methods of SparkContext to delegate to such implementation. 
 An integrator will now have an option to provide custom implementation of 
 DefaultExecutionContext by either implementing it from scratch or extending 
 form DefaultExecutionContext.
 Please see the attached design doc for more details.
 Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3788) Yarn dist cache code is not friendly to HDFS HA, Federation

2014-10-03 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-3788:
-

 Summary: Yarn dist cache code is not friendly to HDFS HA, 
Federation
 Key: SPARK-3788
 URL: https://issues.apache.org/jira/browse/SPARK-3788
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Marcelo Vanzin


There are two bugs here.

1. The {{compareFs()}} method in ClientBase considers the 'host' part of the 
URI to be an actual host. In the case of HA and Federation, that's a namespace 
name, which doesn't resolve to anything. So in those cases, {{compareFs()}} 
always says the file systems are different.

2. In {{prepareLocalResources()}}, when adding a file to the distributed cache, 
that is done with the common FileSystem object instantiated at the start of the 
method. In the case of Federation that doesn't work: the qualified URL's scheme 
may differ from the non-qualified one, so the FileSystem instance will not work.

Fixes are pretty trivial.
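An illustrative sketch (not the actual fix) for bug 1: compare file systems by URI scheme and authority instead of resolving the host, so an HA/Federation namespace name compares correctly even though it is not a resolvable hostname:

{code}
import java.net.URI

// Illustrative sketch only: a namespace-aware comparison. HA/Federation "hosts"
// are logical names (e.g. a nameservice id), so avoid DNS resolution entirely
// and compare scheme and authority case-insensitively.
def sameFileSystem(src: URI, dst: URI): Boolean = {
  def part(s: String): String = Option(s).getOrElse("")
  part(src.getScheme).equalsIgnoreCase(part(dst.getScheme)) &&
    part(src.getAuthority).equalsIgnoreCase(part(dst.getAuthority))
}
{code}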



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-10-03 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158646#comment-14158646
 ] 

Andrew Ash commented on SPARK-1860:
---

[~ilikerps] this ticket mentioned turning the cleanup code on by default once 
this ticket was fixed.  Should we change the defaults to have this on by 
default?

 Standalone Worker cleanup should not clean up running executors
 ---

 Key: SPARK-1860
 URL: https://issues.apache.org/jira/browse/SPARK-1860
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Priority: Blocker

 The default values of the standalone worker cleanup code clean up all 
 application data every 7 days. This includes jars that were added to any 
 executors that happen to be running for longer than 7 days, hitting streaming 
 jobs especially hard.
 Executor's log/data folders should not be cleaned up if they're still 
 running. Until then, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3788) Yarn dist cache code is not friendly to HDFS HA, Federation

2014-10-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158657#comment-14158657
 ] 

Marcelo Vanzin commented on SPARK-3788:
---

Note: 2 above only applies to branch-1.1. It was fixed in master by 
https://github.com/apache/spark/commit/c4022dd5.

 Yarn dist cache code is not friendly to HDFS HA, Federation
 ---

 Key: SPARK-3788
 URL: https://issues.apache.org/jira/browse/SPARK-3788
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin

 There are two bugs here.
 1. The {{compareFs()}} method in ClientBase considers the 'host' part of the 
 URI to be an actual host. In the case of HA and Federation, that's a 
 namespace name, which doesn't resolve to anything. So in those cases, 
 {{compareFs()}} always says the file systems are different.
 2. In {{prepareLocalResources()}}, when adding a file to the distributed 
 cache, that is done with the common FileSystem object instantiated at the 
 start of the method. In the case of Federation that doesn't work: the 
 qualified URL's scheme may differ from the non-qualified one, so the 
 FileSystem instance will not work.
 Fixes are pretty trivial.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3788) Yarn dist cache code is not friendly to HDFS HA, Federation

2014-10-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158710#comment-14158710
 ] 

Marcelo Vanzin commented on SPARK-3788:
---

Ah, 2 was fixed in branch-1.1 as part of SPARK-2577. So only issue 1 remains.

 Yarn dist cache code is not friendly to HDFS HA, Federation
 ---

 Key: SPARK-3788
 URL: https://issues.apache.org/jira/browse/SPARK-3788
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin

 There are two bugs here.
 1. The {{compareFs()}} method in ClientBase considers the 'host' part of the 
 URI to be an actual host. In the case of HA and Federation, that's a 
 namespace name, which doesn't resolve to anything. So in those cases, 
 {{compareFs()}} always says the file systems are different.
 2. In {{prepareLocalResources()}}, when adding a file to the distributed 
 cache, that is done with the common FileSystem object instantiated at the 
 start of the method. In the case of Federation that doesn't work: the 
 qualified URL's scheme may differ from the non-qualified one, so the 
 FileSystem instance will not work.
 Fixes are pretty trivial.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3789) Python bindings for GraphX

2014-10-03 Thread Ameet Talwalkar (JIRA)
Ameet Talwalkar created SPARK-3789:
--

 Summary: Python bindings for GraphX
 Key: SPARK-3789
 URL: https://issues.apache.org/jira/browse/SPARK-3789
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, PySpark
Reporter: Ameet Talwalkar






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3788) Yarn dist cache code is not friendly to HDFS HA, Federation

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158725#comment-14158725
 ] 

Apache Spark commented on SPARK-3788:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2650

 Yarn dist cache code is not friendly to HDFS HA, Federation
 ---

 Key: SPARK-3788
 URL: https://issues.apache.org/jira/browse/SPARK-3788
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin

 There are two bugs here.
 1. The {{compareFs()}} method in ClientBase considers the 'host' part of the 
 URI to be an actual host. In the case of HA and Federation, that's a 
 namespace name, which doesn't resolve to anything. So in those cases, 
 {{compareFs()}} always says the file systems are different.
 2. In {{prepareLocalResources()}}, when adding a file to the distributed 
 cache, that is done with the common FileSystem object instantiated at the 
 start of the method. In the case of Federation that doesn't work: the 
 qualified URL's scheme may differ from the non-qualified one, so the 
 FileSystem instance will not work.
 Fixes are pretty trivial.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3314) Script creation of AMIs

2014-10-03 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158777#comment-14158777
 ] 

Nicholas Chammas commented on SPARK-3314:
-

Hey [~holdenk], I think this is a great issue to work on. There was a related 
discussion on the dev list about using [Packer|http://www.packer.io/] to do 
this. I will be looking into this option and will report back here.
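
As a strawman, a Packer template for this would look roughly like the 
following -- every value here (region, source AMI, instance type, script name) 
is just a placeholder, nothing has been worked out yet:

{code}
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-00000000",
    "instance_type": "m3.large",
    "ssh_username": "ec2-user",
    "ami_name": "spark-ami-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "script": "setup_spark_ami.sh"
  }]
}
{code}

Running {{packer build}} on a template like this against a stock base AMI 
would give us a repeatable way to produce the project AMIs.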

 Script creation of AMIs
 ---

 Key: SPARK-3314
 URL: https://issues.apache.org/jira/browse/SPARK-3314
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: holdenk
Priority: Minor

 The current Spark AMIs have been built up over time. It would be useful to 
 provide a script that can bootstrap a Spark AMI from a fresh Amazon AMI. We 
 could also take the opportunity to rebase the project's AMIs on a newer base 
 image, so we don't have to wait as long for security updates to be installed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158865#comment-14158865
 ] 

Apache Spark commented on SPARK-3772:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2651

 RDD operation on IPython REPL failed with an illegal port number
 

 Key: SPARK-3772
 URL: https://issues.apache.org/jira/browse/SPARK-3772
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0
Reporter: cocoatomo
  Labels: pyspark

 To reproduce this issue, execute the following commands at commit 
 6e27cb630de69fa5acb510b4e2f6b980742b1957.
 {quote}
 $ PYSPARK_PYTHON=ipython ./bin/pyspark
 ...
 In [1]: file = sc.textFile('README.md')
 In [2]: file.first()
 ...
 14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded
 14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1
 14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at 
 PythonRDD.scala:334
 14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at 
 PythonRDD.scala:334) with 1 output partitions (allowLocal=true)
 14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at 
 PythonRDD.scala:334)
 14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List()
 14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List()
 14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD 
 at PythonRDD.scala:44), which has no missing parents
 14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with 
 curMem=57388, maxMem=278019440
 14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in 
 memory (estimated size 4.4 KB, free 265.1 MB)
 14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
 (PythonRDD[2] at RDD at PythonRDD.scala:44)
 14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
 localhost, PROCESS_LOCAL, 1207 bytes)
 14/10/03 08:50:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
 14/10/03 08:50:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
 java.lang.IllegalArgumentException: port out of range:1027423549
   at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
   at java.net.InetSocketAddress.<init>(InetSocketAddress.java:188)
   at java.net.Socket.<init>(Socket.java:244)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:100)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:71)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:744)
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3434) Distributed block matrix

2014-10-03 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158949#comment-14158949
 ] 

Reza Zadeh commented on SPARK-3434:
---

Any updates, Shivaraman?

 Distributed block matrix
 

 Key: SPARK-3434
 URL: https://issues.apache.org/jira/browse/SPARK-3434
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng

 This JIRA is for discussing distributed matrices stored as block 
 sub-matrices. The main challenge is choosing a partitioning scheme that 
 allows adding linear algebra operations in the future, e.g.:
 1. matrix multiplication
 2. matrix factorization (QR, LU, ...)
 Let's discuss the partitioning and storage and how they fit into the above 
 use cases.
 Questions:
 1. Should it be backed by a single RDD that contains all of the sub-matrices, 
 or by many RDDs, each containing only one sub-matrix?
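 To make question 1 concrete, here is a rough sketch of the single-RDD option; 
 the names and types below are only illustrative, not a proposed API:
 {code}
 // Sketch of a block matrix backed by one RDD keyed by (blockRow, blockCol).
 import org.apache.spark.Partitioner
 import org.apache.spark.mllib.linalg.Matrix
 import org.apache.spark.rdd.RDD

 // Maps a (blockRow, blockCol) key onto a rowBlocks x colBlocks grid of
 // partitions, so blocks that must meet for multiplication can be co-partitioned.
 class GridPartitioner(val rowBlocks: Int, val colBlocks: Int) extends Partitioner {
   override def numPartitions: Int = rowBlocks * colBlocks
   override def getPartition(key: Any): Int = key match {
     case (i: Int, j: Int) => i * colBlocks + j
     case _ => throw new IllegalArgumentException(s"Unexpected key: $key")
   }
   override def equals(other: Any): Boolean = other match {
     case g: GridPartitioner => g.rowBlocks == rowBlocks && g.colBlocks == colBlocks
     case _ => false
   }
   override def hashCode: Int = 31 * rowBlocks + colBlocks
 }

 // The whole matrix as a single RDD of ((blockRow, blockCol), local sub-matrix).
 class BlockMatrixSketch(
     val blocks: RDD[((Int, Int), Matrix)],
     val rowsPerBlock: Int,
     val colsPerBlock: Int)
 {code}
 With a single backing RDD, co-location of blocks is a property of one 
 partitioner; with one RDD per sub-matrix, the same bookkeeping would have to 
 be repeated across many small RDDs.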



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org