[jira] [Resolved] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2014-10-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-2630.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

> Input data size of CoalescedRDD is incorrect
> 
>
> Key: SPARK-2630
> URL: https://issues.apache.org/jira/browse/SPARK-2630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Davies Liu
>Assignee: Andrew Ash
>Priority: Blocker
> Fix For: 1.2.0
>
> Attachments: overflow.tiff
>
>
> Given one big file, such as text.4.3G, put it in one task, 
> {code}
> sc.textFile("text.4.3.G").coalesce(1).count()
> {code}
> In Web UI of Spark, you will see that the input size is 5.4M. 
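The attachment name (overflow.tiff) and the 5.4M reading are consistent with a 32-bit
overflow of the byte count; a minimal Scala sketch of that arithmetic, assuming (not
confirmed by this issue) that the size is narrowed to an Int somewhere:
{code}
// If a ~4.3 GB byte count is ever narrowed to an Int, it wraps to a few MB,
// which would match a multi-GB file being displayed as ~5 MB in the Web UI.
val bytesRead: Long = 4300000000L   // ~4.3 GB read by the single coalesced task
val wrapped: Int = bytesRead.toInt  // 32-bit overflow
println(wrapped)                    // 5032704, i.e. roughly 5 MB
{code}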






[jira] [Reopened] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2014-10-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reopened SPARK-2630:
---

not merged yet, sorry.

> Input data size of CoalescedRDD is incorrect
> 
>
> Key: SPARK-2630
> URL: https://issues.apache.org/jira/browse/SPARK-2630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Davies Liu
>Assignee: Andrew Ash
>Priority: Blocker
> Fix For: 1.2.0
>
> Attachments: overflow.tiff
>
>
> Given one big file, such as text.4.3G, put it in one task, 
> {code}
> sc.textFile("text.4.3.G").coalesce(1).count()
> {code}
> In Web UI of Spark, you will see that the input size is 5.4M. 






[jira] [Commented] (SPARK-2256) pyspark: .take doesn't work ... sometimes ...

2014-10-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157765#comment-14157765
 ] 

Ángel Álvarez commented on SPARK-2256:
--

It seems the problem has been solved in Spark 1.1.0 !!! 


> pyspark: .take doesn't work ... sometimes ...
> --
>
> Key: SPARK-2256
> URL: https://issues.apache.org/jira/browse/SPARK-2256
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.0
> Environment: local file/remote HDFS
>Reporter: Ángel Álvarez
>  Labels: RDD, pyspark, take, windows
> Attachments: A_test.zip
>
>
> If I try to "take" some lines from a file, sometimes it doesn't work
> Code: 
> myfile = sc.textFile("A_ko")
> print myfile.take(10)
> Stacktrace:
> 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19
> Traceback (most recent call last):
>   File "mytest.py", line 19, in 
> print myfile.take(10)
>   File "spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py", line 868, in take
> iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator()
>   File 
> "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py", 
> line 537, in __call__
>   File 
> "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py", 
> line 300, in get_return_value
> Test data:
> 
> A
> A
> A
> 
> 
> 
> 
> 
> 
> 
> 
> 
> AAA

[jira] [Created] (SPARK-3775) Not suitable error message in spark-shell.cmd

2014-10-03 Thread Masayoshi TSUZUKI (JIRA)
Masayoshi TSUZUKI created SPARK-3775:


 Summary: Not suitable error message in spark-shell.cmd
 Key: SPARK-3775
 URL: https://issues.apache.org/jira/browse/SPARK-3775
 Project: Spark
  Issue Type: Improvement
Reporter: Masayoshi TSUZUKI
Priority: Trivial


In a Windows environment, when we execute bin\spark-shell.cmd before we build Spark,
we get an error message like this.

{quote}
Failed to find Spark assembly JAR.
You need to build Spark with sbt\sbt assembly before running this program.
{quote}

But this message is not suitable because:
* Maven is also available to build Spark, and it now works on Windows without 
cygwin ([SPARK-3061]).
* The equivalent error message in the linux version (bin/spark-shell) doesn't 
mention how to build:
bq. You need to build Spark before running this program.
* sbt\sbt can't be executed on Windows without cygwin because it's a bash script.

So this message should be modified to match the linux version.







[jira] [Commented] (SPARK-3775) Not suitable error message in spark-shell.cmd

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157818#comment-14157818
 ] 

Apache Spark commented on SPARK-3775:
-

User 'tsudukim' has created a pull request for this issue:
https://github.com/apache/spark/pull/2640

> Not suitable error message in spark-shell.cmd
> -
>
> Key: SPARK-3775
> URL: https://issues.apache.org/jira/browse/SPARK-3775
> Project: Spark
>  Issue Type: Improvement
>Reporter: Masayoshi TSUZUKI
>Priority: Trivial
>
> In a Windows environment, when we execute bin\spark-shell.cmd before we build 
> Spark, we get an error message like this.
> {quote}
> Failed to find Spark assembly JAR.
> You need to build Spark with sbt\sbt assembly before running this program.
> {quote}
> But this message is not suitable because:
> * Maven is also available to build Spark, and it now works on Windows without 
> cygwin ([SPARK-3061]).
> * The equivalent error message in the linux version (bin/spark-shell) doesn't 
> mention how to build:
> bq. You need to build Spark before running this program.
> * sbt\sbt can't be executed on Windows without cygwin because it's a bash 
> script.
> So this message should be modified to match the linux version.






[jira] [Resolved] (SPARK-3366) Compute best splits distributively in decision tree

2014-10-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3366.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2595
[https://github.com/apache/spark/pull/2595]

> Compute best splits distributively in decision tree
> ---
>
> Key: SPARK-3366
> URL: https://issues.apache.org/jira/browse/SPARK-3366
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Qiping Li
> Fix For: 1.2.0
>
>
> The current implementation computes all best splits locally on the driver, 
> which makes the driver a bottleneck for both communication and computation. 
> It would be nice if we can compute the best splits distributively.
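A minimal sketch of the idea in Scala, not the MLlib implementation: merge per-node
split statistics with reduceByKey and pick each node's best split on the executors,
so only a small (node -> best split) map reaches the driver. NodeStats and the gain
scoring below are hypothetical placeholders.
{code}
import org.apache.spark.rdd.RDD

case class NodeStats(gainPerSplit: Array[Double])

def merge(a: NodeStats, b: NodeStats): NodeStats =
  NodeStats(a.gainPerSplit.zip(b.gainPerSplit).map { case (x, y) => x + y })

def distributedBestSplits(statsByNode: RDD[(Int, NodeStats)]): Map[Int, Int] =
  statsByNode
    .reduceByKey(merge)                                     // combine partial stats on executors
    .mapValues(_.gainPerSplit.zipWithIndex.maxBy(_._1)._2)  // best split index per node
    .collectAsMap()
    .toMap
{code}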






[jira] [Created] (SPARK-3776) Wrong conversion to Catalyst for Option[Product]

2014-10-03 Thread Renat Yusupov (JIRA)
Renat Yusupov created SPARK-3776:


 Summary: Wrong conversion to Catalyst for Option[Product]
 Key: SPARK-3776
 URL: https://issues.apache.org/jira/browse/SPARK-3776
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Renat Yusupov
 Fix For: 1.2.0


Method ScalaReflection.convertToCatalyst makes a wrong conversion for 
Option[Product] data.
For example:
case class A(intValue: Int)
case class B(optionA: Option[A])
val b = B(Some(A(5)))
ScalaReflection.convertToCatalyst(b) returns Seq(A(5)) instead of Seq(Seq(5))
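A hedged sketch of the expected behavior, using Seq[Any] as a stand-in for Catalyst's
internal row representation: the Option should be unwrapped before its Product content
is converted, so B(Some(A(5))) becomes Seq(Seq(5)).
{code}
def toCatalyst(a: Any): Any = a match {
  case Some(v)    => toCatalyst(v)                           // unwrap Option, then convert
  case None       => null
  case p: Product => p.productIterator.map(toCatalyst).toSeq
  case other      => other
}

case class A(intValue: Int)
case class B(optionA: Option[A])

toCatalyst(B(Some(A(5))))   // Seq(Seq(5)), the expected result above
{code}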






[jira] [Commented] (SPARK-3776) Wrong conversion to Catalyst for Option[Product]

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157913#comment-14157913
 ] 

Apache Spark commented on SPARK-3776:
-

User 'r3natko' has created a pull request for this issue:
https://github.com/apache/spark/pull/2641

> Wrong conversion to Catalyst for Option[Product]
> 
>
> Key: SPARK-3776
> URL: https://issues.apache.org/jira/browse/SPARK-3776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Renat Yusupov
> Fix For: 1.2.0
>
>
> Method ScalaReflection.convertToCatalyst makes a wrong conversion for 
> Option[Product] data.
> For example:
> case class A(intValue: Int)
> case class B(optionA: Option[A])
> val b = B(Some(A(5)))
> ScalaReflection.convertToCatalyst(b) returns Seq(A(5)) instead of Seq(Seq(5))






[jira] [Commented] (SPARK-2421) Spark should treat writable as serializable for keys

2014-10-03 Thread Brian Husted (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157936#comment-14157936
 ] 

Brian Husted commented on SPARK-2421:
-

To work around the problem, one must map the Writable to a String 
(org.apache.hadoop.io.Text in the case below).  This is an issue when sorting 
large amounts of data, since Spark will attempt to write out the entire dataset 
(spill) to perform the data conversion.  On a 500GB file this fills up more 
than 100GB of space on each node in our 12 node cluster, which is very 
inefficient.  We are currently using Spark 1.0.2.  Any thoughts here are 
appreciated.

Our code that attempts to mimic map/reduce sort in Spark:

import org.apache.hadoop.io.Text
import org.apache.hadoop.io.compress.DefaultCodec

//read in the hadoop sequence file to sort
val file = sc.sequenceFile(input, classOf[Text], classOf[Text])

//this is the code we would like to avoid: it maps the Hadoop Text input to
//Strings so that sortByKey will run
val converted = file.map{ case (k,v) => (k.toString(), v.toString())}

//perform the sort on the converted data
val sortedOutput = converted.sortByKey(true, 1)

//write out the results as a sequence file
sortedOutput.saveAsSequenceFile(output, Some(classOf[DefaultCodec])) 

> Spark should treat writable as serializable for keys
> 
>
> Key: SPARK-2421
> URL: https://issues.apache.org/jira/browse/SPARK-2421
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API
>Affects Versions: 1.0.0
>Reporter: Xuefu Zhang
>
> It seems that Spark requires the key to be serializable (i.e., the class 
> implements the Serializable interface). In the Hadoop world, the Writable 
> interface is used for the same purpose. A lot of existing classes, while 
> writable, are not considered by Spark as serializable. It would be nice if 
> Spark could treat Writable as serializable and automatically serialize and 
> de-serialize these classes using the Writable interface.
> This is identified in HIVE-7279, but its benefits would be seen globally.






[jira] [Created] (SPARK-3777) Display "Executor ID" for Tasks in Stage page

2014-10-03 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-3777:
---

 Summary: Display "Executor ID" for Tasks in Stage page
 Key: SPARK-3777
 URL: https://issues.apache.org/jira/browse/SPARK-3777
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0, 1.0.2, 1.0.0
Reporter: Shixiong Zhu
Priority: Minor


Now the Stage page only displays "Executor" (host) for tasks. However, there may 
be more than one executor running on the same host. Currently, when some task 
hangs, I only know the host of the faulty executor, so I have to check all 
executors on that host.

Adding "Executor ID" would be helpful to locate the faulty executor.






[jira] [Commented] (SPARK-3777) Display "Executor ID" for Tasks in Stage page

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157944#comment-14157944
 ] 

Apache Spark commented on SPARK-3777:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/2642

> Display "Executor ID" for Tasks in Stage page
> -
>
> Key: SPARK-3777
> URL: https://issues.apache.org/jira/browse/SPARK-3777
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0, 1.0.2, 1.1.0
>Reporter: Shixiong Zhu
>Priority: Minor
>  Labels: easy
>
> Now the Stage page only displays "Executor" (host) for tasks. However, there 
> may be more than one executor running on the same host. Currently, when some 
> task hangs, I only know the host of the faulty executor, so I have to check 
> all executors on that host.
> Adding "Executor ID" would be helpful to locate the faulty executor.






[jira] [Resolved] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.

2014-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3764.
--
Resolution: Not a Problem

> Invalid dependencies of artifacts in Maven Central Repository.
> --
>
> Key: SPARK-3764
> URL: https://issues.apache.org/jira/browse/SPARK-3764
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
>Reporter: Takuya Ueshin
>
> While testing my spark applications locally using spark artifacts downloaded 
> from Maven Central, the following exception was thrown:
> {quote}
> ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
> Thread[Executor task launch worker-2,5,main]
> java.lang.IncompatibleClassChangeError: Found class 
> org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
>   at 
> org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
>   at 
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}
> This is because the hadoop class {{TaskAttemptContext}} is incompatible 
> between hadoop-1 and hadoop-2.
> I guess the spark artifacts in Maven Central were built against hadoop-2 with 
> Maven, but the hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
> hadoop version mismatch happens.
> FYI:
> sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
> correctly resolved.
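A common user-side workaround (a hedged sbt sketch, not part of this issue's
resolution) is to pin hadoop-client to the cluster's hadoop-2 version alongside the
published spark artifacts; the versions below are illustrative.
{code}
// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"     % "1.1.0",
  "org.apache.hadoop" % "hadoop-client" % "2.4.0"  // match your cluster's hadoop version
)
{code}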






[jira] [Reopened] (SPARK-3764) Invalid dependencies of artifacts in Maven Central Repository.

2014-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-3764:
--

This should not be "Fixed" since there was no change or fix that followed.

> Invalid dependencies of artifacts in Maven Central Repository.
> --
>
> Key: SPARK-3764
> URL: https://issues.apache.org/jira/browse/SPARK-3764
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
>Reporter: Takuya Ueshin
>
> While testing my spark applications locally using spark artifacts downloaded 
> from Maven Central, the following exception was thrown:
> {quote}
> ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread 
> Thread[Executor task launch worker-2,5,main]
> java.lang.IncompatibleClassChangeError: Found class 
> org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
>   at 
> org.apache.spark.sql.parquet.AppendingParquetOutputFormat.getDefaultWorkFile(ParquetTableOperations.scala:334)
>   at 
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:251)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
>   at 
> org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}
> This is because the hadoop class {{TaskAttemptContext}} is incompatible 
> between hadoop-1 and hadoop-2.
> I guess the spark artifacts in Maven Central were built against hadoop-2 with 
> Maven, but the hadoop version declared in {{pom.xml}} remains 1.0.4, so a 
> hadoop version mismatch happens.
> FYI:
> sbt seems to publish an 'effective pom'-like pom file, so the dependencies are 
> correctly resolved.






[jira] [Commented] (SPARK-3769) SparkFiles.get gives me the wrong fully qualified path

2014-10-03 Thread Tom Weber (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157946#comment-14157946
 ] 

Tom Weber commented on SPARK-3769:
--

I believe I originally called it on the driver side, but the addFile call makes 
a local copy, so when you call it there, you get the local copy path, which 
isn't the same path as where it ends up on the remote worker nodes.
I'm good with stripping the path off and only passing the file name itself to 
the get call.
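A hedged Scala sketch of the workaround described above: register the file with its
full local path, but pass only the bare file name to SparkFiles.get on the executors
(the path is taken from the report below; `sc` is an existing SparkContext).
{code}
import org.apache.spark.SparkFiles

sc.addFile("/opt/tom/SparkFiles.sas")
val fileName = new java.io.File("/opt/tom/SparkFiles.sas").getName  // "SparkFiles.sas"

sc.parallelize(1 to 10).mapPartitions { iter =>
  val pgm = SparkFiles.get(fileName)  // resolves under the executor's work directory
  iter.map(_ => pgm)
}.first()
{code}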

> SparkFiles.get gives me the wrong fully qualified path
> --
>
> Key: SPARK-3769
> URL: https://issues.apache.org/jira/browse/SPARK-3769
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.0.2, 1.1.0
> Environment: linux host, and linux grid.
>Reporter: Tom Weber
>Priority: Minor
>
> My spark pgm running on my host, (submitting work to my grid).
> JavaSparkContext sc =new JavaSparkContext(conf);
> final String path = args[1];
> sc.addFile(path); /* args[1] = /opt/tom/SparkFiles.sas */
> The log shows:
> 14/10/02 16:07:14 INFO Utils: Copying /opt/tom/SparkFiles.sas to 
> /tmp/spark-4c661c3f-cb57-4c9f-a0e9-c2162a89db77/SparkFiles.sas
> 14/10/02 16:07:15 INFO SparkContext: Added file /opt/tom/SparkFiles.sas at 
> http://10.20.xx.xx:49587/files/SparkFiles.sas with timestamp 1412280434986
> those are paths on my host machine. The location that this file gets on grid 
> nodes is:
> /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/SparkFiles.sas
> While the call to get the path in my code that runs in my mapPartitions 
> function on the grid nodes is:
> String pgm = SparkFiles.get(path);
> And this returns the following string:
> /opt/tom/spark-1.1.0-bin-hadoop2.4/work/app-20141002160704-0002/1/./opt/tom/SparkFiles.sas
> So, am I expected to take the qualified path that was given to me and parse 
> it to get only the file name at the end, and then concatenate that to what I 
> get from the SparkFiles.getRootDirectory() call in order to get this to work?
> Or pass only the parsed file name to the SparkFiles.get method? Seems as 
> though I should be able to pass the same file specification to both 
> sc.addFile() and SparkFiles.get() and get the correct location of the file.
> Thanks,
> Tom






[jira] [Created] (SPARK-3778) newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn

2014-10-03 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3778:


 Summary: newAPIHadoopRDD doesn't properly pass credentials for 
secure hdfs on yarn
 Key: SPARK-3778
 URL: https://issues.apache.org/jira/browse/SPARK-3778
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Thomas Graves


The newAPIHadoopRDD routine doesn't properly add the credentials to the conf to 
be able to access secure hdfs.

Note that newAPIHadoopFile does handle these because the 
org.apache.hadoop.mapreduce.Job automatically adds it for you.
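A possible user-side workaround suggested by that note (a hedged sketch, not the
eventual fix): route the configuration through org.apache.hadoop.mapreduce.Job so the
current user's credentials are attached before calling newAPIHadoopRDD. The input
path below is hypothetical.
{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

val job = new Job(sc.hadoopConfiguration)   // Job attaches the user's credentials to its conf
FileInputFormat.addInputPath(job, new Path("hdfs:///some/secure/path"))
val rdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])
{code}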






[jira] [Closed] (SPARK-2256) pyspark: .take doesn't work ... sometimes ...

2014-10-03 Thread Matthew Farrellee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Farrellee closed SPARK-2256.

   Resolution: Fixed
Fix Version/s: 1.1.0

> pyspark: .take doesn't work ... sometimes ...
> --
>
> Key: SPARK-2256
> URL: https://issues.apache.org/jira/browse/SPARK-2256
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.0
> Environment: local file/remote HDFS
>Reporter: Ángel Álvarez
>  Labels: RDD, pyspark, take, windows
> Fix For: 1.1.0
>
> Attachments: A_test.zip
>
>
> If I try to "take" some lines from a file, sometimes it doesn't work
> Code: 
> myfile = sc.textFile("A_ko")
> print myfile.take(10)
> Stacktrace:
> 14/06/24 09:29:27 INFO DAGScheduler: Failed to run take at mytest.py:19
> Traceback (most recent call last):
>   File "mytest.py", line 19, in 
> print myfile.take(10)
>   File "spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py", line 868, in take
> iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator()
>   File 
> "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py", 
> line 537, in __call__
>   File 
> "spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py", 
> line 300, in get_return_value
> Test data:
> 
> A
> A
> A
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 

[jira] [Created] (SPARK-3779) yarn spark.yarn.applicationMaster.waitTries config should be changed to a time period

2014-10-03 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3779:


 Summary: yarn spark.yarn.applicationMaster.waitTries config should 
be changed to a time period
 Key: SPARK-3779
 URL: https://issues.apache.org/jira/browse/SPARK-3779
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves


In PR https://github.com/apache/spark/pull/2577 I added support for using 
spark.yarn.applicationMaster.waitTries in client mode.  But the time it waits 
between loops is different, so it could be confusing to the user.  We also don't 
document how long each loop is, so this config really isn't clear.

We should just change this config to be time based, in ms or seconds.






[jira] [Created] (SPARK-3780) YarnAllocator should look at the container completed diagnostic message

2014-10-03 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3780:


 Summary: YarnAllocator should look at the container completed 
diagnostic message
 Key: SPARK-3780
 URL: https://issues.apache.org/jira/browse/SPARK-3780
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves


Yarn will give us a diagnostic message along with a container complete 
notification. We should print that diagnostic message for the spark user.  

For instance, I believe that if the container gets shot for being over its memory 
limit, yarn would give us a useful diagnostic saying that.  This would be really 
useful for the user to be able to see.
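A minimal Scala sketch of the idea, assuming a handler that receives the YARN
ContainerStatus (not the actual YarnAllocator code):
{code}
import org.apache.hadoop.yarn.api.records.ContainerStatus

def onContainerCompleted(status: ContainerStatus): Unit = {
  val diag = Option(status.getDiagnostics).getOrElse("")
  if (status.getExitStatus != 0) {
    // Surface YARN's own explanation, e.g. a container killed for exceeding memory limits.
    println(s"Container ${status.getContainerId} exited with ${status.getExitStatus}: $diag")
  }
}
{code}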






[jira] [Created] (SPARK-3781) code style format

2014-10-03 Thread sjk (JIRA)
sjk created SPARK-3781:
--

 Summary: code style format
 Key: SPARK-3781
 URL: https://issues.apache.org/jira/browse/SPARK-3781
 Project: Spark
  Issue Type: Improvement
Reporter: sjk









[jira] [Commented] (SPARK-3781) code style format

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158074#comment-14158074
 ] 

Apache Spark commented on SPARK-3781:
-

User 'shijinkui' has created a pull request for this issue:
https://github.com/apache/spark/pull/2644

> code style format
> -
>
> Key: SPARK-3781
> URL: https://issues.apache.org/jira/browse/SPARK-3781
> Project: Spark
>  Issue Type: Improvement
>Reporter: sjk
>







[jira] [Created] (SPARK-3782) AkkaUtils directly using log4j

2014-10-03 Thread Martin Gilday (JIRA)
Martin Gilday created SPARK-3782:


 Summary: AkkaUtils directly using log4j
 Key: SPARK-3782
 URL: https://issues.apache.org/jira/browse/SPARK-3782
 Project: Spark
  Issue Type: Bug
Reporter: Martin Gilday


AkkaUtils is calling setLevel on Logger from log4j. This causes issues when 
using another implementation of SLF4J, such as logback, because the 
log4j-over-slf4j jar's implementation of this class does not contain this 
method on Logger.
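One defensive approach consistent with the report (a hedged sketch, not the AkkaUtils
code): only call the log4j-specific setLevel when the real log4j binding is active,
so a log4j-over-slf4j bridge is never asked for a method it may not have. The logger
name and level below are illustrative.
{code}
import org.slf4j.LoggerFactory

val log4jIsTheBinding =
  LoggerFactory.getILoggerFactory.getClass.getName == "org.slf4j.impl.Log4jLoggerFactory"

if (log4jIsTheBinding) {
  org.apache.log4j.Logger.getLogger("Remoting").setLevel(org.apache.log4j.Level.ERROR)
}
{code}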






[jira] [Created] (SPARK-3783) The type parameters for SparkContext.accumulable are inconsistent with Accumulable itself

2014-10-03 Thread Nathan Kronenfeld (JIRA)
Nathan Kronenfeld created SPARK-3783:


 Summary: The type parameters for SparkContext.accumulable are 
inconsistent with Accumulable itself
 Key: SPARK-3783
 URL: https://issues.apache.org/jira/browse/SPARK-3783
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Nathan Kronenfeld
Priority: Minor


SparkContext.accumulable takes type parameters [T, R] and passes them to 
Accumulable, in that order.
Accumulable takes type parameters [R, T].
So T for SparkContext.accumulable corresponds with R for Accumulable and vice 
versa.
Minor, but very confusing.
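A toy Scala paraphrase of the mismatch (simplified names, not the Spark signatures):
the factory declares [T, R] but forwards the parameters positionally to a class
declared [R, T], so each name ends up playing the other's role.
{code}
class AccumulableLike[R, T](val initialValue: R)   // R = accumulated result, T = added element

object SparkContextLike {
  // "T" here is really the result type from AccumulableLike's point of view.
  def accumulable[T, R](initialValue: T): AccumulableLike[T, R] =
    new AccumulableLike[T, R](initialValue)
}
{code}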






[jira] [Commented] (SPARK-3783) The type parameters for SparkContext.accumulable are inconsistent with Accumulable itself

2014-10-03 Thread Nathan Kronenfeld (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158176#comment-14158176
 ] 

Nathan Kronenfeld commented on SPARK-3783:
--

https://github.com/apache/spark/pull/2637

> The type parameters for SparkContext.accumulable are inconsistent with 
> Accumulable itself
> 
>
> Key: SPARK-3783
> URL: https://issues.apache.org/jira/browse/SPARK-3783
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Nathan Kronenfeld
>Priority: Minor
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> SparkContext.accumulable takes type parameters [T, R] and passes them to 
> Accumulable, in that order.
> Accumulable takes type parameters [R, T].
> So T for SparkContext.accumulable corresponds with R for Accumulable and vice 
> versa.
> Minor, but very confusing.






[jira] [Created] (SPARK-3785) Support off-loading computations to a GPU

2014-10-03 Thread Thomas Darimont (JIRA)
Thomas Darimont created SPARK-3785:
--

 Summary: Support off-loading computations to a GPU
 Key: SPARK-3785
 URL: https://issues.apache.org/jira/browse/SPARK-3785
 Project: Spark
  Issue Type: Brainstorming
  Components: MLlib
Reporter: Thomas Darimont
Priority: Minor


Are there any plans to add support for off-loading computations to the GPU, 
e.g. via an OpenCL binding?

http://www.jocl.org/
https://code.google.com/p/javacl/
http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL






[jira] [Created] (SPARK-3784) Support off-loading computations to a GPU

2014-10-03 Thread Thomas Darimont (JIRA)
Thomas Darimont created SPARK-3784:
--

 Summary: Support off-loading computations to a GPU
 Key: SPARK-3784
 URL: https://issues.apache.org/jira/browse/SPARK-3784
 Project: Spark
  Issue Type: Brainstorming
  Components: MLlib
Reporter: Thomas Darimont
Priority: Minor


Are there any plans to add support for off-loading computations to the GPU, 
e.g. via an OpenCL binding?

http://www.jocl.org/
https://code.google.com/p/javacl/
http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL






[jira] [Closed] (SPARK-2058) SPARK_CONF_DIR should override all present configs

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-2058.

   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1

> SPARK_CONF_DIR should override all present configs
> --
>
> Key: SPARK-2058
> URL: https://issues.apache.org/jira/browse/SPARK-2058
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.0.1, 1.1.0
>Reporter: Eugen Cepoi
>Assignee: Eugen Cepoi
>Priority: Critical
> Fix For: 1.1.1, 1.2.0
>
>
> When the user defines SPARK_CONF_DIR I think spark should use all the configs 
> available there not only spark-env.
> This involves changing SparkSubmitArguments to first read from 
> SPARK_CONF_DIR, and updating the scripts to add SPARK_CONF_DIR to the 
> computed classpath for configs such as log4j, metrics, etc.
> I have already prepared a PR for this. 
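A hedged Scala sketch of the idea described above (not the actual SparkSubmitArguments
change): prefer a spark-defaults.conf found under SPARK_CONF_DIR, falling back to
SPARK_HOME/conf.
{code}
import java.io.File

def defaultPropertiesFile(env: Map[String, String] = sys.env): Option[File] = {
  val candidates = Seq(
    env.get("SPARK_CONF_DIR").map(d => new File(d, "spark-defaults.conf")),
    env.get("SPARK_HOME").map(h => new File(new File(h, "conf"), "spark-defaults.conf"))
  ).flatten
  candidates.find(_.isFile)   // SPARK_CONF_DIR wins when both exist
}
{code}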






[jira] [Updated] (SPARK-2058) SPARK_CONF_DIR should override all present configs

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2058:
-
Assignee: Eugen Cepoi

> SPARK_CONF_DIR should override all present configs
> --
>
> Key: SPARK-2058
> URL: https://issues.apache.org/jira/browse/SPARK-2058
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.0.1, 1.1.0
>Reporter: Eugen Cepoi
>Assignee: Eugen Cepoi
>Priority: Critical
> Fix For: 1.1.1, 1.2.0
>
>
> When the user defines SPARK_CONF_DIR I think spark should use all the configs 
> available there not only spark-env.
> This involves changing SparkSubmitArguments to first read from 
> SPARK_CONF_DIR, and updating the scripts to add SPARK_CONF_DIR to the 
> computed classpath for configs such as log4j, metrics, etc.
> I have already prepared a PR for this. 






[jira] [Commented] (SPARK-3706) Cannot run IPython REPL with IPYTHON set to "1" and PYSPARK_PYTHON unset

2014-10-03 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158200#comment-14158200
 ] 

Josh Rosen commented on SPARK-3706:
---

This introduced a problem, since after this patch we now use IPython on the 
workers; see SPARK-3772 for more details.

> Cannot run IPython REPL with IPYTHON set to "1" and PYSPARK_PYTHON unset
> 
>
> Key: SPARK-3706
> URL: https://issues.apache.org/jira/browse/SPARK-3706
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0
>Reporter: cocoatomo
>  Labels: pyspark
> Fix For: 1.2.0
>
>
> h3. Problem
> The section "Using the shell" in Spark Programming Guide 
> (https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell) 
> says that we can run pyspark REPL through IPython.
> But the following command does not run IPython but the default Python executable.
> {quote}
> $ IPYTHON=1 ./bin/pyspark
> Python 2.7.8 (default, Jul  2 2014, 10:14:46) 
> ...
> {quote}
> The spark/bin/pyspark script at commit 
> b235e013638685758885842dc3268e9800af3678 decides which executable and options 
> to use in the following way:
> # if PYSPARK_PYTHON unset
> #* → defaulting to "python"
> # if IPYTHON_OPTS set
> #* → set IPYTHON "1"
> # a python script passed to ./bin/pyspark → run it with ./bin/spark-submit
> #* out of this issue's scope
> # if IPYTHON set as "1"
> #* → execute $PYSPARK_PYTHON (default: ipython) with arguments $IPYTHON_OPTS
> #* otherwise execute $PYSPARK_PYTHON
> Therefore, when PYSPARK_PYTHON is unset, python is executed even though IPYTHON 
> is "1".
> In other words, when PYSPARK_PYTHON is unset, IPYTHON_OPTS and IPYTHON have no 
> effect on deciding which command to use.
> ||PYSPARK_PYTHON||IPYTHON_OPTS||IPYTHON||resulting command||expected command||
> |(unset → defaults to python)|(unset)|(unset)|python|(same)|
> |(unset → defaults to python)|(unset)|1|python|ipython|
> |(unset → defaults to python)|an_option|(unset → set to 1)|python 
> an_option|ipython an_option|
> |(unset → defaults to python)|an_option|1|python an_option|ipython an_option|
> |ipython|(unset)|(unset)|ipython|(same)|
> |ipython|(unset)|1|ipython|(same)|
> |ipython|an_option|(unset → set to 1)|ipython an_option|(same)|
> |ipython|an_option|1|ipython an_option|(same)|
> h3. Suggestion
> The pyspark script should first determine whether a user wants to run 
> IPython or another executable.
> # if IPYTHON_OPTS set
> #* set IPYTHON "1"
> # if IPYTHON has a value "1"
> #* PYSPARK_PYTHON defaults to "ipython" if not set
> # PYSPARK_PYTHON defaults to "python" if not set
> See the pull request for more detailed modification.






[jira] [Commented] (SPARK-2058) SPARK_CONF_DIR should override all present configs

2014-10-03 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158197#comment-14158197
 ] 

Andrew Or commented on SPARK-2058:
--

To give a quick update, this change has not made it to any releases yet. It 
will be in the future releases 1.1.1 and 1.2.0, however.

> SPARK_CONF_DIR should override all present configs
> --
>
> Key: SPARK-2058
> URL: https://issues.apache.org/jira/browse/SPARK-2058
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.0.1, 1.1.0
>Reporter: Eugen Cepoi
>Assignee: Eugen Cepoi
>Priority: Critical
> Fix For: 1.1.1, 1.2.0
>
>
> When the user defines SPARK_CONF_DIR I think spark should use all the configs 
> available there not only spark-env.
> This involves changing SparkSubmitArguments to first read from 
> SPARK_CONF_DIR, and updating the scripts to add SPARK_CONF_DIR to the 
> computed classpath for configs such as log4j, metrics, etc.
> I have already prepared a PR for this. 






[jira] [Created] (SPARK-3786) Speedup tests of PySpark

2014-10-03 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3786:
-

 Summary: Speedup tests of PySpark
 Key: SPARK-3786
 URL: https://issues.apache.org/jira/browse/SPARK-3786
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu


It takes about 20 minutes (about 25% of all the tests) to run all the tests of 
PySpark.

The slowest ones are tests.py and streaming/tests.py; they create a new JVM and 
SparkContext for each test case. It would be faster to reuse the SparkContext 
for most cases.








[jira] [Updated] (SPARK-3696) Do not override user-defined conf_dir in spark-config.sh

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3696:
-
Assignee: WangTaoTheTonic

> Do not override user-defined conf_dir in spark-config.sh
> 
>
> Key: SPARK-3696
> URL: https://issues.apache.org/jira/browse/SPARK-3696
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: WangTaoTheTonic
>Assignee: WangTaoTheTonic
>Priority: Minor
> Fix For: 1.1.1, 1.2.0
>
>
Many scripts now use spark-config.sh, in which SPARK_CONF_DIR is directly 
assigned to SPARK_HOME/conf. This is inconvenient for those who define 
SPARK_CONF_DIR in their environment.






[jira] [Closed] (SPARK-3696) Do not override user-defined conf_dir in spark-config.sh

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3696.

  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Target Version/s: 1.1.1, 1.2.0

> Do not override user-defined conf_dir in spark-config.sh
> 
>
> Key: SPARK-3696
> URL: https://issues.apache.org/jira/browse/SPARK-3696
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: WangTaoTheTonic
>Assignee: WangTaoTheTonic
>Priority: Minor
> Fix For: 1.1.1, 1.2.0
>
>
> Many scripts now use spark-config.sh, in which SPARK_CONF_DIR is directly 
> assigned to SPARK_HOME/conf. This is inconvenient for those who define 
> SPARK_CONF_DIR in their environment.






[jira] [Updated] (SPARK-1655) In naive Bayes, store conditional probabilities distributively.

2014-10-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1655:
-
Assignee: Aaron Staple

> In naive Bayes, store conditional probabilities distributively.
> ---
>
> Key: SPARK-1655
> URL: https://issues.apache.org/jira/browse/SPARK-1655
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Aaron Staple
>
> In the current implementation, we collect all conditional probabilities to 
> the driver node. When there are many labels and many features, this puts 
> heavy load on the driver. For scalability, we should provide a way to store 
> conditional probabilities distributively.






[jira] [Resolved] (SPARK-2693) Support for UDAF Hive Aggregates like PERCENTILE

2014-10-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2693.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2620
[https://github.com/apache/spark/pull/2620]

> Support for UDAF Hive Aggregates like PERCENTILE
> 
>
> Key: SPARK-2693
> URL: https://issues.apache.org/jira/browse/SPARK-2693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Ravindra Pesala
>Priority: Critical
> Fix For: 1.2.0
>
>
> {code}
> SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), 
> year,month,day FROM  raw_data_table  GROUP BY year, month, day
> MIN, MAX and AVG functions work fine for me, but with PERCENTILE, I get an 
> error as shown below.
> Exception in thread "main" java.lang.RuntimeException: No handler for udf 
> class org.apache.hadoop.hive.ql.udf.UDAFPercentile
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
> {code}
> This aggregate extends UDAF, which we don't yet have a wrapper for.






[jira] [Closed] (SPARK-2778) Add unit tests for Yarn integration

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-2778.

  Resolution: Fixed
Target Version/s: 1.2.0

> Add unit tests for Yarn integration
> ---
>
> Key: SPARK-2778
> URL: https://issues.apache.org/jira/browse/SPARK-2778
> Project: Spark
>  Issue Type: Test
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.2.0
>
> Attachments: yarn-logs.txt
>
>
> It would be nice to add some Yarn integration tests to the unit tests in 
> Spark; Yarn provides a "MiniYARNCluster" class that can be used to spawn a 
> cluster locally.
> UPDATE: These tests are causing exceptions in our nightly build:
> {code}
> sbt.ForkMain$ForkError: Call From test04.amplab/10.123.1.2 to test04.amplab:0 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
>   at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:35)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:68)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:61)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:310)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriver$.main(YarnClusterSuite.scala:140)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply$mcV$sp(YarnClusterSuite.scala:91)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
>   at 

[jira] [Updated] (SPARK-3710) YARN integration test is flaky

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3710:
-
Affects Version/s: 1.2.0

> YARN integration test is flaky
> --
>
> Key: SPARK-3710
> URL: https://issues.apache.org/jira/browse/SPARK-3710
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Patrick Wendell
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 1.2.0
>
>
> This has been regularly failing the master build:
> Example failure: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/738/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink
> One thing to look at is whether the YARN mini cluster makes assumptions about 
> being able to bind to specific ports.
> {code}
> sbt.ForkMain$ForkError: Call From test04.amplab/10.123.1.2 to test04.amplab:0 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
>   at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:35)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:68)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:61)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:310)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriver$.main(YarnClusterSuite.scala:140)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply$mcV$sp(YarnClusterSuite.scala:91)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
>   at 
> org.scalatest

[jira] [Closed] (SPARK-3710) YARN integration test is flaky

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3710.

   Resolution: Fixed
Fix Version/s: 1.2.0

> YARN integration test is flaky
> --
>
> Key: SPARK-3710
> URL: https://issues.apache.org/jira/browse/SPARK-3710
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Patrick Wendell
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 1.2.0
>
>
> This has been regularly failing the master build:
> Example failure: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/738/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink
> One thing to look at is whether the YARN mini cluster makes assumptions about 
> being able to bind to specific ports.
> {code}
> sbt.ForkMain$ForkError: Call From test04.amplab/10.123.1.2 to test04.amplab:0 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
>   at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:35)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:68)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:61)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:310)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriver$.main(YarnClusterSuite.scala:140)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply$mcV$sp(YarnClusterSuite.scala:91)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
>   

[jira] [Commented] (SPARK-3710) YARN integration test is flaky

2014-10-03 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158326#comment-14158326
 ] 

Andrew Or commented on SPARK-3710:
--

https://github.com/apache/spark/pull/2605

> YARN integration test is flaky
> --
>
> Key: SPARK-3710
> URL: https://issues.apache.org/jira/browse/SPARK-3710
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Patrick Wendell
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 1.2.0
>
>
> This has been regularly failing the master build:
> Example failure: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/738/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink
> One thing to look at is whether the YARN mini cluster makes assumptions about 
> being able to bind to specific ports.
> {code}
> sbt.ForkMain$ForkError: Call From test04.amplab/10.123.1.2 to test04.amplab:0 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
>   at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:35)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:68)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:61)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:310)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriver$.main(YarnClusterSuite.scala:140)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply$mcV$sp(YarnClusterSuite.scala:91)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
>   at org.scal

[jira] [Commented] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4

2014-10-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158350#comment-14158350
 ] 

Marcelo Vanzin commented on SPARK-3761:
---

How exactly are you packaging and submitting your application? What are the 
contents of your app's jar file? CDH doesn't support Windows, but still, that 
information can help figure out where the problem lies.
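
(For reference, a minimal sbt-assembly style setup, offered as a hypothetical sketch rather 
than the reporter's actual build, that bundles the application's own classes, including the 
compiler-generated SimpleApp$$anonfun$1, into the jar shipped to executors:)

{code}
// project/plugins.sbt (hypothetical)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// build.sbt (hypothetical)
name := "SparkQueryDemo"

scalaVersion := "2.10.4"

// Mark Spark as "provided" so the assembled jar contains only the app's classes;
// the cluster already provides the Spark assembly itself.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"
{code}

The assembled jar can then be registered on the SparkConf (for example via setJars) or passed 
to spark-submit so that executors are able to load the anonymous function classes.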

> Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
> -
>
> Key: SPARK-3761
> URL: https://issues.apache.org/jira/browse/SPARK-3761
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Igor Tkachenko
>Priority: Blocker
>
> I have Scala code:
> val master = "spark://:7077"
> val sc = new SparkContext(new SparkConf()
>   .setMaster(master)
>   .setAppName("SparkQueryDemo 01")
>   .set("spark.executor.memory", "512m"))
> val count2 = sc.textFile("hdfs://<address>:8020/tmp/data/risk/account.txt")
>   .filter(line => line.contains("Word"))
>   .count()
> I've got such an error:
> [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to 
> stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception 
> failure in TID 6 on host : java.lang.ClassNotFoundException: 
> SimpleApp$$anonfun$1
> My dependencies :
> object Version {
>   val spark= "1.0.0-cdh5.1.0"
> }
> object Library {
>   val sparkCore  = "org.apache.spark"  % "spark-assembly_2.10"  % 
> Version.spark
> }
> My OS is Win 7, sbt 13.5, Scala 2.10.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU

2014-10-03 Thread Reza Farivar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158363#comment-14158363
 ] 

Reza Farivar commented on SPARK-3785:
-

Olivier Chafik, who wrote JavaCL (which you mentioned in your description), also 
has a beta-stage ScalaCL package on GitHub:
https://github.com/ochafik/ScalaCL

There was also another project trying to bring OpenCL to Java: Aparapi. The neat 
thing about Aparapi is that it doesn't require you to write OpenCL kernels in C; 
instead it translates Java loops into OpenCL code at run time. It seems the 
ScalaCL project has similar goals for Scala. 

> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to adding support for off-loading computations to the 
> GPU, e.g. via an open-cl binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3710) YARN integration test is flaky

2014-10-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158364#comment-14158364
 ] 

Marcelo Vanzin commented on SPARK-3710:
---

Hmm. For some reason the e-mail for this bug ended up in my spam box. Anyway, the 
fix was also tracked in SPARK-2778.

> YARN integration test is flaky
> --
>
> Key: SPARK-3710
> URL: https://issues.apache.org/jira/browse/SPARK-3710
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Patrick Wendell
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 1.2.0
>
>
> This has been regularly failing the master build:
> Example failure: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/738/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink
> One thing to look at is whether the YARN mini cluster makes assumptions about 
> being able to bind to specific ports.
> {code}
> sbt.ForkMain$ForkError: Call From test04.amplab/10.123.1.2 to test04.amplab:0 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
>   at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:35)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:68)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:61)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:310)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriver$.main(YarnClusterSuite.scala:140)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply$mcV$sp(YarnClusterSuite.scala:91)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scal

[jira] [Commented] (SPARK-3710) YARN integration test is flaky

2014-10-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158376#comment-14158376
 ] 

Marcelo Vanzin commented on SPARK-3710:
---

I filed a Yarn bug (YARN-2642), although we can't get rid of the workaround 
since we need to support existing versions of Yarn.
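
(For context, the usual way for a test cluster to avoid fixed-port assumptions is to bind 
to port 0 and then read back the port the OS actually assigned; the "test04.amplab:0" in 
the stack trace suggests a client picked up the port from configuration before the real 
one was published. A minimal, generic illustration, not the YARN mini cluster's code:)

{code}
import java.net.ServerSocket

// Bind to an ephemeral port chosen by the OS, then publish the real port to clients.
val socket = new ServerSocket(0)
val actualPort = socket.getLocalPort  // the concrete port, never 0 once bound
println(s"listening on port $actualPort")
socket.close()
{code}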

> YARN integration test is flaky
> --
>
> Key: SPARK-3710
> URL: https://issues.apache.org/jira/browse/SPARK-3710
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Patrick Wendell
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 1.2.0
>
>
> This has been regularly failing the master build:
> Example failure: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/738/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/#showFailuresLink
> One thing to look at is whether the YARN mini cluster makes assumptions about 
> being able to bind to specific ports.
> {code}
> sbt.ForkMain$ForkError: Call From test04.amplab/10.123.1.2 to test04.amplab:0 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.GeneratedConstructorAccessor5.newInstance(Unknown Source)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1351)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1300)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy11.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:152)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy12.getClusterMetrics(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:246)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$submitApplication$1.apply(Client.scala:69)
>   at org.apache.spark.Logging$class.logInfo(Logging.scala:59)
>   at org.apache.spark.deploy.yarn.Client.logInfo(Client.scala:35)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:68)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:61)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:310)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriver$.main(YarnClusterSuite.scala:140)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply$mcV$sp(YarnClusterSuite.scala:91)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$1.apply(YarnClusterSuite.scala:89)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306

[jira] [Resolved] (SPARK-3007) Add "Dynamic Partition" support to Spark Sql hive

2014-10-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3007.

Resolution: Fixed

Okay, this was merged again:

https://github.com/apache/spark/pull/2616

> Add "Dynamic Partition" support  to  Spark Sql hive
> ---
>
> Key: SPARK-3007
> URL: https://issues.apache.org/jira/browse/SPARK-3007
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: baishuo
> Fix For: 1.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3212) Improve the clarity of caching semantics

2014-10-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3212.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2501
[https://github.com/apache/spark/pull/2501]

> Improve the clarity of caching semantics
> 
>
> Key: SPARK-3212
> URL: https://issues.apache.org/jira/browse/SPARK-3212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
> Fix For: 1.2.0
>
>
> Right now there are a bunch of different ways to cache tables in Spark SQL. 
> For example:
>  - tweets.cache()
>  - sql("SELECT * FROM tweets").cache()
>  - table("tweets").cache()
>  - tweets.cache().registerTempTable(tweets)
>  - sql("CACHE TABLE tweets")
>  - cacheTable("tweets")
> Each of the above commands has subtly different semantics, leading to a very 
> confusing user experience.  Ideally, we would stop doing caching based on 
> simple table names and instead have a phase of optimization that does 
> intelligent matching of query plans with available cached data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1379) Calling .cache() on a SchemaRDD should do something more efficient than caching the individual row objects.

2014-10-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-1379.
-
Resolution: Fixed

> Calling .cache() on a SchemaRDD should do something more efficient than 
> caching the individual row objects.
> ---
>
> Key: SPARK-1379
> URL: https://issues.apache.org/jira/browse/SPARK-1379
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>
> Since rows aren't black boxes we could use InMemoryColumnarTableScan.  This 
> would significantly reduce GC pressure on the workers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3641) Correctly populate SparkPlan.currentContext

2014-10-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3641.
-
Resolution: Fixed
  Assignee: Michael Armbrust  (was: Yin Huai)

> Correctly populate SparkPlan.currentContext
> ---
>
> Key: SPARK-3641
> URL: https://issues.apache.org/jira/browse/SPARK-3641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Yin Huai
>Assignee: Michael Armbrust
>Priority: Critical
>
> After creating a new SQLContext, we need to populate SparkPlan.currentContext 
> before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD 
> populate SparkPlan.currentContext. SQLContext.applySchema is missing this 
> call and we can have NPE as described in 
> http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1671) Cached tables should follow write-through policy

2014-10-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-1671.
-
Resolution: Fixed

I'm gonna mark this as resolved now that we do at least invalidate the cache 
when writing through.  We can create a follow-up JIRA for partial invalidation 
if we want.

> Cached tables should follow write-through policy
> 
>
> Key: SPARK-1671
> URL: https://issues.apache.org/jira/browse/SPARK-1671
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Cheng Lian
>Assignee: Michael Armbrust
>  Labels: cache, column
>
> Writing (insert / load) to a cached table causes cache inconsistency, and the 
> user has to unpersist and cache the whole table again.
> The write-through policy may be implemented with {{RDD.union}}.
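
(A rough illustration of that idea, with hypothetical names rather than the actual Spark SQL 
code, where newly written rows are unioned with the cached data instead of invalidating the 
whole table:)

{code}
// Hypothetical sketch of a write-through cache built on RDD.union.
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

class CachedTable[T](initial: RDD[T]) {
  private var cached: RDD[T] = initial.persist(StorageLevel.MEMORY_ONLY)

  // On insert/load, union the new rows with the cached RDD and re-persist,
  // instead of forcing the user to unpersist and re-cache the whole table.
  def append(newRows: RDD[T]): Unit = {
    val updated = cached.union(newRows).persist(StorageLevel.MEMORY_ONLY)
    cached.unpersist(blocking = false)
    cached = updated
  }

  def rdd: RDD[T] = cached
}
{code}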



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2973) Add a way to show tables without executing a job

2014-10-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2973:

Assignee: Cheng Lian  (was: Michael Armbrust)

> Add a way to show tables without executing a job
> 
>
> Key: SPARK-2973
> URL: https://issues.apache.org/jira/browse/SPARK-2973
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Aaron Davidson
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.2.0
>
>
> Right now, sql("show tables").collect() will start a Spark job which shows up 
> in the UI. There should be a way to get these without this step.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3535.

   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1

> Spark on Mesos not correctly setting heap overhead
> --
>
> Key: SPARK-3535
> URL: https://issues.apache.org/jira/browse/SPARK-3535
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.1.0
>Reporter: Brenden Matthews
>Assignee: Brenden Matthews
> Fix For: 1.1.1, 1.2.0
>
>
> Spark on Mesos does not account for any memory overhead.  The result is that 
> tasks are OOM killed nearly 95% of the time.
> Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
> executor memory for JVM overhead.
> For example, see: 
> https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63
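
(As a rough worked example of the 15-25% guideline above: for an 8 GB executor, roughly 
1.2 to 2 GB would be set aside for JVM overhead, and the Mesos resource request would be 
heap plus overhead. A hypothetical sketch, with illustrative constants rather than the 
values Spark actually uses:)

{code}
// Illustrative overhead calculation; the fraction and floor are assumptions.
val executorMemoryMb = 8 * 1024                     // requested executor heap, 8 GB
val overheadFraction = 0.15                         // lower end of the 15-25% guideline
val minOverheadMb    = 384                          // assumed floor for small executors
val overheadMb = math.max(minOverheadMb, (executorMemoryMb * overheadFraction).toInt)
val mesosRequestMb = executorMemoryMb + overheadMb  // 8192 + 1228 = 9420 MB
{code}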



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3535) Spark on Mesos not correctly setting heap overhead

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3535:
-
Assignee: Brenden Matthews

> Spark on Mesos not correctly setting heap overhead
> --
>
> Key: SPARK-3535
> URL: https://issues.apache.org/jira/browse/SPARK-3535
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.1.0
>Reporter: Brenden Matthews
>Assignee: Brenden Matthews
>
> Spark on Mesos does not account for any memory overhead.  The result is that 
> tasks are OOM killed nearly 95% of the time.
> Like with the Hadoop on Mesos project, Spark should set aside 15-25% of the 
> executor memory for JVM overhead.
> For example, see: 
> https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/ResourcePolicy.java#L55-L63



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3775) Not suitable error message in spark-shell.cmd

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3775:
-
Affects Version/s: 1.1.0

> Not suitable error message in spark-shell.cmd
> -
>
> Key: SPARK-3775
> URL: https://issues.apache.org/jira/browse/SPARK-3775
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Masayoshi TSUZUKI
>Priority: Trivial
>
> In a Windows environment, when we execute bin\spark-shell.cmd before we build 
> Spark, we get an error message like this:
> {quote}
> Failed to find Spark assembly JAR.
> You need to build Spark with sbt\sbt assembly before running this program.
> {quote}
> But this message is not suitable because ...
> * Maven is also available to build Spark, and it works in Windows without 
> cygwin now ([SPARK-3061]).
> * The equivalent error message in the Linux version (bin/spark-shell) doesn't 
> mention how to build:
> bq. You need to build Spark before running this program.
> * sbt\sbt can't be executed in Windows without cygwin because it's a bash 
> script.
> So this message should be modified to match the Linux version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3775) Not suitable error message in spark-shell.cmd

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3775.

  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Assignee: Masayoshi TSUZUKI
Target Version/s: 1.1.1, 1.2.0

> Not suitable error message in spark-shell.cmd
> -
>
> Key: SPARK-3775
> URL: https://issues.apache.org/jira/browse/SPARK-3775
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Masayoshi TSUZUKI
>Assignee: Masayoshi TSUZUKI
>Priority: Trivial
> Fix For: 1.1.1, 1.2.0
>
>
> In a Windows environment, when we execute bin\spark-shell.cmd before we build 
> Spark, we get an error message like this:
> {quote}
> Failed to find Spark assembly JAR.
> You need to build Spark with sbt\sbt assembly before running this program.
> {quote}
> But this message is not suitable because ...
> * Maven is also available to build Spark, and it works in Windows without 
> cygwin now ([SPARK-3061]).
> * The equivalent error message in the Linux version (bin/spark-shell) doesn't 
> mention how to build:
> bq. You need to build Spark before running this program.
> * sbt\sbt can't be executed in Windows without cygwin because it's a bash 
> script.
> So this message should be modified to match the Linux version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3774) typo comment in bin/utils.sh

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3774.

  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Assignee: Masayoshi TSUZUKI
Target Version/s: 1.1.1, 1.2.0

> typo comment in bin/utils.sh
> 
>
> Key: SPARK-3774
> URL: https://issues.apache.org/jira/browse/SPARK-3774
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Shell
>Affects Versions: 1.1.0
>Reporter: Masayoshi TSUZUKI
>Assignee: Masayoshi TSUZUKI
>Priority: Trivial
> Fix For: 1.1.1, 1.2.0
>
>
> typo comment in bin/utils.sh
> {code}
> # Gather all all spark-submit options into SUBMISSION_OPTS
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3606:
-
Fix Version/s: 1.2.0

> Spark-on-Yarn AmIpFilter does not work with Yarn HA.
> 
>
> Key: SPARK-3606
> URL: https://issues.apache.org/jira/browse/SPARK-3606
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.2.0
>
>
> The current IP filter only considers one of the RMs in an HA setup. If the 
> active RM is not the configured one, you get a "connection refused" error 
> when clicking on the Spark AM links in the RM UI.
> Similar to YARN-1811, but for Spark.
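
(A hedged sketch of what "considering all RMs" could look like, assuming the standard YARN 
HA configuration keys; this is illustrative, not the actual patch:)

{code}
import org.apache.hadoop.conf.Configuration

// Collect the web app addresses of all RMs in an HA setup, falling back to the
// single configured RM when HA is not enabled.
def allRmWebAppAddresses(conf: Configuration): Seq[String] = {
  val rmIds = conf.getTrimmedStrings("yarn.resourcemanager.ha.rm-ids")
  if (rmIds == null || rmIds.isEmpty) {
    Seq(conf.get("yarn.resourcemanager.webapp.address"))
  } else {
    rmIds.toSeq.map(id => conf.get("yarn.resourcemanager.webapp.address." + id))
  }
}
{code}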



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3606:
-
 Target Version/s: 1.1.1, 1.2.0
Affects Version/s: (was: 1.2.0)

> Spark-on-Yarn AmIpFilter does not work with Yarn HA.
> 
>
> Key: SPARK-3606
> URL: https://issues.apache.org/jira/browse/SPARK-3606
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.2.0
>
>
> The current IP filter only considers one of the RMs in an HA setup. If the 
> active RM is not the configured one, you get a "connection refused" error 
> when clicking on the Spark AM links in the RM UI.
> Similar to YARN-1811, but for Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3606:
-
Affects Version/s: 1.2.0

> Spark-on-Yarn AmIpFilter does not work with Yarn HA.
> 
>
> Key: SPARK-3606
> URL: https://issues.apache.org/jira/browse/SPARK-3606
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.2.0
>
>
> The current IP filter only considers one of the RMs in an HA setup. If the 
> active RM is not the configured one, you get a "connection refused" error 
> when clicking on the Spark AM links in the RM UI.
> Similar to YARN-1811, but for Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3763) The example of building with sbt should be "sbt assembly" instead of "sbt compile"

2014-10-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3763.

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Kousuke Saruta

> The example of building with sbt should be "sbt assembly" instead of "sbt 
> compile"
> --
>
> Key: SPARK-3763
> URL: https://issues.apache.org/jira/browse/SPARK-3763
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Trivial
> Fix For: 1.2.0
>
>
> In building-spark.md, there are some examples of making an assembled package 
> with Maven, but the example for building with sbt only covers compiling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-10-03 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-1860.
---
Resolution: Fixed

Fixed by mccheah in https://github.com/apache/spark/pull/2609

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> The default values of the standalone worker cleanup code clean up all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> An executor's log/data folders should not be cleaned up while it is still 
> running. Until then, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3786) Speedup tests of PySpark

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158552#comment-14158552
 ] 

Apache Spark commented on SPARK-3786:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2646

> Speedup tests of PySpark
> 
>
> Key: SPARK-3786
> URL: https://issues.apache.org/jira/browse/SPARK-3786
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> It takes about 20 minutes (about 25% of all the tests) to run all the tests 
> of PySpark.
> The slowest ones are tests.py and streaming/tests.py; they create a new JVM and 
> SparkContext for each test case. It would be faster to reuse the SparkContext 
> for most cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads

2014-10-03 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158571#comment-14158571
 ] 

Sandy Ryza commented on SPARK-3561:
---

I think there may be somewhat of a misunderstanding about the relationship 
between Spark and YARN.  YARN is not an "execution environment", but a cluster 
resource manager that has the ability to start processes on behalf of execution 
engines like Spark.  Spark already supports YARN as a cluster resource manager, 
but YARN doesn't provide its own execution engine.  YARN doesn't provide a 
stateless shuffle (although execution engines built atop it like MR and Tez 
do). 

If I understand correctly, the broader intent is to decouple the Spark API from 
the execution engine it runs on top of.  Changing the title to reflect this.  
That said, the Spark API is currently very tightly integrated with its execution 
engine, and frankly, decoupling the two so that Spark could run on top of other 
execution engines with similar properties seems more trouble than it's worth.

> Native Hadoop/YARN integration for batch/ETL workloads
> --
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Oleg Zhurakousky
>  Labels: features
> Fix For: 1.2.0
>
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark provides integration with external resource-managers such as 
> Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
> current architecture of Spark-on-YARN can be enhanced to provide 
> significantly better utilization of cluster resources for large scale, batch 
> and/or ETL applications when run alongside other applications (Spark and 
> others) and services in YARN. 
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> - a gateway and a delegate to Hadoop execution environment - as a non-public 
> api (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method directly maps to the corresponding methods in current version of 
> SparkContext. JobExecutionContext implementation will be accessed by 
> SparkContext via master URL as 
> "execution-context:foo.bar.MyJobExecutionContext" with default implementation 
> containing the existing code from SparkContext, thus allowing current 
> (corresponding) methods of SparkContext to delegate to such implementation. 
> An integrator will now have an option to provide custom implementation of 
> DefaultExecutionContext by either implementing it from scratch or extending 
> from DefaultExecutionContext.
> Please see the attached design doc for more details.
> Pull Request will be posted shortly as well
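
(A hedged sketch of what such a trait might look like, based only on the four operations 
listed above; names and signatures are assumptions, not the proposed API:)

{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Illustrative only: a pluggable gateway to the underlying execution environment.
trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def newAPIHadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](sc: SparkContext, rdd: RDD[T], func: Iterator[T] => U): Array[U]
}
{code}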



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3561:
--
Summary: Decouple Spark's API from its execution engine  (was: Native 
Hadoop/YARN integration for batch/ETL workloads)

> Decouple Spark's API from its execution engine
> --
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Oleg Zhurakousky
>  Labels: features
> Fix For: 1.2.0
>
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark provides integration with external resource-managers such as 
> Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
> current architecture of Spark-on-YARN can be enhanced to provide 
> significantly better utilization of cluster resources for large scale, batch 
> and/or ETL applications when run alongside other applications (Spark and 
> others) and services in YARN. 
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> - a gateway and a delegate to Hadoop execution environment - as a non-public 
> api (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method directly maps to the corresponding methods in current version of 
> SparkContext. JobExecutionContext implementation will be accessed by 
> SparkContext via master URL as 
> "execution-context:foo.bar.MyJobExecutionContext" with default implementation 
> containing the existing code from SparkContext, thus allowing current 
> (corresponding) methods of SparkContext to delegate to such implementation. 
> An integrator will now have an option to provide custom implementation of 
> DefaultExecutionContext by either implementing it from scratch or extending 
> from DefaultExecutionContext.
> Please see the attached design doc for more details.
> Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3787) Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version

2014-10-03 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3787:
-

 Summary: Assembly jar name is wrong when we build with sbt 
omitting -Dhadoop.version
 Key: SPARK-3787
 URL: https://issues.apache.org/jira/browse/SPARK-3787
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.2.0
Reporter: Kousuke Saruta


When we build with sbt with profile for hadoop and without property for hadoop 
version like:

{code}
sbt/sbt -Phadoop-2.2 assembly
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3787) Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version

2014-10-03 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3787:
--
Description: 
When we build with sbt with profile for hadoop and without property for hadoop 
version like:

{code}
sbt/sbt -Phadoop-2.2 assembly
{code}

jar name is always used default version (1.0.4).

When we build with maven with same condition for sbt, default version for each 
profile.
For instance, if we  build like:

{code}
mvn -Phadoop-2.2 package
{code}

jar name is used hadoop2.2.0 as a default version of hadoop-2.2.

  was:
When we build with sbt with profile for hadoop and without property for hadoop 
version like:

{code}
sbt/sbt -Phadoop-2.2 assembly
{code}


> Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version
> ---
>
> Key: SPARK-3787
> URL: https://issues.apache.org/jira/browse/SPARK-3787
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>
> When we build with sbt with profile for hadoop and without property for 
> hadoop version like:
> {code}
> sbt/sbt -Phadoop-2.2 assembly
> {code}
> jar name is always used default version (1.0.4).
> When we build with maven with same condition for sbt, default version for 
> each profile.
> For instance, if we  build like:
> {code}
> mvn -Phadoop-2.2 package
> {code}
> jar name is used hadoop2.2.0 as a default version of hadoop-2.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3787) Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158602#comment-14158602
 ] 

Apache Spark commented on SPARK-3787:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2647

> Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version
> ---
>
> Key: SPARK-3787
> URL: https://issues.apache.org/jira/browse/SPARK-3787
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>
> When we build with sbt with profile for hadoop and without property for 
> hadoop version like:
> {code}
> sbt/sbt -Phadoop-2.2 assembly
> {code}
> jar name is always used default version (1.0.4).
> When we build with maven with same condition for sbt, default version for 
> each profile.
> For instance, if we  build like:
> {code}
> mvn -Phadoop-2.2 package
> {code}
> jar name is used hadoop2.2.0 as a default version of hadoop-2.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3787) Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version

2014-10-03 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3787:
--
Description: 
When we build with sbt with profile for hadoop and without property for hadoop 
version like:

{code}
sbt/sbt -Phadoop-2.2 assembly
{code}

jar name is always used default version (1.0.4).

When we build with maven with same condition for sbt, default version for each 
profile is used.
For instance, if we  build like:

{code}
mvn -Phadoop-2.2 package
{code}

jar name is used hadoop2.2.0 as a default version of hadoop-2.2.

  was:
When we build with sbt with profile for hadoop and without property for hadoop 
version like:

{code}
sbt/sbt -Phadoop-2.2 assembly
{code}

jar name is always used default version (1.0.4).

When we build with maven with same condition for sbt, default version for each 
profile.
For instance, if we  build like:

{code}
mvn -Phadoop-2.2 package
{code}

jar name is used hadoop2.2.0 as a default version of hadoop-2.2.


> Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version
> ---
>
> Key: SPARK-3787
> URL: https://issues.apache.org/jira/browse/SPARK-3787
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>
> When we build with sbt with profile for hadoop and without property for 
> hadoop version like:
> {code}
> sbt/sbt -Phadoop-2.2 assembly
> {code}
> jar name is always used default version (1.0.4).
> When we build with maven with same condition for sbt, default version for 
> each profile is used.
> For instance, if we  build like:
> {code}
> mvn -Phadoop-2.2 package
> {code}
> jar name is used hadoop2.2.0 as a default version of hadoop-2.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158614#comment-14158614
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

[~sandyr]

Indeed YARN is a _resource manager_ that supports multiple execution 
environments by helping with resource allocation and management. On the other 
hand, Spark, Tez and many other (custom) execution environments are currently 
run on YARN. (NOTE: Custom execution environments on YARN are becoming very 
common in large enterprises). Such decoupling will ensure that Spark can 
integrate with any and all (where applicable) in a pluggable and extensible 
fashion. 

> Decouple Spark's API from its execution engine
> --
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Oleg Zhurakousky
>  Labels: features
> Fix For: 1.2.0
>
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark provides integration with external resource-managers such as 
> Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
> current architecture of Spark-on-YARN can be enhanced to provide 
> significantly better utilization of cluster resources for large scale, batch 
> and/or ETL applications when run alongside other applications (Spark and 
> others) and services in YARN. 
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> - a gateway and a delegate to Hadoop execution environment - as a non-public 
> api (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method directly maps to the corresponding methods in current version of 
> SparkContext. JobExecutionContext implementation will be accessed by 
> SparkContext via master URL as 
> "execution-context:foo.bar.MyJobExecutionContext" with default implementation 
> containing the existing code from SparkContext, thus allowing current 
> (corresponding) methods of SparkContext to delegate to such implementation. 
> An integrator will now have an option to provide custom implementation of 
> DefaultExecutionContext by either implementing it from scratch or extending 
> from DefaultExecutionContext.
> Please see the attached design doc for more details.
> Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158614#comment-14158614
 ] 

Oleg Zhurakousky edited comment on SPARK-3561 at 10/3/14 10:34 PM:
---

[~sandyr]

Indeed YARN is a _resource manager_ that supports multiple execution 
environments by facilitating resource allocation and management. On the other 
hand, Spark, Tez and many other (custom) execution environments are currently 
run on YARN. (NOTE: Custom execution environments on YARN are becoming very 
common in large enterprises). Such decoupling will ensure that Spark can 
integrate with any and all (where applicable) in a pluggable and extensible 
fashion. 


was (Author: ozhurakousky):
[~sandyr]

Indeed YARN is a _resource manager_ that supports multiple execution 
environments by helping with resource allocation and management. On the other 
hand, Spark, Tez and many other (custom) execution environments are currently 
run on YARN. (NOTE: Custom execution environments on YARN are becoming very 
common in large enterprises). Such decoupling will ensure that Spark can 
integrate with any and all (where applicable) in a pluggable and extensible 
fashion. 

> Decouple Spark's API from its execution engine
> --
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Oleg Zhurakousky
>  Labels: features
> Fix For: 1.2.0
>
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark provides integration with external resource-managers such as 
> Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
> current architecture of Spark-on-YARN can be enhanced to provide 
> significantly better utilization of cluster resources for large scale, batch 
> and/or ETL applications when run alongside other applications (Spark and 
> others) and services in YARN. 
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> - a gateway and a delegate to Hadoop execution environment - as a non-public 
> api (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method directly maps to the corresponding methods in current version of 
> SparkContext. JobExecutionContext implementation will be accessed by 
> SparkContext via master URL as 
> "execution-context:foo.bar.MyJobExecutionContext" with default implementation 
> containing the existing code from SparkContext, thus allowing current 
> (corresponding) methods of SparkContext to delegate to such implementation. 
> An integrator will now have an option to provide custom implementation of 
> DefaultExecutionContext by either implementing it from scratch or extending 
> from DefaultExecutionContext.
> Please see the attached design doc for more details.
> Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3561:
--
Description: 
Currently Spark's API is tightly coupled with its backend execution engine.   
It could be useful to provide a point of pluggability between the two to allow 
Spark to run on other DAG execution engines with similar distributed memory 
abstractions.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and delegate to the Hadoop execution environment - as a non-public 
API (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
"execution-context:foo.bar.MyJobExecutionContext", with the default 
implementation containing the existing code from SparkContext, thus allowing 
the current (corresponding) methods of SparkContext to delegate to it. An 
integrator will now have the option to provide a custom implementation, either 
written from scratch or extending DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well

  was:
Currently Spark provides integration with external resource-managers such as 
Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current 
architecture of Spark-on-YARN can be enhanced to provide significantly better 
utilization of cluster resources for large scale, batch and/or ETL applications 
when run alongside other applications (Spark and others) and services in YARN. 

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and delegate to the Hadoop execution environment - as a non-public 
API (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
"execution-context:foo.bar.MyJobExecutionContext", with the default 
implementation containing the existing code from SparkContext, thus allowing 
the current (corresponding) methods of SparkContext to delegate to it. An 
integrator will now have the option to provide a custom implementation, either 
written from scratch or extending DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well


> Decouple Spark's API from its execution engine
> --
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Oleg Zhurakousky
>  Labels: features
> Fix For: 1.2.0
>
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark's API is tightly coupled with its backend execution engine.   
> It could be useful to provide a point of pluggability between the two to 
> allow Spark to run on other DAG execution engines with similar distributed 
> memory abstractions.
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> - a gateway and delegate to the Hadoop execution environment - as a non-public 
> API (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method maps directly to the corresponding method in the current version 
> of SparkContext. The JobExecutionContext implementation will be accessed by 
> SparkContext via a master URL of the form 
> "execution-context:foo.bar.MyJobExecutionContext", with the default 
> implementation containing the existing code from SparkContext, thus allowing 
> the current (corresponding) methods of SparkContext to delegate to it. An 
> integrator will now have the option to provide a custom implementation, either 
> written from scratch or extending DefaultExecutionContext.
> Please see the attached design doc for more details.
> Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3561:
--
Description: 
Currently Spark's API is tightly coupled with its backend execution engine.   
It could be useful to provide a point of pluggability between the two to allow 
Spark to run on other DAG execution engines with similar distributed memory 
abstractions.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) 
as a non-public API (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
"execution-context:foo.bar.MyJobExecutionContext", with the default 
implementation containing the existing code from SparkContext, thus allowing 
the current (corresponding) methods of SparkContext to delegate to it. An 
integrator will now have the option to provide a custom implementation, either 
written from scratch or extending DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well

  was:
Currently Spark's API is tightly coupled with its backend execution engine.   
It could be useful to provide a point of pluggability between the two to allow 
Spark to run on other DAG execution engines with similar distributed memory 
abstractions.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) - 
a gateway and delegate to the Hadoop execution environment - as a non-public 
API (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
"execution-context:foo.bar.MyJobExecutionContext", with the default 
implementation containing the existing code from SparkContext, thus allowing 
the current (corresponding) methods of SparkContext to delegate to it. An 
integrator will now have the option to provide a custom implementation, either 
written from scratch or extending DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well


> Decouple Spark's API from its execution engine
> --
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Oleg Zhurakousky
>  Labels: features
> Fix For: 1.2.0
>
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark's API is tightly coupled with its backend execution engine.   
> It could be useful to provide a point of pluggability between the two to 
> allow Spark to run on other DAG execution engines with similar distributed 
> memory abstractions.
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> as a non-public API (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method maps directly to the corresponding method in the current version 
> of SparkContext. The JobExecutionContext implementation will be accessed by 
> SparkContext via a master URL of the form 
> "execution-context:foo.bar.MyJobExecutionContext", with the default 
> implementation containing the existing code from SparkContext, thus allowing 
> the current (corresponding) methods of SparkContext to delegate to it. An 
> integrator will now have the option to provide a custom implementation, either 
> written from scratch or extending DefaultExecutionContext.
> Please see the attached design doc for more details.
> Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3561:
--
Description: 
Currently Spark's user-facing API is tightly coupled with its backend execution 
engine.   It could be useful to provide a point of pluggability between the two 
to allow Spark to run on other DAG execution engines with similar distributed 
memory abstractions.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) 
as a non-public API (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
"execution-context:foo.bar.MyJobExecutionContext", with the default 
implementation containing the existing code from SparkContext, thus allowing 
the current (corresponding) methods of SparkContext to delegate to it. An 
integrator will now have the option to provide a custom implementation, either 
written from scratch or extending DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well

  was:
Currently Spark's API is tightly coupled with its backend execution engine.   
It could be useful to provide a point of pluggability between the two to allow 
Spark to run on other DAG execution engines with similar distributed memory 
abstractions.

Proposal:
The proposed approach would introduce a pluggable JobExecutionContext (trait) 
as a non-public API (@DeveloperAPI) not exposed to end users of Spark.
The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method maps directly to the corresponding method in the current version of 
SparkContext. The JobExecutionContext implementation will be accessed by 
SparkContext via a master URL of the form 
"execution-context:foo.bar.MyJobExecutionContext", with the default 
implementation containing the existing code from SparkContext, thus allowing 
the current (corresponding) methods of SparkContext to delegate to it. An 
integrator will now have the option to provide a custom implementation, either 
written from scratch or extending DefaultExecutionContext.

Please see the attached design doc for more details.
Pull Request will be posted shortly as well


> Decouple Spark's API from its execution engine
> --
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Oleg Zhurakousky
>  Labels: features
> Fix For: 1.2.0
>
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark's user-facing API is tightly coupled with its backend 
> execution engine.   It could be useful to provide a point of pluggability 
> between the two to allow Spark to run on other DAG execution engines with 
> similar distributed memory abstractions.
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> as a non-public API (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method maps directly to the corresponding method in the current version 
> of SparkContext. The JobExecutionContext implementation will be accessed by 
> SparkContext via a master URL of the form 
> "execution-context:foo.bar.MyJobExecutionContext", with the default 
> implementation containing the existing code from SparkContext, thus allowing 
> the current (corresponding) methods of SparkContext to delegate to it. An 
> integrator will now have the option to provide a custom implementation, either 
> written from scratch or extending DefaultExecutionContext.
> Please see the attached design doc for more details.
> Pull Request will be posted shortly as well
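
For illustration only, a sketch of how a master URL of the form 
"execution-context:foo.bar.MyJobExecutionContext" could be recognized and the 
named implementation loaded reflectively, falling back to the default 
otherwise. The resolver object and the stand-in trait/class below are invented 
for the example; the real mechanism is whatever the design doc and pull request 
specify.

{code}
// Stand-ins for the proposed trait and its default implementation, so the
// sketch is self-contained; the real definitions would live in Spark core.
trait JobExecutionContext
class DefaultExecutionContext extends JobExecutionContext

object ExecutionContextResolver {
  private val Prefix = "execution-context:"

  // If the master URL carries the prefix, load the named class reflectively;
  // otherwise fall back to the default (i.e. today's SparkContext behavior).
  def resolve(master: String): JobExecutionContext =
    if (master.startsWith(Prefix)) {
      val className = master.stripPrefix(Prefix)
      Class.forName(className).newInstance().asInstanceOf[JobExecutionContext]
    } else {
      new DefaultExecutionContext
    }
}

// e.g. ExecutionContextResolver.resolve(
//   "execution-context:foo.bar.MyJobExecutionContext")
{code}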



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3561) Decouple Spark's API from its execution engine

2014-10-03 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158571#comment-14158571
 ] 

Sandy Ryza edited comment on SPARK-3561 at 10/3/14 11:00 PM:
-

I think there may be somewhat of a misunderstanding about the relationship 
between Spark and YARN.  YARN is not an "execution environment", but a cluster 
resource manager that has the ability to start processes on behalf of execution 
engines like Spark.  Spark already supports YARN as a cluster resource manager, 
but YARN doesn't provide its own execution engine.  YARN doesn't provide a 
stateless shuffle (although execution engines built atop it like MR and Tez 
do). 

If I understand correctly, the broader intent is to decouple the Spark API from 
the execution engine it runs on top of.  Changing the title to reflect this.  
That said, the Spark API is currently very tightly integrated with its execution 
engine, and frankly, decoupling the two so that Spark would be able to run on 
top of execution engines with similar properties seems more trouble than it's 
worth.


was (Author: sandyr):
I think there may be somewhat of a misunderstanding about the relationship 
between Spark and YARN.  YARN is not an "execution environment", but a cluster 
resource manager that has the ability to start processes on behalf of execution 
engines like Spark.  Spark already supports YARN as a cluster resource manager, 
but YARN doesn't provide its own execution engine.  YARN doesn't provide a 
stateless shuffle (although execution engines built atop it like MR and Tez 
do). 

If I understand, the broader intent is to decouple the Spark API from the 
execution engine it runs on top of.  Changing the title to reflect this.  That, 
the Spark API is currently very tightly integrated with its execution engine, 
and frankly, decoupling the two so that Spark would be able to run on top of 
execution engines with similar properties seems more trouble than its worth.

> Decouple Spark's API from its execution engine
> --
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Oleg Zhurakousky
>  Labels: features
> Fix For: 1.2.0
>
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark's user-facing API is tightly coupled with its backend 
> execution engine.   It could be useful to provide a point of pluggability 
> between the two to allow Spark to run on other DAG execution engines with 
> similar distributed memory abstractions.
> Proposal:
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> as a non-public API (@DeveloperAPI) not exposed to end users of Spark.
> The trait will define only 4 operations:
> * hadoopFile
> * newAPIHadoopFile
> * broadcast
> * runJob
> Each method maps directly to the corresponding method in the current version 
> of SparkContext. The JobExecutionContext implementation will be accessed by 
> SparkContext via a master URL of the form 
> "execution-context:foo.bar.MyJobExecutionContext", with the default 
> implementation containing the existing code from SparkContext, thus allowing 
> the current (corresponding) methods of SparkContext to delegate to it. An 
> integrator will now have the option to provide a custom implementation, either 
> written from scratch or extending DefaultExecutionContext.
> Please see the attached design doc for more details.
> Pull Request will be posted shortly as well



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3788) Yarn dist cache code is not friendly to HDFS HA, Federation

2014-10-03 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-3788:
-

 Summary: Yarn dist cache code is not friendly to HDFS HA, 
Federation
 Key: SPARK-3788
 URL: https://issues.apache.org/jira/browse/SPARK-3788
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Marcelo Vanzin


There are two bugs here.

1. The {{compareFs()}} method in ClientBase considers the 'host' part of the 
URI to be an actual host. In the case of HA and Federation, that's a namespace 
name, which doesn't resolve to anything. So in those cases, {{compareFs()}} 
always says the file systems are different.

2. In {{prepareLocalResources()}}, when adding a file to the distributed cache, 
that is done with the common FileSystem object instantiated at the start of the 
method. In the case of Federation that doesn't work: the qualified URL's scheme 
may differ from the non-qualified one, so the FileSystem instance will not work.

Fixes are pretty trivial.
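
To make point 1 concrete, a rough sketch of the idea behind a possible fix: 
compare the scheme and authority of the two URIs textually instead of resolving 
the authority through DNS, so an HA/Federation namespace name such as 
{{hdfs://nameservice1}} still compares correctly. This is an assumption for 
illustration, not the actual {{compareFs()}} code or the fix in the pull 
requests.

{code}
import java.net.URI

// Sketch only: compare scheme and authority textually. Resolving the
// authority as a hostname breaks for HA/Federation, where it is a logical
// namespace name (e.g. "nameservice1") that does not resolve in DNS.
object FsCompareSketch {
  def sameFileSystem(a: URI, b: URI): Boolean = {
    val sameScheme = Option(a.getScheme).getOrElse("")
      .equalsIgnoreCase(Option(b.getScheme).getOrElse(""))
    val sameAuthority = Option(a.getAuthority).getOrElse("") ==
      Option(b.getAuthority).getOrElse("")
    sameScheme && sameAuthority
  }
}

// FsCompareSketch.sameFileSystem(
//   new URI("hdfs://nameservice1/user/foo"),
//   new URI("hdfs://nameservice1/tmp/jars"))   // true, no DNS lookup needed
{code}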



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-10-03 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158646#comment-14158646
 ] 

Andrew Ash commented on SPARK-1860:
---

[~ilikerps] this ticket mentioned turning the cleanup code on by default once it 
was fixed.  Should we now change the defaults so that cleanup is enabled by 
default?

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> With its default values, the standalone worker cleanup code cleans up all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executors' log/data folders should not be cleaned up while they're still 
> running. Until that is fixed, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3788) Yarn dist cache code is not friendly to HDFS HA, Federation

2014-10-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158657#comment-14158657
 ] 

Marcelo Vanzin commented on SPARK-3788:
---

Note: "2" above only applies to branch-1.1. It was fixed in master by 
https://github.com/apache/spark/commit/c4022dd5.

> Yarn dist cache code is not friendly to HDFS HA, Federation
> ---
>
> Key: SPARK-3788
> URL: https://issues.apache.org/jira/browse/SPARK-3788
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>
> There are two bugs here.
> 1. The {{compareFs()}} method in ClientBase considers the 'host' part of the 
> URI to be an actual host. In the case of HA and Federation, that's a 
> namespace name, which doesn't resolve to anything. So in those cases, 
> {{compareFs()}} always says the file systems are different.
> 2. In {{prepareLocalResources()}}, when adding a file to the distributed 
> cache, that is done with the common FileSystem object instantiated at the 
> start of the method. In the case of Federation that doesn't work: the 
> qualified URL's scheme may differ from the non-qualified one, so the 
> FileSystem instance will not work.
> Fixes are pretty trivial.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3788) Yarn dist cache code is not friendly to HDFS HA, Federation

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158707#comment-14158707
 ] 

Apache Spark commented on SPARK-3788:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2649

> Yarn dist cache code is not friendly to HDFS HA, Federation
> ---
>
> Key: SPARK-3788
> URL: https://issues.apache.org/jira/browse/SPARK-3788
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>
> There are two bugs here.
> 1. The {{compareFs()}} method in ClientBase considers the 'host' part of the 
> URI to be an actual host. In the case of HA and Federation, that's a 
> namespace name, which doesn't resolve to anything. So in those cases, 
> {{compareFs()}} always says the file systems are different.
> 2. In {{prepareLocalResources()}}, when adding a file to the distributed 
> cache, that is done with the common FileSystem object instantiated at the 
> start of the method. In the case of Federation that doesn't work: the 
> qualified URL's scheme may differ from the non-qualified one, so the 
> FileSystem instance will not work.
> Fixes are pretty trivial.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3788) Yarn dist cache code is not friendly to HDFS HA, Federation

2014-10-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158710#comment-14158710
 ] 

Marcelo Vanzin commented on SPARK-3788:
---

Ah, "2" was fixed in branch-1.1 as part of SPARK-2577. So only issue 1 remains.

> Yarn dist cache code is not friendly to HDFS HA, Federation
> ---
>
> Key: SPARK-3788
> URL: https://issues.apache.org/jira/browse/SPARK-3788
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>
> There are two bugs here.
> 1. The {{compareFs()}} method in ClientBase considers the 'host' part of the 
> URI to be an actual host. In the case of HA and Federation, that's a 
> namespace name, which doesn't resolve to anything. So in those cases, 
> {{compareFs()}} always says the file systems are different.
> 2. In {{prepareLocalResources()}}, when adding a file to the distributed 
> cache, that is done with the common FileSystem object instantiated at the 
> start of the method. In the case of Federation that doesn't work: the 
> qualified URL's scheme may differ from the non-qualified one, so the 
> FileSystem instance will not work.
> Fixes are pretty trivial.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3789) Python bindings for GraphX

2014-10-03 Thread Ameet Talwalkar (JIRA)
Ameet Talwalkar created SPARK-3789:
--

 Summary: Python bindings for GraphX
 Key: SPARK-3789
 URL: https://issues.apache.org/jira/browse/SPARK-3789
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, PySpark
Reporter: Ameet Talwalkar






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3788) Yarn dist cache code is not friendly to HDFS HA, Federation

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158725#comment-14158725
 ] 

Apache Spark commented on SPARK-3788:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2650

> Yarn dist cache code is not friendly to HDFS HA, Federation
> ---
>
> Key: SPARK-3788
> URL: https://issues.apache.org/jira/browse/SPARK-3788
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>
> There are two bugs here.
> 1. The {{compareFs()}} method in ClientBase considers the 'host' part of the 
> URI to be an actual host. In the case of HA and Federation, that's a 
> namespace name, which doesn't resolve to anything. So in those cases, 
> {{compareFs()}} always says the file systems are different.
> 2. In {{prepareLocalResources()}}, when adding a file to the distributed 
> cache, that is done with the common FileSystem object instantiated at the 
> start of the method. In the case of Federation that doesn't work: the 
> qualified URL's scheme may differ from the non-qualified one, so the 
> FileSystem instance will not work.
> Fixes are pretty trivial.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3314) Script creation of AMIs

2014-10-03 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158777#comment-14158777
 ] 

Nicholas Chammas commented on SPARK-3314:
-

Hey [~holdenk], I think this is a great issue to work on. There was a related 
discussion on the dev list about using [Packer|http://www.packer.io/] to do 
this. I will be looking into this option and will report back here.

> Script creation of AMIs
> ---
>
> Key: SPARK-3314
> URL: https://issues.apache.org/jira/browse/SPARK-3314
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: holdenk
>Priority: Minor
>
> The current Spark AMIs have been built up over time. It would be useful to 
> provide a script which can be used to bootstrap from a fresh Amazon AMI. We 
> could also update the AMIs in the project at the same time to be based on a 
> newer version so we don't have to wait so long for the security updates to be 
> installed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number

2014-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158865#comment-14158865
 ] 

Apache Spark commented on SPARK-3772:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2651

> RDD operation on IPython REPL failed with an illegal port number
> 
>
> Key: SPARK-3772
> URL: https://issues.apache.org/jira/browse/SPARK-3772
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0
>Reporter: cocoatomo
>  Labels: pyspark
>
> To reproduce this issue, execute the following commands at commit 
> 6e27cb630de69fa5acb510b4e2f6b980742b1957.
> {quote}
> $ PYSPARK_PYTHON=ipython ./bin/pyspark
> ...
> In [1]: file = sc.textFile('README.md')
> In [2]: file.first()
> ...
> 14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded
> 14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1
> 14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at 
> PythonRDD.scala:334
> 14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at 
> PythonRDD.scala:334) with 1 output partitions (allowLocal=true)
> 14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at 
> PythonRDD.scala:334)
> 14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List()
> 14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List()
> 14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD 
> at PythonRDD.scala:44), which has no missing parents
> 14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with 
> curMem=57388, maxMem=278019440
> 14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in 
> memory (estimated size 4.4 KB, free 265.1 MB)
> 14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
> (PythonRDD[2] at RDD at PythonRDD.scala:44)
> 14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
> 14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
> localhost, PROCESS_LOCAL, 1207 bytes)
> 14/10/03 08:50:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 14/10/03 08:50:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.IllegalArgumentException: port out of range:1027423549
>   at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
>   at java.net.InetSocketAddress.<init>(InetSocketAddress.java:188)
>   at java.net.Socket.<init>(Socket.java:244)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
>   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:100)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:71)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:744)
> {quote}
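
One observation on the reported number (my reading, not stated in the ticket): 
1027423549 is 0x3D3D3D3D, i.e. the four ASCII bytes "====" read as a big-endian 
int, which is the kind of value produced when text written to the daemon's 
stdout is consumed where the worker port is expected. A spark-shell snippet to 
check the arithmetic:

{code}
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// 0x3D is '=': four '=' characters interpreted as one big-endian 32-bit int.
val bogusPort = ByteBuffer
  .wrap("====".getBytes(StandardCharsets.US_ASCII))
  .getInt()
assert(bogusPort == 1027423549)  // matches the "port out of range" value above
{code}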



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3434) Distributed block matrix

2014-10-03 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158949#comment-14158949
 ] 

Reza Zadeh commented on SPARK-3434:
---

Any updates Shivaraman?

> Distributed block matrix
> 
>
> Key: SPARK-3434
> URL: https://issues.apache.org/jira/browse/SPARK-3434
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>
> This JIRA is for discussing distributed matrices stored as block sub-matrices. 
> The main challenge is choosing a partitioning scheme that allows adding linear 
> algebra operations in the future, e.g.:
> 1. matrix multiplication
> 2. matrix factorization (QR, LU, ...)
> Let's discuss the partitioning and storage and how they fit into the above 
> use cases.
> Questions:
> 1. Should it be backed by a single RDD that contains all of the sub-matrices, 
> or by many RDDs, each containing only one sub-matrix?
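
To make question 1 concrete, a sketch of the single-RDD option, with block 
coordinates as keys and local matrices as values. The names and types below are 
assumptions for illustration, not a proposed API.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Matrices, Matrix}
import org.apache.spark.rdd.RDD

object BlockMatrixSketch {
  // Option 1: a single RDD holding every sub-matrix, keyed by its
  // (blockRow, blockCol) coordinates in the block grid.
  type BlockMatrixRDD = RDD[((Int, Int), Matrix)]

  def tinyExample(sc: SparkContext): BlockMatrixRDD = {
    // A 1x1 grid containing one dense 2x2 identity block (column-major
    // values), just to show the shape of the data.
    sc.parallelize(Seq(((0, 0), Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0)))))
  }
}
{code}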



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3790) CosineSimilarity via DIMSUM example

2014-10-03 Thread Reza Zadeh (JIRA)
Reza Zadeh created SPARK-3790:
-

 Summary: CosineSimilarity via DIMSUM example
 Key: SPARK-3790
 URL: https://issues.apache.org/jira/browse/SPARK-3790
 Project: Spark
  Issue Type: Improvement
Reporter: Reza Zadeh


Create an example that reports the approximation error of DIMSUM for an 
arbitrary RowMatrix supplied via the command line.

PR tracking this:
https://github.com/apache/spark/pull/2622
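
A rough sketch of the kind of comparison such an example could make, assuming 
the DIMSUM-based RowMatrix.columnSimilarities(threshold) API the PR builds on; 
the real example is in the pull request above. It computes column similarities 
exactly and with a threshold, then reports a mean absolute error over the exact 
entries.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

object CosineSimilaritySketch {
  // Mean absolute difference between exact and DIMSUM-sampled column
  // similarities for a small hard-coded RowMatrix.
  def approximationError(sc: SparkContext, threshold: Double): Double = {
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 9.0)))
    val mat = new RowMatrix(rows)

    val exact = mat.columnSimilarities()            // brute force
    val approx = mat.columnSimilarities(threshold)  // DIMSUM sampling

    val exactEntries = exact.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }
    val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }

    // Entries dropped by the sampling count as 0.0 on the approximate side.
    exactEntries.leftOuterJoin(approxEntries).values
      .map { case (e, a) => math.abs(e - a.getOrElse(0.0)) }
      .mean()
  }
}
{code}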



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org