[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU

2014-10-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163125#comment-14163125
 ] 

Sean Owen commented on SPARK-3785:
--

Sure, but a GPU isn't going to be good at general map, filter, reduce, or groupBy 
operations. It can't run arbitrary functions the way the JVM can. I wonder how many 
use cases actually contain enough computation that can be specialized for the GPU, 
chained together, to make the GPU worth it. My suspicion is still 
that there are really only a few wins for this use case, but that they are achievable 
by just calling the GPU from Java code. I'd love to see evidence that this is in fact 
a way to transparently speed up a non-trivial slice of mainstream Spark use 
cases, though.
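
For concreteness, here is a rough Scala sketch of what "calling the GPU from Java 
code" inside an ordinary Spark job could look like: batch up each partition and hand 
it to a native kernel. NativeGpuKernel and sc (a SparkContext) are placeholders for 
this illustration, not anything that exists in Spark or in the bindings linked below; 
a real version would wrap JOCL/JavaCL/LWJGL calls behind that object.

{code}
// Hypothetical sketch only: NativeGpuKernel stands in for a JNI/OpenCL wrapper.
object NativeGpuKernel {
  // Pretend this ships the batch to the device; here it just squares on the CPU.
  def square(batch: Array[Float]): Array[Float] = batch.map(x => x * x)
}

val squared = sc.parallelize(1 to 1000000).map(_.toFloat)
  .mapPartitions { iter =>
    // Materialize the partition so the "kernel" sees one large batch,
    // which is where a GPU would actually pay off.
    val batch = iter.toArray
    NativeGpuKernel.square(batch).iterator
  }
{code}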

> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to add support for off-loading computations to the 
> GPU, e.g. via an OpenCL binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL






[jira] [Resolved] (SPARK-3836) Spark REPL optionally propagate internal exceptions

2014-10-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3836.

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Ahir Reddy

https://github.com/apache/spark/pull/2695

> Spark REPL optionally propagate internal exceptions 
> 
>
> Key: SPARK-3836
> URL: https://issues.apache.org/jira/browse/SPARK-3836
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ahir Reddy
>Assignee: Ahir Reddy
>Priority: Minor
> Fix For: 1.2.0
>
>
> Optionally have the REPL throw exceptions generated by interpreted code, 
> instead of swallowing the exception and returning it as text output. This is 
> useful when embedding the REPL; otherwise it's not possible to know when user 
> code threw an exception.
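
As a toy illustration of the behavior being requested (this is not the REPL's actual 
internals, and propagateExceptions is a made-up flag name for this sketch): the 
embedding application either gets the real exception or only its textual rendering.

{code}
// Hypothetical sketch, not SparkILoop code.
def runInterpreted(block: => Unit, propagateExceptions: Boolean): String =
  try {
    block
    "ok"
  } catch {
    case e: Throwable if propagateExceptions => throw e      // embedder can catch it
    case e: Throwable                        => e.toString   // current behavior: text only
  }
{code}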






[jira] [Updated] (SPARK-3807) SparkSql does not work for tables created using custom serde

2014-10-07 Thread chirag aggarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chirag aggarwal updated SPARK-3807:
---
Description: 
SparkSQL crashes when selecting from tables that use a custom SerDe. 

Example:


CREATE EXTERNAL TABLE table_name PARTITIONED BY (a int) ROW FORMAT SERDE 
"org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer" with 
serdeproperties("serialization.format"="org.apache.thrift.protocol.TBinaryProtocol","serialization.class"="ser_class")
 STORED AS SEQUENCEFILE;

The following exception is seen when running a query like 'select * from 
table_name limit 1': 

ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: 
java.lang.NullPointerException 
at 
org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68)
 
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80) 
at 
org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86)
 
at 
org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:100)
 
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
 
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
 
at 
org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
 
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
 
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) 
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
 
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
 
at 
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:280)
 
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
 
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) 
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
 
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
 
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400) 
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
 
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
 
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406)
 
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
 
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
 
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413) 
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
 
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) 
at java.lang.reflect.Method.invoke(Unknown Source) 
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) 
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) 
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
Caused by: java.lang.NullPointerException


After fixing this issue, when some columns in the table were referenced in the 
query, SparkSQL could not resolve those references.

  was:
SparkSql crashes on selecting tables using custom serde. 

Example:


CREATE EXTERNAL TABLE table_name PARTITIONED BY ( a int) ROW FORMAT 'SERDE 
"org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer" with 
serdeproperties("serialization.format"="org.apache.thrift.protocol.TBinaryProtocol","serialization.class"="ser_class")
 STORED AS SEQUENCEFILE;

The following exception is seen on running a query like 'select * from 
table_name limit 1': 

ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: 
java.lang.NullPointerException 
at 
org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68)
 
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80) 
at 
org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86)
 
at 
org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:100)
 
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(Hi

[jira] [Commented] (SPARK-2811) update algebird to 0.8.1

2014-10-07 Thread Dan Di Spaltro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163034#comment-14163034
 ] 

Dan Di Spaltro commented on SPARK-2811:
---

Looks like algebird_2.11 artifacts are on maven central.

http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22algebird_2.11%22

> update algebird to 0.8.1
> 
>
> Key: SPARK-2811
> URL: https://issues.apache.org/jira/browse/SPARK-2811
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Spark Core
>Reporter: Anand Avati
>
> First algebird_2.11 0.8.1 has to be released






[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163027#comment-14163027
 ] 

Apache Spark commented on SPARK-2426:
-

User 'debasish83' has created a pull request for this issue:
https://github.com/apache/spark/pull/2705

> Quadratic Minimization for MLlib ALS
> 
>
> Key: SPARK-2426
> URL: https://issues.apache.org/jira/browse/SPARK-2426
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Debasish Das
>Assignee: Debasish Das
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Current ALS supports least squares and nonnegative least squares.
> I presented ADMM- and IPM-based quadratic minimization solvers to be used for 
> the following ALS problems:
> 1. ALS with bounds
> 2. ALS with L1 regularization
> 3. ALS with equality constraints and bounds
> Initial runtime comparisons were presented at Spark Summit: 
> http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
> Based on Xiangrui's feedback, I am currently comparing the ADMM-based 
> quadratic minimization solvers with IPM-based QP solvers and the default 
> ALS/NNLS. I will keep updating the runtime comparison results.
> For integration, the detailed plan is as follows:
> 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
> 2. Integrate QuadraticMinimizer in mllib ALS
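
For reference, the three constrained variants listed above all reduce to a quadratic 
program per user/item block of roughly the following shape. This is my paraphrase of 
the standard ALS subproblem for context, not a formula taken from the design doc:

{code}
% One block of the ALS update written as a QP (set \gamma = 0 to drop the L1 term):
\min_{x}\; \tfrac{1}{2}\, x^\top (Y^\top Y + \lambda I)\, x \;-\; (Y^\top r)^\top x \;+\; \gamma\,\|x\|_1
\quad \text{s.t.}\quad l \le x \le u
\quad \text{or} \quad \mathbf{1}^\top x = 1,\; x \ge 0
{code}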






[jira] [Commented] (SPARK-3781) code style format

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163018#comment-14163018
 ] 

Apache Spark commented on SPARK-3781:
-

User 'shijinkui' has created a pull request for this issue:
https://github.com/apache/spark/pull/2704

> code style format
> -
>
> Key: SPARK-3781
> URL: https://issues.apache.org/jira/browse/SPARK-3781
> Project: Spark
>  Issue Type: Improvement
>Reporter: sjk
>







[jira] [Resolved] (SPARK-1656) Potential resource leak in HttpBroadcast, SparkSubmitArguments, FileSystemPersistenceEngine and DiskStore

2014-10-07 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-1656.
-
   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1

Already merged into master and branch-1.1

> Potential resource leak in HttpBroadcast, SparkSubmitArguments, 
> FileSystemPersistenceEngine and DiskStore
> -
>
> Key: SPARK-1656
> URL: https://issues.apache.org/jira/browse/SPARK-1656
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>  Labels: easyfix
> Fix For: 1.1.1, 1.2.0
>
>
> Again... I'm trying to review all `close` statements to find such issues.
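
The pattern behind these fixes is the usual close-in-finally idiom; a minimal 
standalone Scala example of it (illustrative only, not code from the patch itself):

{code}
import java.io.{File, FileOutputStream}

// Close the stream even when the write throws, so the descriptor isn't leaked.
def writeSafely(file: File, bytes: Array[Byte]): Unit = {
  val out = new FileOutputStream(file)
  try {
    out.write(bytes)
  } finally {
    out.close()
  }
}
{code}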






[jira] [Commented] (SPARK-3808) PySpark fails to start in Windows

2014-10-07 Thread Masayoshi TSUZUKI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162935#comment-14162935
 ] 

Masayoshi TSUZUKI commented on SPARK-3808:
--

I verified and it works well! Thank you [~andrewor14]

> PySpark fails to start in Windows
> -
>
> Key: SPARK-3808
> URL: https://issues.apache.org/jira/browse/SPARK-3808
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 1.2.0
> Environment: Windows
>Reporter: Masayoshi TSUZUKI
>Assignee: Masayoshi TSUZUKI
>Priority: Blocker
> Fix For: 1.2.0
>
>
> When we execute bin\pyspark.cmd on Windows, it fails to start.
> We get the following messages.
> {noformat}
> C:\>bin\pyspark.cmd
> Running C:\\python.exe with 
> PYTHONPATH=C:\\bin\..\python\lib\py4j-0.8.2.1-src.zip;C:\\bin\..\python;
> Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on 
> win32
> Type "help", "copyright", "credits" or "license" for more information.
> ="x" was unexpected at this time.
> Traceback (most recent call last):
>   File "C:\\bin\..\python\pyspark\shell.py", line 45, in 
> sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
>   File "C:\\python\pyspark\context.py", line 103, in __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
>   File "C:\\python\pyspark\context.py", line 212, in _ensure_initialized
> SparkContext._gateway = gateway or launch_gateway()
>   File "C:\\python\pyspark\java_gateway.py", line 71, in launch_gateway
> raise Exception(error_msg)
> Exception: Launching GatewayServer failed with exit code 255!
> Warning: Expected GatewayServer to output a port, but found no output.
> >>>
> {noformat}






[jira] [Commented] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()

2014-10-07 Thread Xu Zhongxing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162913#comment-14162913
 ] 

Xu Zhongxing commented on SPARK-3005:
-

Resolved in https://github.com/apache/spark/pull/2453
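
For context, the default SchedulerBackend.killTask simply throws 
UnsupportedOperationException (as the stack trace below shows), so the fix amounts to 
giving the Mesos backend a real implementation that forwards the request to the Mesos 
driver. A rough sketch of that idea, assuming the standard Mesos Java API; this is not 
the merged patch itself:

{code}
import org.apache.mesos.Protos.TaskID
import org.apache.mesos.SchedulerDriver

// Illustrative only: forward a Spark task id to Mesos instead of inheriting
// SchedulerBackend's default killTask, which throws UnsupportedOperationException.
class MesosTaskKiller(driver: SchedulerDriver) {
  def killTask(sparkTaskId: Long): Unit =
    driver.killTask(TaskID.newBuilder().setValue(sparkTaskId.toString).build())
}
{code}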

> Spark with Mesos fine-grained mode throws UnsupportedOperationException in 
> MesosSchedulerBackend.killTask()
> ---
>
> Key: SPARK-3005
> URL: https://issues.apache.org/jira/browse/SPARK-3005
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
> Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector
>Reporter: Xu Zhongxing
> Attachments: SPARK-3005_1.diff
>
>
> I am using Spark, Mesos, and spark-cassandra-connector to do some work on a 
> Cassandra cluster.
> While the job was running, I killed the Cassandra daemon to simulate some 
> failure cases. This results in task failures.
> If I run the job in Mesos coarse-grained mode, the Spark driver program 
> throws an exception and shuts down cleanly.
> But when I run the job in Mesos fine-grained mode, the Spark driver program 
> hangs.
> The spark log is: 
> {code}
>  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 
> Logging.scala (line 58) Cancelling stage 1
>  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 
> Logging.scala (line 79) Could not cancel tasks for stage 1
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scal

[jira] [Closed] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()

2014-10-07 Thread Xu Zhongxing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xu Zhongxing closed SPARK-3005.
---
Resolution: Fixed

> Spark with Mesos fine-grained mode throws UnsupportedOperationException in 
> MesosSchedulerBackend.killTask()
> ---
>
> Key: SPARK-3005
> URL: https://issues.apache.org/jira/browse/SPARK-3005
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
> Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector
>Reporter: Xu Zhongxing
> Attachments: SPARK-3005_1.diff
>
>
> I am using Spark, Mesos, and spark-cassandra-connector to do some work on a 
> Cassandra cluster.
> While the job was running, I killed the Cassandra daemon to simulate some 
> failure cases. This results in task failures.
> If I run the job in Mesos coarse-grained mode, the Spark driver program 
> throws an exception and shuts down cleanly.
> But when I run the job in Mesos fine-grained mode, the Spark driver program 
> hangs.
> The spark log is: 
> {code}
>  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 
> Logging.scala (line 58) Cancelling stage 1
>  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 
> Logging.scala (line 79) Could not cancel tasks for stage 1
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.con

[jira] [Issue Comment Deleted] (SPARK-3820) Specialize columnSimilarity() without any threshold

2014-10-07 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh updated SPARK-3820:
--
Comment: was deleted

(was: See previous comment on resolution.)

> Specialize columnSimilarity() without any threshold
> ---
>
> Key: SPARK-3820
> URL: https://issues.apache.org/jira/browse/SPARK-3820
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Reza Zadeh
>
> `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to 
> compute the exact cosine similarities. It still requires sampling, which is 
> unnecessary for this case. We should have a specialized version for it, in 
> order to have a fair comparison with DIMSUM.






[jira] [Resolved] (SPARK-3820) Specialize columnSimilarity() without any threshold

2014-10-07 Thread Reza Zadeh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Zadeh resolved SPARK-3820.
---
Resolution: Fixed

See previous comment on resolution.

> Specialize columnSimilarity() without any threshold
> ---
>
> Key: SPARK-3820
> URL: https://issues.apache.org/jira/browse/SPARK-3820
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Reza Zadeh
>
> `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to 
> compute the exact cosine similarities. It still requires sampling, which is 
> unnecessary for this case. We should have a specialized version for it, in 
> order to have a fair comparison with DIMSUM.






[jira] [Commented] (SPARK-3820) Specialize columnSimilarity() without any threshold

2014-10-07 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162899#comment-14162899
 ] 

Reza Zadeh commented on SPARK-3820:
---

I ran columnSimilarities(0.0) with the random number generation commented out 
and then uncommented, and didn't observe any difference in the completion time 
of the mapPartitionsWithIndex stage. 

> Specialize columnSimilarity() without any threshold
> ---
>
> Key: SPARK-3820
> URL: https://issues.apache.org/jira/browse/SPARK-3820
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Reza Zadeh
>
> `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to 
> compute the exact cosine similarities. It still requires sampling, which is 
> unnecessary for this case. We should have a specialized version for it, in 
> order to have a fair comparison with DIMSUM.






[jira] [Resolved] (SPARK-3412) Add Missing Types for Row API

2014-10-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3412.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2689
[https://github.com/apache/spark/pull/2689]

> Add Missing Types for Row API
> -
>
> Key: SPARK-3412
> URL: https://issues.apache.org/jira/browse/SPARK-3412
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Minor
> Fix For: 1.2.0
>
>







[jira] [Commented] (SPARK-3843) Cleanup scalastyle.txt at the end of running dev/scalastyle

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162867#comment-14162867
 ] 

Apache Spark commented on SPARK-3843:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2702

> Cleanup scalastyle.txt at the end of running dev/scalastyle
> ---
>
> Key: SPARK-3843
> URL: https://issues.apache.org/jira/browse/SPARK-3843
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>Priority: Trivial
>
> dev/scalastyle creates a log file 'scalastyle.txt'. It is overwritten on each 
> run but never deleted, even though dev/mima and dev/lint-python delete 
> their log files.






[jira] [Created] (SPARK-3843) Cleanup scalastyle.txt at the end of running dev/scalastyle

2014-10-07 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3843:
-

 Summary: Cleanup scalastyle.txt at the end of running 
dev/scalastyle
 Key: SPARK-3843
 URL: https://issues.apache.org/jira/browse/SPARK-3843
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Priority: Trivial


dev/scalastyle creates a log file 'scalastyle.txt'. It is overwritten on each run 
but never deleted, even though dev/mima and dev/lint-python delete their log 
files.






[jira] [Commented] (SPARK-3569) Add metadata field to StructField

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162853#comment-14162853
 ] 

Apache Spark commented on SPARK-3569:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/2701

> Add metadata field to StructField
> -
>
> Key: SPARK-3569
> URL: https://issues.apache.org/jira/browse/SPARK-3569
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, SQL
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We want to add a metadata field to StructField that can be used by other 
> applications like ML to embed more information about the column.
> {code}
> case class StructField(name: String, dataType: DataType, nullable: 
> Boolean, metadata: Map[String, Any] = Map.empty)
> {code}
> For ML, we can store feature information like categorical/continuous, the number 
> of categories, the category-to-index map, etc.
> One question is how to carry the metadata over in query execution. For 
> example:
> {code}
> val features = schemaRDD.select('features)
> val featuresDesc = features.schema('features).metadata
> {code}
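
A self-contained toy of how that proposed field could be used. The StructField below 
is a local stand-in mirroring the proposed signature from the description, not the 
existing Spark SQL class, and the metadata keys are made up for illustration:

{code}
sealed trait DataType
case object DoubleType extends DataType

// Local stand-in mirroring the proposed signature above.
case class StructField(
    name: String,
    dataType: DataType,
    nullable: Boolean,
    metadata: Map[String, Any] = Map.empty)

val features = StructField(
  "features", DoubleType, nullable = false,
  metadata = Map("featureType" -> "categorical", "numCategories" -> 5))

// Downstream code (e.g. ML) reads the column description back off the field:
val numCategories = features.metadata("numCategories")
{code}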






[jira] [Created] (SPARK-3842) Remove the hacks for Python callback server in py4j

2014-10-07 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3842:
-

 Summary: Remove the hacks for Python callback server in py4j 
 Key: SPARK-3842
 URL: https://issues.apache.org/jira/browse/SPARK-3842
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Streaming
Reporter: Davies Liu
Priority: Minor


There are three hacks used while creating the Python API for Streaming 
(https://github.com/apache/spark/pull/2538):

1. Daemonize the callback server thread by setting 'thread.daemon = True' before 
starting it. https://github.com/bartdag/py4j/issues/147
2. Let the callback server bind to a random port, then update the Java callback 
client with the real port. https://github.com/bartdag/py4j/issues/148
3. Start the callback server later. https://github.com/bartdag/py4j/issues/149

These hacks should be removed after py4j has fixed these issues.






[jira] [Closed] (SPARK-3829) Make Spark logo image on the header of HistoryPage as a link to HistoryPage's page #1

2014-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3829.

   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1
 Assignee: Kousuke Saruta

> Make Spark logo image on the header of HistoryPage as a link to HistoryPage's 
> page #1
> -
>
> Key: SPARK-3829
> URL: https://issues.apache.org/jira/browse/SPARK-3829
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 1.1.1, 1.2.0
>
>
> There is a Spark logo in the header of HistoryPage.
> We can have too many history pages if we run 20+ applications, so I think 
> it would be useful if the logo were a link back to HistoryPage's page #1.






[jira] [Commented] (SPARK-3820) Specialize columnSimilarity() without any threshold

2014-10-07 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162814#comment-14162814
 ] 

Reza Zadeh commented on SPARK-3820:
---

I will run an experiment to see if the random number generation is adding 
significant overhead, and if it is, add a flag to avoid it when a threshold of 
zero is given.
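
For reference, this is roughly how the two code paths are exercised from user code, 
assuming the 1.2-era RowMatrix API (sc is an existing SparkContext):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0)))
val mat = new RowMatrix(rows)

val exact  = mat.columnSimilarities()     // currently routed through the threshold-0.0 path
val approx = mat.columnSimilarities(0.1)  // DIMSUM sampling with a real threshold
{code}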

> Specialize columnSimilarity() without any threshold
> ---
>
> Key: SPARK-3820
> URL: https://issues.apache.org/jira/browse/SPARK-3820
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Reza Zadeh
>
> `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to 
> compute the exact cosine similarities. It still requires sampling, which is 
> unnecessary for this case. We should have a specialized version for it, in 
> order to have a fair comparison with DIMSUM.






[jira] [Resolved] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3398.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2339
[https://github.com/apache/spark/pull/2339]

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.






[jira] [Resolved] (SPARK-3832) Upgrade Breeze dependency to 0.10

2014-10-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3832.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2693
[https://github.com/apache/spark/pull/2693]

> Upgrade Breeze dependency to 0.10
> -
>
> Key: SPARK-3832
> URL: https://issues.apache.org/jira/browse/SPARK-3832
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
> Fix For: 1.2.0
>
>
> In Breeze 0.10, the L1regParam can be configured through an anonymous function 
> in OWLQN, so each component can be penalized differently. This is required 
> for GLMNET in MLlib with L1/L2 regularization. 
> https://github.com/scalanlp/breeze/commit/2570911026aa05aa1908ccf7370bc19cd8808a4c






[jira] [Updated] (SPARK-3832) Upgrade Breeze dependency to 0.10

2014-10-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3832:
-
Assignee: DB Tsai

> Upgrade Breeze dependency to 0.10
> -
>
> Key: SPARK-3832
> URL: https://issues.apache.org/jira/browse/SPARK-3832
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
> Fix For: 1.2.0
>
>
> In Breeze 0.10, the L1regParam can be configured through an anonymous function 
> in OWLQN, so each component can be penalized differently. This is required 
> for GLMNET in MLlib with L1/L2 regularization. 
> https://github.com/scalanlp/breeze/commit/2570911026aa05aa1908ccf7370bc19cd8808a4c






[jira] [Commented] (SPARK-3841) Pretty-print Params case classes for tests

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162795#comment-14162795
 ] 

Apache Spark commented on SPARK-3841:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/2700

> Pretty-print Params case classes for tests
> --
>
> Key: SPARK-3841
> URL: https://issues.apache.org/jira/browse/SPARK-3841
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Provide a parent class for the Params case classes used in many MLlib 
> examples, where the parent class pretty-prints the case class fields:
> Param1Name  Param1Value
> Param2Name  Param2Value
> ...
> Using this class will make it easier to print test settings to logs.






[jira] [Updated] (SPARK-3838) Python code example for Word2Vec in user guide

2014-10-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3838:
-
Assignee: (was: Liquan Pei)

> Python code example for Word2Vec in user guide
> --
>
> Key: SPARK-3838
> URL: https://issues.apache.org/jira/browse/SPARK-3838
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Priority: Trivial
>







[jira] [Resolved] (SPARK-3486) Add PySpark support for Word2Vec

2014-10-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3486.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2356
[https://github.com/apache/spark/pull/2356]

> Add PySpark support for Word2Vec
> 
>
> Key: SPARK-3486
> URL: https://issues.apache.org/jira/browse/SPARK-3486
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Affects Versions: 1.2.0
>Reporter: Liquan Pei
>Assignee: Liquan Pei
> Fix For: 1.2.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Add PySpark support for Word2Vec






[jira] [Created] (SPARK-3841) Pretty-print Params case classes for tests

2014-10-07 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-3841:


 Summary: Pretty-print Params case classes for tests
 Key: SPARK-3841
 URL: https://issues.apache.org/jira/browse/SPARK-3841
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor


Provide a parent class for the Params case classes used in many MLlib examples, 
where the parent class pretty-prints the case class fields:
Param1Name  Param1Type  Param1Value
Param2Name  Param2Type  Param2Value
...
Using this class will make it easier to print test settings to logs.
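
A minimal sketch of one way such a parent class could work, using plain reflection 
over the case class fields. This is illustrative only; the actual patch may do it 
differently, and it assumes JVM field order matches declaration order:

{code}
// Pretty-prints each case-class field as "name<TAB>value", one per line.
abstract class PrettyParams extends Product {
  override def toString: String =
    getClass.getDeclaredFields.map(_.getName)   // field names, assumed in declaration order
      .zip(productIterator.toSeq)               // paired with the field values
      .map { case (name, value) => s"$name\t$value" }
      .mkString("\n")
}

case class ExampleParams(numIterations: Int = 10, stepSize: Double = 0.1)
  extends PrettyParams

// println(ExampleParams()) prints:
//   numIterations  10
//   stepSize       0.1
{code}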







[jira] [Updated] (SPARK-3841) Pretty-print Params case classes for tests

2014-10-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3841:
-
Description: 
Provide a parent class for the Params case classes used in many MLlib examples, 
where the parent class pretty-prints the case class fields:
Param1Name  Param1Value
Param2Name  Param2Value
...
Using this class will make it easier to print test settings to logs.


  was:
Provide a parent class for the Params case classes used in many MLlib examples, 
where the parent class pretty-prints the case class fields:
Param1Name  Param1Type  Param1Value
Param2Name  Param2Type  Param2Value
...
Using this class will make it easier to print test settings to logs.



> Pretty-print Params case classes for tests
> --
>
> Key: SPARK-3841
> URL: https://issues.apache.org/jira/browse/SPARK-3841
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Provide a parent class for the Params case classes used in many MLlib 
> examples, where the parent class pretty-prints the case class fields:
> Param1Name  Param1Value
> Param2Name  Param2Value
> ...
> Using this class will make it easier to print test settings to logs.






[jira] [Resolved] (SPARK-3790) CosineSimilarity via DIMSUM example

2014-10-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3790.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2622
[https://github.com/apache/spark/pull/2622]

> CosineSimilarity via DIMSUM example
> ---
>
> Key: SPARK-3790
> URL: https://issues.apache.org/jira/browse/SPARK-3790
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reza Zadeh
>Assignee: Reza Zadeh
> Fix For: 1.2.0
>
>
> Create an example that gives the approximation error for DIMSUM using an 
> arbitrary RowMatrix given via the command line.
> PR tracking this:
> https://github.com/apache/spark/pull/2622






[jira] [Commented] (SPARK-3840) Spark EC2 templates fail when variables are missing

2014-10-07 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162783#comment-14162783
 ] 

Allan Douglas R. de Oliveira commented on SPARK-3840:
-

PR: https://github.com/mesos/spark-ec2/pull/74

> Spark EC2 templates fail when variables are missing
> ---
>
> Key: SPARK-3840
> URL: https://issues.apache.org/jira/browse/SPARK-3840
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Reporter: Allan Douglas R. de Oliveira
>
> For instance https://github.com/mesos/spark-ec2/pull/58 introduced this 
> problem when AWS_ACCESS_KEY_ID isn't set:
> Configuring /root/shark/conf/shark-env.sh
> Traceback (most recent call last):
>   File "./deploy_templates.py", line 91, in 
> text = text.replace("{{" + key + "}}", template_vars[key])
> TypeError: expected a character buffer object
> This makes all the cluster configuration fail.






[jira] [Commented] (SPARK-3654) Implement all extended HiveQL statements/commands with a separate parser combinator

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162757#comment-14162757
 ] 

Apache Spark commented on SPARK-3654:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/2698

> Implement all extended HiveQL statements/commands with a separate parser 
> combinator
> ---
>
> Key: SPARK-3654
> URL: https://issues.apache.org/jira/browse/SPARK-3654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Cheng Lian
> Fix For: 1.2.0
>
>
> Statements and commands like {{SET}}, {{CACHE TABLE}} and {{ADD JAR}} etc. 
> are currently parsed in a quite hacky way, like this:
> {code}
> if (sql.trim.toLowerCase.startsWith("cache table")) {
>   sql.trim.toLowerCase.startsWith("cache table") match {
> ...
>   }
> }
> {code}
> It would be much better to add an extra parser combinator that parses these 
> syntax extensions first and then falls back to the normal Hive parser.
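
A hedged sketch of that approach using Scala's standard parser combinators. The names 
and the plain String "plan" results are placeholders for illustration; the real 
change would build Catalyst plans and cover more statements:

{code}
import scala.util.parsing.combinator.RegexParsers

// Try the Spark-specific statements first; anything else falls back to the Hive parser.
class ExtensionParser extends RegexParsers {
  private val ident = "[a-zA-Z_][a-zA-Z0-9_.]*".r

  private def cacheTable = "(?i)CACHE\\s+TABLE".r ~> ident ^^ { t => s"CacheCommand($t)" }
  private def addJar     = "(?i)ADD\\s+JAR".r ~> ".+".r    ^^ { p => s"AddJar($p)" }
  private def setCmd     = "(?i)SET".r ~> ".*".r           ^^ { kv => s"SetCommand($kv)" }

  private def extension = cacheTable | addJar | setCmd

  def parseOrFallback(sql: String, hiveParser: String => String): String =
    parseAll(extension, sql) match {
      case Success(plan, _) => plan
      case _                => hiveParser(sql)   // normal HiveQL goes to the Hive parser
    }
}
{code}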






[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2014-10-07 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162755#comment-14162755
 ] 

Shivaram Venkataraman commented on SPARK-3821:
--

1. Yes, the same stuff is installed on the master and the slaves. In fact they have 
the same AMI.

2. The base Spark AMI is created using `create_image.sh` (from a base Amazon 
AMI). After that we pass the AMI ID to `spark_ec2.py`, which calls 
`setup.sh` on the master.

> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---
>
> Key: SPARK-3821
> URL: https://issues.apache.org/jira/browse/SPARK-3821
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, EC2
>Reporter: Nicholas Chammas
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.






[jira] [Closed] (SPARK-3777) Display "Executor ID" for Tasks in Stage page

2014-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3777.

  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Target Version/s: 1.1.1, 1.2.0

> Display "Executor ID" for Tasks in Stage page
> -
>
> Key: SPARK-3777
> URL: https://issues.apache.org/jira/browse/SPARK-3777
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0, 1.0.2, 1.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>  Labels: easy
> Fix For: 1.1.1, 1.2.0
>
>
> Now the Stage page only displays "Executor" (host) for tasks. However, there 
> may be more than one executor running on the same host. Currently, when some 
> task hangs, I only know the host of the faulty executor, so I have 
> to check all executors on that host.
> Adding "Executor ID" would help locate the faulty executor. 






[jira] [Updated] (SPARK-3777) Display "Executor ID" for Tasks in Stage page

2014-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3777:
-
Assignee: Shixiong Zhu

> Display "Executor ID" for Tasks in Stage page
> -
>
> Key: SPARK-3777
> URL: https://issues.apache.org/jira/browse/SPARK-3777
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0, 1.0.2, 1.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>  Labels: easy
> Fix For: 1.1.1, 1.2.0
>
>
> Now the Stage page only displays "Executor" (host) for tasks. However, there 
> may be more than one executor running on the same host. Currently, when some 
> task hangs, I only know the host of the faulty executor, so I have 
> to check all executors on that host.
> Adding "Executor ID" would help locate the faulty executor. 






[jira] [Created] (SPARK-3840) Spark EC2 templates fail when variables are missing

2014-10-07 Thread Allan Douglas R. de Oliveira (JIRA)
Allan Douglas R. de Oliveira created SPARK-3840:
---

 Summary: Spark EC2 templates fail when variables are missing
 Key: SPARK-3840
 URL: https://issues.apache.org/jira/browse/SPARK-3840
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Allan Douglas R. de Oliveira


For instance https://github.com/mesos/spark-ec2/pull/58 introduced this problem 
when AWS_ACCESS_KEY_ID isn't set:

Configuring /root/shark/conf/shark-env.sh
Traceback (most recent call last):
  File "./deploy_templates.py", line 91, in 
text = text.replace("{{" + key + "}}", template_vars[key])
TypeError: expected a character buffer object

This makes all the cluster configuration fail.






[jira] [Commented] (SPARK-3661) spark.*.memory is ignored in cluster mode

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162721#comment-14162721
 ] 

Apache Spark commented on SPARK-3661:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/2697

> spark.*.memory is ignored in cluster mode
> -
>
> Key: SPARK-3661
> URL: https://issues.apache.org/jira/browse/SPARK-3661
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for 
> the config. Note that `spark.executor.memory` is fine only in standalone and 
> mesos mode because we pass the Spark system properties to the driver after it 
> has started.






[jira] [Updated] (SPARK-3661) spark.*.memory is ignored in cluster mode

2014-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3661:
-
Description: This is related to 
https://issues.apache.org/jira/browse/SPARK-3653, but for the config. Note that 
`spark.executor.memory` is fine only in standalone and mesos mode because we 
pass the Spark system properties to the driver after it has started.  (was: 
This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for 
the config. Note that `spark.executor.memory` is fine only in standalone mode 
because we pass the Spark system properties to the driver after it has started.)

> spark.*.memory is ignored in cluster mode
> -
>
> Key: SPARK-3661
> URL: https://issues.apache.org/jira/browse/SPARK-3661
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for 
> the config. Note that `spark.executor.memory` is fine only in standalone and 
> mesos mode because we pass the Spark system properties to the driver after it 
> has started.






[jira] [Updated] (SPARK-3661) spark.*.memory is ignored in cluster mode

2014-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3661:
-
Description: This is related to 
https://issues.apache.org/jira/browse/SPARK-3653, but for the config. Note that 
`spark.executor.memory` is fine only in standalone mode because we pass the 
Spark system properties to the driver after it has started.  (was: This is 
related to https://issues.apache.org/jira/browse/SPARK-3653, but for the 
config. Note that `spark.executor.memory` is fine because we pass the Spark 
system properties to the driver after it has started.)

> spark.*.memory is ignored in cluster mode
> -
>
> Key: SPARK-3661
> URL: https://issues.apache.org/jira/browse/SPARK-3661
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for 
> the config. Note that `spark.executor.memory` is fine only in standalone mode 
> because we pass the Spark system properties to the driver after it has 
> started.






[jira] [Updated] (SPARK-3661) spark.*.memory is ignored in cluster mode

2014-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3661:
-
Summary: spark.*.memory is ignored in cluster mode  (was: 
spark.driver.memory is ignored in cluster mode)

> spark.*.memory is ignored in cluster mode
> -
>
> Key: SPARK-3661
> URL: https://issues.apache.org/jira/browse/SPARK-3661
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for 
> the config. Note that `spark.executor.memory` is fine because we pass the 
> Spark system properties to the driver after it has started.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3839) Reimplement HashOuterJoin to construct hash table of only one relation

2014-10-07 Thread Liquan Pei (JIRA)
Liquan Pei created SPARK-3839:
-

 Summary: Reimplement HashOuterJoin to construct hash table of only 
one relation
 Key: SPARK-3839
 URL: https://issues.apache.org/jira/browse/SPARK-3839
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Liquan Pei


Currently, in HashOuterJoin, we build hash tables for both relations; however, in a 
left or right outer join we only need to build a hash table for one relation. For 
example, for a left outer hash join, we build a hash table for the right relation 
and stream the left relation. 
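
A minimal sketch of the idea in plain Scala (not the actual Spark SQL operator): build 
a hash table for the right relation only, and stream the left relation past it for a 
left outer join.
{code}
// Sketch only: one-sided hash join for the left outer case.
def leftOuterHashJoin[K, L, R](
    left: Iterator[(K, L)],
    right: Iterator[(K, R)]): Iterator[(K, (L, Option[R]))] = {
  // Build phase: only the right relation is materialized into a hash table.
  val buildTable: Map[K, Seq[R]] =
    right.toSeq.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
  // Probe phase: the left relation is streamed through once.
  left.flatMap { case (k, l) =>
    buildTable.get(k) match {
      case Some(rs) => rs.iterator.map(r => (k, (l, Some(r): Option[R])))
      case None     => Iterator((k, (l, None: Option[R])))
    }
  }
}
{code}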



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3838) Python code example for Word2Vec in user guide

2014-10-07 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3838:


 Summary: Python code example for Word2Vec in user guide
 Key: SPARK-3838
 URL: https://issues.apache.org/jira/browse/SPARK-3838
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Liquan Pei
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application

2014-10-07 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162685#comment-14162685
 ] 

Sandy Ryza commented on SPARK-3174:
---

bq.  Maybe it makes sense to just call it `spark.dynamicAllocation.*`
That sounds good to me.

bq. I think in general we should limit the number of things that will affect 
adding/removing executors. Otherwise an application might get/lose many 
executors all of a sudden without a good understanding of why. Also, 
anticipating what's needed in a future stage is usually fairly difficult, 
because you don't know a priori how long each stage will run. I don't see a 
good metric for deciding how far into the future to anticipate.

Consider the (common) case of a user keeping a Hive session open and setting a 
low number of minimum executors in order to not sit on cluster resources when 
idle.  Goal number 1 should be making queries return as fast as possible.  A 
policy that, upon receiving a job, simply requested executors with enough slots 
to handle all the tasks required by the first stage would be a vast latency and 
user experience improvement over the exponential increase policy.  Given that 
resource managers like YARN will mediate fairness between users and that Spark 
will be able to give executors back, there's not much advantage to being 
conservative or ramping up slowly in this case.  Accurately anticipating 
resource needs is difficult, but not necessary.
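
As a sketch of what that might look like for the Hive-session case, with hypothetical 
property names under the proposed namespace (none of these settings exist yet):
{code}
import org.apache.spark.SparkConf

// Hypothetical spark.dynamicAllocation.* properties, for illustration only.
val conf = new SparkConf()
  .setAppName("elastic-hive-session")
  .set("spark.dynamicAllocation.enabled", "true")      // hypothetical
  .set("spark.dynamicAllocation.minExecutors", "2")    // stay small while the session is idle
  .set("spark.dynamicAllocation.maxExecutors", "200")  // cap what a single query can grab
{code}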

> Provide elastic scaling within a Spark application
> --
>
> Key: SPARK-3174
> URL: https://issues.apache.org/jira/browse/SPARK-3174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Assignee: Andrew Or
> Attachments: SPARK-3174design.pdf, 
> dynamic-scaling-executors-10-6-14.pdf
>
>
> A common complaint with Spark in a multi-tenant environment is that 
> applications have a fixed allocation that doesn't grow and shrink with their 
> resource needs.  We're blocked on YARN-1197 for dynamically changing the 
> resources within executors, but we can still allocate and discard whole 
> executors.
> It would be useful to have some heuristics that
> * Request more executors when many pending tasks are building up
> * Discard executors when they are idle
> See the latest design doc for more information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2014-10-07 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162663#comment-14162663
 ] 

Nicholas Chammas commented on SPARK-3821:
-

[~shivaram] / [~pwendell]:
# In a Spark cluster, what's the difference between what's installed on the 
master and what's installed on the slaves? Is it basically the same stuff, just 
with minor configuration changes?
# Starting from a base AMI, is the rough procedure for creating a fully built 
Spark instance simply running 
[{{create_image.sh}}|https://github.com/mesos/spark-ec2/blob/v3/create_image.sh]
 followed by [{{setup.sh}}|https://github.com/mesos/spark-ec2/blob/v3/setup.sh] 
(minus the stuff that connects to other instances)?

> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---
>
> Key: SPARK-3821
> URL: https://issues.apache.org/jira/browse/SPARK-3821
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, EC2
>Reporter: Nicholas Chammas
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3785) Support off-loading computations to a GPU

2014-10-07 Thread Reza Farivar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162662#comment-14162662
 ] 

Reza Farivar edited comment on SPARK-3785 at 10/7/14 10:12 PM:
---

Note: project Sumatra might become a part of Java 9, so we might get official 
GPU support in Java some time in the future. 

Sean, I agree that the memory copying is an overhead, but for the right 
application it can become small enough to ignore. Also, you can apply a series 
of operations on an RDD before moving it back to the CPU land. Think rdd.map(x 
=> sine(x)*x).filter( _ < 100).map(x=> 1/x)... The distributed nature of the 
RDD could mean we can run a whole stage in the GPU land, with each task running 
on a different GPU in the cluster, not needing to get back to the CPU land 
until we get to a collect() or groupBy(), etc. I imagine we can have a subclass 
of ShuffleMapTask that lives in the GPU land and would call a GPU kernel when 
runTask() is called.

In fact, given that we have a good number of specialized RDDs, I think we could 
have specialized GPU versions of them easily (say, the CartesianRDD for 
instance). Where it gets tougher is in the mappedRDD function, where you would 
want to pass the arbitrary function to the GPU and hope that it runs. 


was (Author: rfarivar):
I thought to add that the project Sumatra might become a part of Java 9, so we 
might get official GPU support in Java some time in the future. 

Sean, I agree that the memory copying is an overhead, but for the right 
application it can become small enough to ignore. Also, you can apply a series 
of operations on an RDD before moving it back to the CPU land. Think rdd.map(x 
=> sine(x)*x).filter( _ < 100).map(x=> 1/x)... The distributed nature of the 
RDD could mean we can run a whole stage in the GPU land, with each task would 
run on different GPU in the cluster not needing to get back in the CPU land 
until we get to a collect() or groupBy(), etc. I imagine we can have a subclass 
of ShuffleMapTask that lives in the GPU land and would call a GPU kernel when 
the runtask() is called.

In fact, given that we have a good number of specialized RDDs, I think we could 
have specialized GPU versions of them easily (say, the CartesianRDD for 
instance). Where it gets tougher is in the mappedRDD function, where you would 
want to pass the arbitrary function to the GPU and hope that it runs. 

> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to adding support for off-loading computations to the 
> GPU, e.g. via an open-cl binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3785) Support off-loading computations to a GPU

2014-10-07 Thread Reza Farivar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162662#comment-14162662
 ] 

Reza Farivar edited comment on SPARK-3785 at 10/7/14 10:13 PM:
---

Note: project Sumatra might become a part of Java 9, so we might get official 
GPU support in Java some time in the future. 

Sean, I agree that the memory copying is an overhead, but for the right 
application it can become small enough to ignore. Also, you can apply a series 
of operations on an RDD before moving it back to the CPU land. Think rdd.map(x 
=> sine(x)*x).filter( _ < 100).map(x=> 1/x)... The distributed nature of the 
RDD could mean we can run a whole stage in the GPU land, with each task running 
on a different GPU in the cluster, not needing to get back to the CPU land 
until we get to a collect() or groupBy(), etc. I imagine we can have a subclass 
of ShuffleMapTask that lives in the GPU land and would call a GPU kernel when 
runTask() is called.

In fact, given that we have a good number of specialized RDDs, I think we could 
have specialized GPU versions of them easily (say, the CartesianRDD for 
instance). Where it gets tougher is in the mappedRDD function, where you would 
want to pass the arbitrary function to the GPU and hope that it runs. 


was (Author: rfarivar):
Note: project Sumatra might become a part of Java 9, so we might get official 
GPU support in Java some time in the future. 

Sean, I agree that the memory copying is an overhead, but for the right 
application it can become small enough to ignore. Also, you can apply a series 
of operations on an RDD before moving it back to the CPU land. Think rdd.map(x 
=> sine(x)*x).filter( _ < 100).map(x=> 1/x)... The distributed nature of the 
RDD could mean we can run a whole stage in the GPU land, with each task would 
run on different GPU in the cluster not needing to get back in the CPU land 
until we get to a collect() or groupBy(), etc. I imagine we can have a subclass 
of ShuffleMapTask that lives in the GPU land and would call a GPU kernel when 
the runtask() is called.

In fact, given that we have a good number of specialized RDDs, I think we could 
have specialized GPU versions of them easily (say, the CartesianRDD for 
instance). Where it gets tougher is in the mappedRDD function, where you would 
want to pass the arbitrary function to the GPU and hope that it runs. 

> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to adding support for off-loading computations to the 
> GPU, e.g. via an open-cl binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU

2014-10-07 Thread Reza Farivar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162662#comment-14162662
 ] 

Reza Farivar commented on SPARK-3785:
-

I thought to add that the project Sumatra might become a part of Java 9, so we 
might get official GPU support in Java some time in the future. 

Sean, I agree that the memory copying is an overhead, but for the right 
application it can become small enough to ignore. Also, you can apply a series 
of operations on an RDD before moving it back to the CPU land. Think rdd.map(x 
=> sine(x)*x).filter( _ < 100).map(x=> 1/x)... The distributed nature of the 
RDD could mean we can run a whole stage in the GPU land, with each task running 
on a different GPU in the cluster, not needing to get back to the CPU land 
until we get to a collect() or groupBy(), etc. I imagine we can have a subclass 
of ShuffleMapTask that lives in the GPU land and would call a GPU kernel when 
runTask() is called.

In fact, given that we have a good number of specialized RDDs, I think we could 
have specialized GPU versions of them easily (say, the CartesianRDD for 
instance). Where it gets tougher is in the mappedRDD function, where you would 
want to pass the arbitrary function to the GPU and hope that it runs. 
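
For reference, the chained pipeline above written as plain (CPU-side) Spark Scala; it 
is a single narrow-dependency stage, which is the span a hypothetical GPU-backed 
ShuffleMapTask could run per partition before a collect() or groupBy() forces data 
back to the JVM:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("gpu-stage-shape").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 1000000).map(_.toDouble)

// map -> filter -> map with no shuffle in between: one stage that could,
// hypothetically, be fused into a single GPU kernel per partition.
val result = rdd
  .map(x => math.sin(x) * x)
  .filter(_ < 100)
  .map(x => 1 / x)
  .count()

println(result)
sc.stop()
{code}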

> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to adding support for off-loading computations to the 
> GPU, e.g. via an open-cl binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3682) Add helpful warnings to the UI

2014-10-07 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3682:
--
Attachment: SPARK-3682Design.pdf

Posting an initial design

> Add helpful warnings to the UI
> --
>
> Key: SPARK-3682
> URL: https://issues.apache.org/jira/browse/SPARK-3682
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
> Attachments: SPARK-3682Design.pdf
>
>
> Spark has a zillion configuration options and a zillion different things that 
> can go wrong with a job.  Improvements like incremental and better metrics 
> and the proposed spark replay debugger provide more insight into what's going 
> on under the covers.  However, it's difficult for non-advanced users to 
> synthesize this information and understand where to direct their attention. 
> It would be helpful to have some sort of central location on the UI users 
> could go to that would provide indications about why an app/job is failing or 
> performing poorly.
> Some helpful messages that we could provide:
> * Warn that the tasks in a particular stage are spending a long time in GC.
> * Warn that spark.shuffle.memoryFraction does not fit inside the young 
> generation.
> * Warn that tasks in a particular stage are very short, and that the number 
> of partitions should probably be decreased.
> * Warn that tasks in a particular stage are spilling a lot, and that the 
> number of partitions should probably be increased.
> * Warn that a cached RDD that gets a lot of use does not fit in memory, and a 
> lot of time is being spent recomputing it.
> To start, probably two kinds of warnings would be most helpful.
> * Warnings at the app level that report on misconfigurations, issues with the 
> general health of executors.
> * Warnings at the job level that indicate why a job might be performing 
> slowly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API

2014-10-07 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162615#comment-14162615
 ] 

Josh Rosen commented on SPARK-2321:
---

I've opened a WIP pull request in order to discuss the design / implementation 
of a pull-based progress / status API: 
https://github.com/apache/spark/pull/2696.  I'd like to focus on discussing the 
most high-level interface / API design decisions now; once we're happy with 
those decisions, we can focus on the details of which pieces of data to expose, 
etc.
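
For context, a minimal sketch of the existing push-based SparkListener API that the 
redesign would supersede (an illustration of the status quo, not the proposed API):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted}

val sc = new SparkContext(new SparkConf().setAppName("listener-sketch").setMaster("local[*]"))

// Push-based and marked DeveloperApi: Spark calls us back, there is no way to
// poll current progress, and connecting jobs to stages is left to the listener.
sc.addSparkListener(new SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    println(s"stage ${info.stageId} '${info.name}' finished with ${info.numTasks} tasks")
  }
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    println(s"job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
  }
})

sc.parallelize(1 to 1000).map(_ * 2).count()
sc.stop()
{code}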

> Design a proper progress reporting & event listener API
> ---
>
> Key: SPARK-2321
> URL: https://issues.apache.org/jira/browse/SPARK-2321
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Josh Rosen
>Priority: Critical
>
> This is a ticket to track progress on redesigning the SparkListener and 
> JobProgressListener API.
> There are multiple problems with the current design, including:
> 0. I'm not sure if the API is usable in Java (there are at least some enums 
> we used in Scala and a bunch of case classes that might complicate things).
> 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of 
> attention to it yet. Something as important as progress reporting deserves a 
> more stable API.
> 2. There is no easy way to connect jobs with stages. Similarly, there is no 
> easy way to connect job groups with jobs / stages.
> 3. JobProgressListener itself has no encapsulation at all. States can be 
> arbitrarily mutated by external programs. Variable names are sort of randomly 
> decided and inconsistent. 
> We should just revisit these and propose a new, concrete design. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162614#comment-14162614
 ] 

Apache Spark commented on SPARK-2321:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2696

> Design a proper progress reporting & event listener API
> ---
>
> Key: SPARK-2321
> URL: https://issues.apache.org/jira/browse/SPARK-2321
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, Spark Core
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Josh Rosen
>Priority: Critical
>
> This is a ticket to track progress on redesigning the SparkListener and 
> JobProgressListener API.
> There are multiple problems with the current design, including:
> 0. I'm not sure if the API is usable in Java (there are at least some enums 
> we used in Scala and a bunch of case classes that might complicate things).
> 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of 
> attention to it yet. Something as important as progress reporting deserves a 
> more stable API.
> 2. There is no easy way to connect jobs with stages. Similarly, there is no 
> easy way to connect job groups with jobs / stages.
> 3. JobProgressListener itself has no encapsulation at all. States can be 
> arbitrarily mutated by external programs. Variable names are sort of randomly 
> decided and inconsistent. 
> We should just revisit these and propose a new, concrete design. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3637) NPE in ShuffleMapTask

2014-10-07 Thread Steven Lewis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162600#comment-14162600
 ] 

Steven Lewis commented on SPARK-3637:
-

I see the same thing running a Java map-reduce program locally, and it is a 
blocking issue for my development, especially since I have no clue as to how to 
address it.

> NPE in ShuffleMapTask
> -
>
> Key: SPARK-3637
> URL: https://issues.apache.org/jira/browse/SPARK-3637
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Przemyslaw Pastuszka
>
> When trying to execute spark.jobserver.WordCountExample using spark-jobserver 
> (https://github.com/ooyala/spark-jobserver) we observed that often it fails 
> with NullPointerException in ShuffleMapTask.scala. Here are full details:
> {code}
> Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 1.0 (TID 6, 
> hadoop-simple-768-worker-with-zookeeper-0): java.lang.NullPointerException: 
> \njava.nio.ByteBuffer.wrap(ByteBuffer.java:392)\n
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:61)\n  
>   
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)\n  
>   org.apache.spark.scheduler.Task.run(Task.scala:54)\n
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)\n   
>  
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n
> java.lang.Thread.run(Thread.java:745)\nDriver stacktrace:",
> "errorClass": "org.apache.spark.SparkException",
> "stack": 
> ["org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153)",
>  
> "org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142)",
>  
> "org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141)",
>  
> "scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)",
>  "scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)", 
> "org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1141)",
>  
> "org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)",
>  
> "org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)",
>  "scala.Option.foreach(Option.scala:236)", 
> "org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:682)",
>  
> "org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1359)",
>  "akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)", 
> "akka.actor.ActorCell.invoke(ActorCell.scala:456)", 
> "akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)", 
> "akka.dispatch.Mailbox.run(Mailbox.scala:219)", 
> "akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)",
>  "scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)", 
> "scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)",
>  "scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)", 
> "scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)"
> {code}
> I am aware that this failure may be due to the job being ill-defined by 
> spark-jobserver (I don't know if that's the case), but if so, then it should 
> be handled more gracefully on the Spark side.
> What's also important, that this issue doesn't happen always, which may 
> indicate some type of race condition in the code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3837) Warn when YARN is killing containers for exceeding memory limits

2014-10-07 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3837:
-

 Summary: Warn when YARN is killing containers for exceeding memory 
limits
 Key: SPARK-3837
 URL: https://issues.apache.org/jira/browse/SPARK-3837
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Sandy Ryza


YARN now lets application masters know when it kills their containers for 
exceeding memory limits.  Spark should log something when this happens.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

2014-10-07 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162547#comment-14162547
 ] 

Davies Liu commented on SPARK-3461:
---

[~pwendell] I will start to work on this after merging 
https://github.com/apache/spark/pull/1977

> Support external groupByKey using repartitionAndSortWithinPartitions
> 
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Davies Liu
>Priority: Critical
>
> Given that we have SPARK-2978, it seems like we could support an external 
> group by operator pretty easily. We'd just have to wrap the existing iterator 
> exposed by SPARK-2978 with a lookahead iterator that detects the group 
> boundaries. Also, we'd have to override the cache() operator to cache the 
> parent RDD so that if this object is cached it doesn't wind through the 
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the 
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write 
> jobs in terms of groupByKey.
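
A rough sketch of the lookahead idea, assuming the sorted-within-partition output of 
SPARK-2978: walk each sorted partition once and emit a group whenever the key changes, 
instead of building a hash map of all groups. (This version still buffers one group's 
values at a time; a fully external version would expose them as a streaming iterator.)
{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair/ordered RDD implicits on pre-1.3 Spark

val sc = new SparkContext(new SparkConf().setAppName("external-groupby-sketch").setMaster("local[*]"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)))

val grouped = pairs
  .repartitionAndSortWithinPartitions(new HashPartitioner(4))  // SPARK-2978
  .mapPartitions { iter =>
    val buffered = iter.buffered
    new Iterator[(String, Seq[Int])] {
      def hasNext: Boolean = buffered.hasNext
      def next(): (String, Seq[Int]) = {
        // Lookahead: consume values while the key stays the same.
        val key = buffered.head._1
        val values = scala.collection.mutable.ArrayBuffer[Int]()
        while (buffered.hasNext && buffered.head._1 == key) {
          values += buffered.next()._2
        }
        (key, values.toList)
      }
    }
  }

grouped.collect().foreach(println)
sc.stop()
{code}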



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application

2014-10-07 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162514#comment-14162514
 ] 

Marcelo Vanzin commented on SPARK-3174:
---

Hi Andrew, thanks for writing this up.

My first question I think is similar to Tom's. It was not clear to me how the 
app will behave when it starts up. I'd expect the first job to be the one that 
has to process the largest amount of data, so it would benefit from having as 
many executors as possible available as quickly as possible - something that 
seems to conflict with the idea of a slow start.

Are you proposing a change to the current semantics, where Yarn will request 
"--num-executors" up front? If you keep that, I think that would cover my above 
concerns. But switching to a slow start with no option to pre-allocate a 
certain number of executors seems like it might harm certain jobs.

My second is about the shuffle service you're proposing. Have you investigated 
whether it would be possible to make Hadoop's shuffle service more generic, so 
that Spark can benefit from it? It does mean that this feature might be 
constrained to certain versions of Hadoop, but maybe that's not necessarily a 
bad thing if it means more infrastructure is shared.

> Provide elastic scaling within a Spark application
> --
>
> Key: SPARK-3174
> URL: https://issues.apache.org/jira/browse/SPARK-3174
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Assignee: Andrew Or
> Attachments: SPARK-3174design.pdf, 
> dynamic-scaling-executors-10-6-14.pdf
>
>
> A common complaint with Spark in a multi-tenant environment is that 
> applications have a fixed allocation that doesn't grow and shrink with their 
> resource needs.  We're blocked on YARN-1197 for dynamically changing the 
> resources within executors, but we can still allocate and discard whole 
> executors.
> It would be useful to have some heuristics that
> * Request more executors when many pending tasks are building up
> * Discard executors when they are idle
> See the latest design doc for more information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3836) Spark REPL optionally propagate internal exceptions

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162470#comment-14162470
 ] 

Apache Spark commented on SPARK-3836:
-

User 'ahirreddy' has created a pull request for this issue:
https://github.com/apache/spark/pull/2695

> Spark REPL optionally propagate internal exceptions 
> 
>
> Key: SPARK-3836
> URL: https://issues.apache.org/jira/browse/SPARK-3836
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ahir Reddy
>Priority: Minor
>
> Optionally have the repl throw exceptions generated by interpreted code, 
> instead of swallowing the exception and returning it as text output. This is 
> useful when embedding the repl, otherwise it's not possible to know when user 
> code threw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3836) Spark REPL optionally propagate internal exceptions

2014-10-07 Thread Ahir Reddy (JIRA)
Ahir Reddy created SPARK-3836:
-

 Summary: Spark REPL optionally propagate internal exceptions 
 Key: SPARK-3836
 URL: https://issues.apache.org/jira/browse/SPARK-3836
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Ahir Reddy
Priority: Minor


Optionally have the repl throw exceptions generated by interpreted code, 
instead of swallowing the exception and returning it as text output. This is 
useful when embedding the repl, otherwise it's not possible to know when user 
code threw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3731) RDD caching stops working in pyspark after some time

2014-10-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3731.
---
   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1

Fixed by Davies' PR, which I backported to 1.1.

> RDD caching stops working in pyspark after some time
> 
>
> Key: SPARK-3731
> URL: https://issues.apache.org/jira/browse/SPARK-3731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
> Environment: Linux, 32bit, both in local mode or in standalone 
> cluster mode
>Reporter: Milan Straka
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.1.1, 1.2.0
>
> Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, 
> worker.log
>
>
> Consider a file F which when loaded with sc.textFile and cached takes up 
> slightly more than half of free memory for RDD cache.
> When in PySpark the following is executed:
>   1) a = sc.textFile(F)
>   2) a.cache().count()
>   3) b = sc.textFile(F)
>   4) b.cache().count()
> and then the following is repeated (for example 10 times):
>   a) a.unpersist().cache().count()
>   b) b.unpersist().cache().count()
> after some time, there are no RDDs cached in memory.
> Also, since that time, no other RDD ever gets cached (the worker always 
> reports something like "WARN CacheManager: Not enough space to cache 
> partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if 
> rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that 
> all executors have 0MB memory used (which is consistent with the CacheManager 
> warning).
> When doing the same in scala, everything works perfectly.
> I understand that this is a vague description, but I do not know how to 
> describe the problem better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3828) Spark returns inconsistent results when building with different Hadoop version

2014-10-07 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162439#comment-14162439
 ] 

Xiangrui Meng commented on SPARK-3828:
--

I re-opened this because it may be a serious problem. Usually, the line reader 
skips the first record if the start pos is not 0 and always reads one extra 
record after the end pos. In the case [~liquanpei] found, the reader for the 
second partition doesn't skip the first record. So there exists duplicate and 
incorrect content in the resulting RDD. This could be a bug in Hadoop 2.4.0. 
But since Spark heavily depends on `sc.textFile`, it is worth figuring out why.
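
One way to narrow this down is to read the same file through both Hadoop APIs and 
compare counts; whichever path disagrees points at its line record reader 
(sc.textFile goes through the old mapred API under the hood):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{TextInputFormat => OldTextInputFormat}
import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}

val sc = new SparkContext(new SparkConf().setAppName("textfile-check").setMaster("local[*]"))

// Old (mapred) API, the path taken by sc.textFile.
val oldApiCount = sc.hadoopFile[LongWritable, Text, OldTextInputFormat]("text8")
  .map(_._2.toString).count()

// New (mapreduce) API, for comparison.
val newApiCount = sc.newAPIHadoopFile[LongWritable, Text, NewTextInputFormat]("text8")
  .map(_._2.toString).count()

println(s"old API: $oldApiCount, new API: $newApiCount")
sc.stop()
{code}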

> Spark returns inconsistent results when building with different Hadoop 
> version 
> ---
>
> Key: SPARK-3828
> URL: https://issues.apache.org/jira/browse/SPARK-3828
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: OSX 10.9, Spark master branch
>Reporter: Liquan Pei
>
> For text8 data at http://mattmahoney.net/dc/text8.zip. To reproduce, please 
> unzip first. 
> Spark built with different Hadoop versions returns different results. 
> {code}
> val data = sc.textFile("text8")
> data.count()
> {code}
> returns 1 when built with SPARK_HADOOP_VERSION=1.0.4 and returns 2 when built 
> with SPARK_HADOOP_VERSION=2.4.0. 
> Looking through the RDD code, it seems that textFile uses hadoopFile, which 
> creates a HadoopRDD; we should probably create a NewHadoopRDD when building 
> Spark with SPARK_HADOOP_VERSION >= 2.0.0. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-3828) Spark returns inconsistent results when building with different Hadoop version

2014-10-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reopened SPARK-3828:
--

> Spark returns inconsistent results when building with different Hadoop 
> version 
> ---
>
> Key: SPARK-3828
> URL: https://issues.apache.org/jira/browse/SPARK-3828
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: OSX 10.9, Spark master branch
>Reporter: Liquan Pei
>
> For text8 data at http://mattmahoney.net/dc/text8.zip. To reproduce, please 
> unzip first. 
> Spark built with different Hadoop versions returns different results. 
> {code}
> val data = sc.textFile("text8")
> data.count()
> {code}
> returns 1 when built with SPARK_HADOOP_VERSION=1.0.4 and returns 2 when built 
> with SPARK_HADOOP_VERSION=2.4.0. 
> Looking through the RDD code, it seems that textFile uses hadoopFile, which 
> creates a HadoopRDD; we should probably create a NewHadoopRDD when building 
> Spark with SPARK_HADOOP_VERSION >= 2.0.0. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3825) Log more information when unrolling a block fails

2014-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3825.

   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1

> Log more information when unrolling a block fails
> -
>
> Key: SPARK-3825
> URL: https://issues.apache.org/jira/browse/SPARK-3825
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.1.1, 1.2.0
>
>
> We currently log only the following:
> {code}
> 14/10/06 16:45:42 WARN CacheManager: Not enough space to cache partition 
> rdd_0_2 in memory! Free memory is 481861527 bytes.
> {code}
> This is confusing, however, because "free memory" here means the amount of 
> memory not occupied by blocks. It does not include the amount of memory 
> reserved for unrolling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3816) Add configureOutputJobPropertiesForStorageHandler to JobConf in SparkHadoopWriter

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162420#comment-14162420
 ] 

Apache Spark commented on SPARK-3816:
-

User 'alexliu68' has created a pull request for this issue:
https://github.com/apache/spark/pull/2677

> Add configureOutputJobPropertiesForStorageHandler to JobConf in 
> SparkHadoopWriter
> -
>
> Key: SPARK-3816
> URL: https://issues.apache.org/jira/browse/SPARK-3816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Alex Liu
>
> It's similar to SPARK-2846. We should add 
> PlanUtils.configureInputJobPropertiesForStorageHandler to SparkHadoopWriter, 
> so that the writer can add configuration from a customized StorageHandler to the JobConf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3790) CosineSimilarity via DIMSUM example

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162415#comment-14162415
 ] 

Apache Spark commented on SPARK-3790:
-

User 'rezazadeh' has created a pull request for this issue:
https://github.com/apache/spark/pull/2622

> CosineSimilarity via DIMSUM example
> ---
>
> Key: SPARK-3790
> URL: https://issues.apache.org/jira/browse/SPARK-3790
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reza Zadeh
>Assignee: Reza Zadeh
>
> Create an example that gives the approximation error for DIMSUM using an 
> arbitrary RowMatrix given via the command line.
> PR tracking this:
> https://github.com/apache/spark/pull/2622



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2759) The ability to read binary files into Spark

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162403#comment-14162403
 ] 

Apache Spark commented on SPARK-2759:
-

User 'kmader' has created a pull request for this issue:
https://github.com/apache/spark/pull/1658

> The ability to read binary files into Spark
> ---
>
> Key: SPARK-2759
> URL: https://issues.apache.org/jira/browse/SPARK-2759
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Java API, Spark Core
>Reporter: Kevin Mader
>
> For reading images, compressed files, or other custom formats it would be 
> useful to have methods that could read the files in as a byte array or 
> DataInputStream so other functions could then process the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3166) Custom serialisers can't be shipped in application jars

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162409#comment-14162409
 ] 

Apache Spark commented on SPARK-3166:
-

User 'GrahamDennis' has created a pull request for this issue:
https://github.com/apache/spark/pull/1890

> Custom serialisers can't be shipped in application jars
> ---
>
> Key: SPARK-3166
> URL: https://issues.apache.org/jira/browse/SPARK-3166
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Graham Dennis
>
> Spark cannot currently use a custom serialiser that is shipped with the 
> application jar. Trying to do this causes a java.lang.ClassNotFoundException 
> when trying to instantiate the custom serialiser in the Executor processes. 
> This occurs because Spark attempts to instantiate the custom serialiser 
> before the application jar has been shipped to the Executor process. A 
> reproduction of the problem is available here: 
> https://github.com/GrahamDennis/spark-custom-serialiser
> I've verified this problem in Spark 1.0.2, and Spark master and 1.1 branches 
> as of August 21, 2014.  This issue is related to SPARK-2878, and my fix for 
> that issue (https://github.com/apache/spark/pull/1890) also solves this.  My 
> pull request was not merged because it adds the user jar to the Executor 
> processes' class path at launch time.  Such a significant change was thought 
> by [~rxin] to require more QA, and should be considered for inclusion in 1.2 
> at the earliest.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2489) Unsupported parquet datatype optional fixed_len_byte_array

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162407#comment-14162407
 ] 

Apache Spark commented on SPARK-2489:
-

User 'joesu' has created a pull request for this issue:
https://github.com/apache/spark/pull/1737

> Unsupported parquet datatype optional fixed_len_byte_array
> --
>
> Key: SPARK-2489
> URL: https://issues.apache.org/jira/browse/SPARK-2489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Pei-Lun Lee
>
> tested against commit 9fe693b5
> {noformat}
> scala> sqlContext.parquetFile("/tmp/foo")
> java.lang.RuntimeException: Unsupported parquet datatype optional 
> fixed_len_byte_array(4) b
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279)
> {noformat}
> example avro schema
> {noformat}
> protocol Test {
> fixed Bytes4(4);
> record Foo {
> union {null, Bytes4} b;
> }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3580) Add Consistent Method To Get Number of RDD Partitions Across Different Languages

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162413#comment-14162413
 ] 

Apache Spark commented on SPARK-3580:
-

User 'patmcdonough' has created a pull request for this issue:
https://github.com/apache/spark/pull/2447

> Add Consistent Method To Get Number of RDD Partitions Across Different 
> Languages
> 
>
> Key: SPARK-3580
> URL: https://issues.apache.org/jira/browse/SPARK-3580
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Pat McDonough
>  Labels: starter
>
> Programmatically retrieving the number of partitions is not consistent 
> between Python and Scala. A consistent method should be defined and made 
> public across both languages.
> RDD.partitions.size is also used quite frequently throughout the internal 
> code, so that might be worth refactoring as well once the new method is 
> available.
> What we have today is below.
> In Scala:
> {code}
> scala> someRDD.partitions.size
> res0: Int = 30
> {code}
> In Python:
> {code}
> In [2]: someRDD.getNumPartitions()
> Out[2]: 30
> {code}
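
Until a common method exists, a hypothetical Scala-side helper (not a Spark API) could 
mirror PySpark's call shape:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical convenience wrapper, for illustration only.
implicit class RDDPartitionCount[T](rdd: RDD[T]) {
  def getNumPartitions: Int = rdd.partitions.size
}

val sc = new SparkContext(new SparkConf().setAppName("numpartitions-sketch").setMaster("local[*]"))
val someRDD = sc.parallelize(1 to 100, 30)
println(someRDD.getNumPartitions)  // 30, same answer as someRDD.partitions.size
sc.stop()
{code}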



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2016) rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162405#comment-14162405
 ] 

Apache Spark commented on SPARK-2016:
-

User 'carlosfuertes' has created a pull request for this issue:
https://github.com/apache/spark/pull/1682

> rdd in-memory storage UI becomes unresponsive when the number of RDD 
> partitions is large
> 
>
> Key: SPARK-2016
> URL: https://issues.apache.org/jira/browse/SPARK-2016
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>  Labels: starter
>
> Try run
> {code}
> sc.parallelize(1 to 100, 100).cache().count()
> {code}
> And open the storage UI for this RDD. It takes forever to load the page.
> When the number of partitions is very large, I think there are a few 
> alternatives:
> 0. Only show the top 1000.
> 1. Pagination
> 2. Instead of grouping by RDD blocks, group by executors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3812) Adapt maven build to publish effective pom.

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162417#comment-14162417
 ] 

Apache Spark commented on SPARK-3812:
-

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/2673

> Adapt maven build to publish effective pom.
> ---
>
> Key: SPARK-3812
> URL: https://issues.apache.org/jira/browse/SPARK-3812
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Spark Core
>Reporter: Prashant Sharma
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3338) Respect user setting of spark.submit.pyFiles

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162411#comment-14162411
 ] 

Apache Spark commented on SPARK-3338:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/2232

> Respect user setting of spark.submit.pyFiles
> 
>
> Key: SPARK-3338
> URL: https://issues.apache.org/jira/browse/SPARK-3338
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We currently override any setting of spark.submit.pyFiles. Even though this 
> is not documented, we should still respect this if the user explicitly sets 
> this in his/her default properties file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3251) Clarify learning interfaces

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162410#comment-14162410
 ] 

Apache Spark commented on SPARK-3251:
-

User 'BigCrunsh' has created a pull request for this issue:
https://github.com/apache/spark/pull/2137

>  Clarify learning interfaces
> 
>
> Key: SPARK-3251
> URL: https://issues.apache.org/jira/browse/SPARK-3251
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Christoph Sawade
>
> *Make threshold mandatory*
> Currently, the output of predict for an example is either the score
> or the class. This side-effect is caused by clearThreshold. To
> clarify that behaviour, three different types of predict (predictScore,
> predictClass, predictProbability) were introduced; the threshold is no
> longer optional.
> *Clarify classification interfaces*
> Currently, some functionality is spread over multiple models.
> In order to clarify the structure and simplify the implementation of
> more complex models (like multinomial logistic regression), two new
> classes are introduced:
> - BinaryClassificationModel: for all models that derive a binary 
> classification from a single weight vector. Comprises the thresholding 
> functionality to derive a prediction from a score. It basically captures 
> SVMModel and LogisticRegressionModel.
> - ProbabilisticClassificationModel: This trait defines the interface for 
> models that return a calibrated confidence score (aka probability).
> *Misc*
> - some renaming
> - add test for probabilistic output



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162404#comment-14162404
 ] 

Apache Spark commented on SPARK-2017:
-

User 'carlosfuertes' has created a pull request for this issue:
https://github.com/apache/spark/pull/1682

> web ui stage page becomes unresponsive when the number of tasks is large
> 
>
> Key: SPARK-2017
> URL: https://issues.apache.org/jira/browse/SPARK-2017
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Reynold Xin
>  Labels: starter
>
> {code}
> sc.parallelize(1 to 100, 100).count()
> {code}
> The above code creates one million tasks to be executed. The stage detail web 
> ui page takes forever to load (if it ever completes).
> There are again a few different alternatives:
> 0. Limit the number of tasks we show.
> 1. Pagination
> 2. By default only show the aggregate metrics and failed tasks, and hide the 
> successful ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2803) add Kafka stream feature for fetch messages from specified starting offset position

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162402#comment-14162402
 ] 

Apache Spark commented on SPARK-2803:
-

User 'pengyanhong' has created a pull request for this issue:
https://github.com/apache/spark/pull/1602

> add Kafka stream feature for fetch messages from specified starting offset 
> position
> ---
>
> Key: SPARK-2803
> URL: https://issues.apache.org/jira/browse/SPARK-2803
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: pengyanhong
>
> There are some use cases where we want to fetch messages from a specified 
> offset position, as below:
> * replay messages
> * deal with transaction
> * skip bulk incorrect messages
> * random fetch message according to index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2805) update akka to version 2.3

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162406#comment-14162406
 ] 

Apache Spark commented on SPARK-2805:
-

User 'avati' has created a pull request for this issue:
https://github.com/apache/spark/pull/1685

> update akka to version 2.3
> --
>
> Key: SPARK-2805
> URL: https://issues.apache.org/jira/browse/SPARK-2805
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Spark Core
>Reporter: Anand Avati
>
> akka-2.3 is the lowest version available in Scala 2.11
> akka-2.3 depends on protobuf 2.5. Hadoop-1 requires protobuf 2.4.1. In order 
> to reconcile the conflicting dependencies, we need to release an 
> akka-2.3.x-shaded-protobuf artifact which has protobuf 2.5 within.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3809) make HiveThriftServer2Suite work correctly

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162418#comment-14162418
 ] 

Apache Spark commented on SPARK-3809:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/2675

> make HiveThriftServer2Suite work correctly
> --
>
> Key: SPARK-3809
> URL: https://issues.apache.org/jira/browse/SPARK-3809
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> Currently HiveThriftServer2Suite is a fake one; the HiveThriftServer is not 
> actually started there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3505) Augmenting SparkStreaming updateStateByKey API with timestamp

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162412#comment-14162412
 ] 

Apache Spark commented on SPARK-3505:
-

User 'xiliu82' has created a pull request for this issue:
https://github.com/apache/spark/pull/2267

> Augmenting SparkStreaming updateStateByKey API with timestamp
> -
>
> Key: SPARK-3505
> URL: https://issues.apache.org/jira/browse/SPARK-3505
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Xi Liu
>Priority: Minor
> Fix For: 1.2.0
>
>
> The current updateStateByKey API in Spark Streaming does not expose the batch 
> timestamp to the application. 
> In our use case, the application needs to know the batch timestamp to decide 
> whether to keep the state or not. And we do not want to use real system time, 
> because we want to decouple the two (the same code base is used for both 
> streaming and offline processing).
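
As a concrete illustration, a minimal sketch of the current API and of the shape being 
requested (the socket source, checkpoint path and the commented-out timestamp-aware 
signature are assumptions for illustration, not the API added by any pull request):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(new SparkConf().setAppName("StatefulCount"), Seconds(10))
ssc.checkpoint("/tmp/checkpoint")  // stateful operations require a checkpoint directory

val pairs = ssc.socketTextStream("localhost", 9999).map(word => (word, 1))

// Current API: the update function sees only this batch's values and the prior state.
val counts = pairs.updateStateByKey[Long] { (values: Seq[Int], state: Option[Long]) =>
  Some(state.getOrElse(0L) + values.sum)
}
counts.print()

// Requested shape (hypothetical signature): also pass the batch Time, so state can be
// expired relative to the batch being processed rather than the wall clock, e.g.
//   pairs.updateStateByKey[Long] { (batchTime: Time, values: Seq[Int], state: Option[Long]) => ... }

ssc.start()
ssc.awaitTermination()
{code}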



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2960) Spark executables fail to start via symlinks

2014-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162408#comment-14162408
 ] 

Apache Spark commented on SPARK-2960:
-

User 'roji' has created a pull request for this issue:
https://github.com/apache/spark/pull/1875

> Spark executables fail to start via symlinks
> 
>
> Key: SPARK-2960
> URL: https://issues.apache.org/jira/browse/SPARK-2960
> Project: Spark
>  Issue Type: Bug
>Reporter: Shay Rojansky
>Priority: Minor
>
> The current scripts (e.g. pyspark) fail to run when they are executed via 
> symlinks. A common Linux scenario would be to have Spark installed somewhere 
> (e.g. /opt) and have a symlink to it in /usr/bin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3835) Spark applications that are killed should show up as "KILLED" or "CANCELLED" in the Spark UI

2014-10-07 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-3835:
-

 Summary: Spark applications that are killed should show up as 
"KILLED" or "CANCELLED" in the Spark UI
 Key: SPARK-3835
 URL: https://issues.apache.org/jira/browse/SPARK-3835
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Matt Cheah


Spark applications that crash or are killed are listed as FINISHED in the Spark 
UI.

It looks like the Master only passes back a list of "Running" applications and 
a list of "Completed" applications. All of the applications under "Completed" 
have status "FINISHED"; however, if they were killed manually they should show 
"CANCELLED", and if they failed they should read "FAILED".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3834) Backticks not correctly handled in subquery aliases

2014-10-07 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3834:
---

 Summary: Backticks not correctly handled in subquery aliases
 Key: SPARK-3834
 URL: https://issues.apache.org/jira/browse/SPARK-3834
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Ravindra Pesala
Priority: Blocker


[~ravi.pesala]  assigning to you since you fixed the last problem here.  Let me 
know if you don't have time to work on this or if you have any questions.
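
For anyone picking this up, a small illustration of the construct in question (table 
and column names are made up, and whether this exact statement triggers the failure 
is an assumption, since the ticket has no reproduction attached):

{code}
// A subquery alias wrapped in backticks - the construct this ticket is about.
val result = sqlContext.sql(
  "SELECT `t`.`value` FROM (SELECT key, value FROM src) `t` WHERE `t`.`key` > 10")
{code}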



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2582) Make Block Manager Master pluggable

2014-10-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2582.

Resolution: Won't Fix

I closed this PR a long time ago, but didn't close the associated JIRA.

> Make Block Manager Master pluggable
> ---
>
> Key: SPARK-2582
> URL: https://issues.apache.org/jira/browse/SPARK-2582
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 1.0.0
>Reporter: Hari Shreedharan
>
> Today, there is no way to make the BMM pluggable, so an HA BMM would have to 
> replace the current one. Making this pluggable and selectable via a config makes 
> it easy to choose an HA or non-HA one based on the application's preference. 
> Streaming applications would be better off with an HA one, while a normal 
> application would not care (since the RDDs can be regenerated).
> Since communication from the Block Managers to the BMM is via akka, we can keep 
> that the same and just have the BMM implementation provide the actual methods 
> which do the real work - this would not affect the Block Managers either.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3828) Spark returns inconsistent results when building with different Hadoop version

2014-10-07 Thread Liquan Pei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162374#comment-14162374
 ] 

Liquan Pei edited comment on SPARK-3828 at 10/7/14 7:33 PM:


It seems that this is a bug in LineRecordReader. For Spark built with 1.0.4, 
when running
{code}
 sc.textFile("text8").map(_.size).collect()
{code}
It returns, 
{code}
Array[Int] = Array(1)
{code} 
which is consistent with the text8 file size. However, for Spark built with Hadoop 2.4.0, 
the above code returns
{code}
Array[Int] = Array(1, 32891136)
{code}
Note that the second entry of the 2.4.0 result equals the size of the 2nd partition of 
text8, which means that the first record of that partition is not correctly skipped. 

Will try to reproduce it with Hadoop.   


was (Author: liquanpei):
It seems that this is a bug in LineRecordReader. For Spark built with 1.0.4, 
when running
{code}
 sc.textFile("text8").map(_.size).collect()
{code}
It returns, 
{code}
Array[Int] = Array(1)
{code} 
which is consistent with the text8 file size. However, for Spark built with Hadoop 2.4.0, 
the above code returns
{code}
Array[Int] = Array(1, 32891136)
{code}
Note that the second entry of the 2.4.0 result equals the size of the second partition 
of text8, which means that the first record of that partition is not correctly skipped. 

Will try to reproduce it with Hadoop.   

> Spark returns inconsistent results when building with different Hadoop 
> version 
> ---
>
> Key: SPARK-3828
> URL: https://issues.apache.org/jira/browse/SPARK-3828
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: OSX 10.9, Spark master branch
>Reporter: Liquan Pei
>
> For the text8 data at http://mattmahoney.net/dc/text8.zip (to reproduce, please 
> unzip it first): 
> Spark built with different Hadoop versions returns different results. 
> {code}
> val data = sc.textFile("text8")
> data.count()
> {code}
> This returns 1 when built with SPARK_HADOOP_VERSION=1.0.4 and returns 2 when built 
> with SPARK_HADOOP_VERSION=2.4.0. 
> Looking through the RDD code, it seems that textFile uses hadoopFile, which 
> creates a HadoopRDD; we should probably create a NewHadoopRDD when building Spark 
> with SPARK_HADOOP_VERSION >= 2.0.0. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3828) Spark returns inconsistent results when building with different Hadoop version

2014-10-07 Thread Liquan Pei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162374#comment-14162374
 ] 

Liquan Pei commented on SPARK-3828:
---

It seems that this is a bug in LineRecordReader. For Spark built with 1.0.4, 
when running
{code}
 sc.textFile("text8").map(_.size).collect()
{code}
It returns, 
{code}
Array[Int] = Array(1)
{code} 
which is consistent with the text8 file size. However, for Spark built with Hadoop 2.4.0, 
the above code returns
{code}
Array[Int] = Array(1, 32891136)
{code}
Note that the second entry of the 2.4.0 result equals the size of the second partition 
of text8, which means that the first record of that partition is not correctly skipped. 

Will try to reproduce it with Hadoop.   
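
One way to narrow this down from the Spark side is to read the same file through both 
Hadoop input APIs in the same build and compare the per-record sizes; a rough sketch 
(the path is whatever was used above):

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}

val path = "text8"

// Old mapred.* API, which sc.textFile uses via hadoopFile / HadoopRDD.
val oldApi = sc.textFile(path).map(_.length).collect()

// New mapreduce.* API (NewHadoopRDD), which uses the other LineRecordReader.
val newApi = sc.newAPIHadoopFile[LongWritable, Text, NewTextInputFormat](path)
  .map(_._2.toString.length).collect()

println(s"old API: ${oldApi.mkString(",")}")
println(s"new API: ${newApi.mkString(",")}")
{code}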

> Spark returns inconsistent results when building with different Hadoop 
> version 
> ---
>
> Key: SPARK-3828
> URL: https://issues.apache.org/jira/browse/SPARK-3828
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: OSX 10.9, Spark master branch
>Reporter: Liquan Pei
>
> For the text8 data at http://mattmahoney.net/dc/text8.zip (to reproduce, please 
> unzip it first): 
> Spark built with different Hadoop versions returns different results. 
> {code}
> val data = sc.textFile("text8")
> data.count()
> {code}
> This returns 1 when built with SPARK_HADOOP_VERSION=1.0.4 and returns 2 when built 
> with SPARK_HADOOP_VERSION=2.4.0. 
> Looking through the RDD code, it seems that textFile uses hadoopFile, which 
> creates a HadoopRDD; we should probably create a NewHadoopRDD when building Spark 
> with SPARK_HADOOP_VERSION >= 2.0.0. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3765) Add test information to sbt build docs

2014-10-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3765.

Resolution: Fixed
  Assignee: wangfei

This was resolved by:
https://github.com/apache/spark/pull/2629

> Add test information to sbt build docs
> --
>
> Key: SPARK-3765
> URL: https://issues.apache.org/jira/browse/SPARK-3765
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: wangfei
>Assignee: wangfei
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3765) Add test information to sbt build docs

2014-10-07 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3765:
---
Fix Version/s: 1.2.0

> Add test information to sbt build docs
> --
>
> Key: SPARK-3765
> URL: https://issues.apache.org/jira/browse/SPARK-3765
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: wangfei
>Assignee: wangfei
> Fix For: 1.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3797) Run the shuffle service inside the YARN NodeManager as an AuxiliaryService

2014-10-07 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162351#comment-14162351
 ] 

Patrick Wendell commented on SPARK-3797:


For the dependencies issue - the plan is to create a separate build module that 
contains only the jar for the shuffle service, so we can produce a jar with only 
this service and not the rest of Spark's dependency graph. This won't have any 
dependencies except for Netty, which is already a dependency of YARN (and we are 
using the same version), and potentially the Scala library jar (though we've even 
discussed writing this particular component in Java). I think that fully solves 
the issues Sandy has mentioned.

BTW, in general I don't think we are going to require this to run Spark-on-YARN 
in the future - it will just be a mode that people can run in if they want 
better elasticity.

> Run the shuffle service inside the YARN NodeManager as an AuxiliaryService
> --
>
> Key: SPARK-3797
> URL: https://issues.apache.org/jira/browse/SPARK-3797
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> It's also worth considering running the shuffle service in a YARN container 
> beside the executor(s) on each node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time

2014-10-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3731:
--
Affects Version/s: 1.0.2

> RDD caching stops working in pyspark after some time
> 
>
> Key: SPARK-3731
> URL: https://issues.apache.org/jira/browse/SPARK-3731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
> Environment: Linux, 32bit, both in local mode or in standalone 
> cluster mode
>Reporter: Milan Straka
>Assignee: Davies Liu
>Priority: Critical
> Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, 
> worker.log
>
>
> Consider a file F which when loaded with sc.textFile and cached takes up 
> slightly more than half of free memory for RDD cache.
> When in PySpark the following is executed:
>   1) a = sc.textFile(F)
>   2) a.cache().count()
>   3) b = sc.textFile(F)
>   4) b.cache().count()
> and then the following is repeated (for example 10 times):
>   a) a.unpersist().cache().count()
>   b) b.unpersist().cache().count()
> after some time, there are no RDDs cached in memory.
> Also, since that time, no other RDD ever gets cached (the worker always 
> reports something like "WARN CacheManager: Not enough space to cache 
> partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if 
> rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that 
> all executors have 0MB memory used (which is consistent with the CacheManager 
> warning).
> When doing the same in scala, everything works perfectly.
> I understand that this is a vague description, but I do not know how to 
> describe the problem better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time

2014-10-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3731:
--
Target Version/s: 1.1.1, 1.2.0  (was: 1.1.1, 1.2.0, 1.0.3)

> RDD caching stops working in pyspark after some time
> 
>
> Key: SPARK-3731
> URL: https://issues.apache.org/jira/browse/SPARK-3731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0, 1.2.0
> Environment: Linux, 32bit, both in local mode or in standalone 
> cluster mode
>Reporter: Milan Straka
>Assignee: Davies Liu
>Priority: Critical
> Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, 
> worker.log
>
>
> Consider a file F which when loaded with sc.textFile and cached takes up 
> slightly more than half of free memory for RDD cache.
> When in PySpark the following is executed:
>   1) a = sc.textFile(F)
>   2) a.cache().count()
>   3) b = sc.textFile(F)
>   4) b.cache().count()
> and then the following is repeated (for example 10 times):
>   a) a.unpersist().cache().count()
>   b) b.unpersist().cache().count()
> after some time, there are no RDDs cached in memory.
> Also, since that time, no other RDD ever gets cached (the worker always 
> reports something like "WARN CacheManager: Not enough space to cache 
> partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if 
> rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that 
> all executors have 0MB memory used (which is consistent with the CacheManager 
> warning).
> When doing the same in scala, everything works perfectly.
> I understand that this is a vague description, but I do not know how to 
> describe the problem better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time

2014-10-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3731:
--
Affects Version/s: (was: 1.0.2)
   1.2.0

> RDD caching stops working in pyspark after some time
> 
>
> Key: SPARK-3731
> URL: https://issues.apache.org/jira/browse/SPARK-3731
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0, 1.2.0
> Environment: Linux, 32bit, both in local mode or in standalone 
> cluster mode
>Reporter: Milan Straka
>Assignee: Davies Liu
>Priority: Critical
> Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, 
> worker.log
>
>
> Consider a file F which when loaded with sc.textFile and cached takes up 
> slightly more than half of free memory for RDD cache.
> When in PySpark the following is executed:
>   1) a = sc.textFile(F)
>   2) a.cache().count()
>   3) b = sc.textFile(F)
>   4) b.cache().count()
> and then the following is repeated (for example 10 times):
>   a) a.unpersist().cache().count()
>   b) b.unpersist().cache().count()
> after some time, there are no RDDs cached in memory.
> Also, since that time, no other RDD ever gets cached (the worker always 
> reports something like "WARN CacheManager: Not enough space to cache 
> partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if 
> rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that 
> all executors have 0MB memory used (which is consistent with the CacheManager 
> warning).
> When doing the same in scala, everything works perfectly.
> I understand that this is a vague description, but I do not know how to 
> describe the problem better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3762) clear all SparkEnv references after stop

2014-10-07 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-3762.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

> clear all SparkEnv references after stop
> 
>
> Key: SPARK-3762
> URL: https://issues.apache.org/jira/browse/SPARK-3762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.2.0
>
>
> SparkEnv is cached in a ThreadLocal object, so after stopping it and creating a new 
> SparkContext, the old SparkEnv is still used by some threads, which triggers 
> many problems.
> We should clear all of the references after a SparkEnv is stopped.
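
The underlying pattern, in a tiny self-contained sketch (Holder and Env are 
illustrative stand-ins, not Spark's actual classes):

{code}
// Minimal stand-in for the SparkEnv-in-a-ThreadLocal pattern described above.
final case class Env(id: Int)

object Holder {
  private val local = new ThreadLocal[Env]
  def set(e: Env): Unit = local.set(e)
  def get: Env = local.get
  // Without an explicit clear, a long-lived thread keeps serving the old Env
  // even after it has been stopped and a new one created.
  def clear(): Unit = local.remove()
}

Holder.set(Env(1))
Holder.clear()        // the fix: drop the reference when the environment is stopped
Holder.set(Env(2))    // threads now see the fresh environment
println(Holder.get)   // Env(2)
{code}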



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()

2014-10-07 Thread Sotos Matzanas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sotos Matzanas closed SPARK-3732.
-
Resolution: Won't Fix

> Yarn Client: Add option to NOT System.exit() at end of main()
> -
>
> Key: SPARK-3732
> URL: https://issues.apache.org/jira/browse/SPARK-3732
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Sotos Matzanas
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> We would like to add the ability to create and submit Spark jobs 
> programmatically via Scala/Java. We have found a way to hack this and submit 
> jobs via Yarn, but since org.apache.spark.deploy.yarn.Client.main() exits with 
> either 0 or 1 at the end, our own program exits as well. 
> We would like to add an optional Spark conf param to NOT exit at the end of 
> main().
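
Until such a flag exists, one common workaround (a sketch only, and not what this 
ticket proposes) is to trap the exit with a custom SecurityManager around the call:

{code}
// Sketch: intercept System.exit() from yarn.Client.main so the embedding
// application keeps running. Assumes installing a SecurityManager is permitted.
def runYarnClient(args: Array[String]): Unit = {
  val previous = System.getSecurityManager
  System.setSecurityManager(new SecurityManager {
    override def checkExit(status: Int): Unit =
      throw new SecurityException(s"trapped System.exit($status)")
    override def checkPermission(perm: java.security.Permission): Unit = ()  // allow everything else
  })
  try {
    org.apache.spark.deploy.yarn.Client.main(args)
  } catch {
    case _: SecurityException => // the client finished and tried to exit; swallow it
  } finally {
    System.setSecurityManager(previous)
  }
}
{code}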



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3808) PySpark fails to start in Windows

2014-10-07 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162310#comment-14162310
 ] 

Andrew Or commented on SPARK-3808:
--

Hey [~tsudukim] can you verify that pyspark, spark-shell and spark-submit work 
as expected in Windows now?

> PySpark fails to start in Windows
> -
>
> Key: SPARK-3808
> URL: https://issues.apache.org/jira/browse/SPARK-3808
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 1.2.0
> Environment: Windows
>Reporter: Masayoshi TSUZUKI
>Assignee: Masayoshi TSUZUKI
>Priority: Blocker
> Fix For: 1.2.0
>
>
> When we execute bin\pyspark.cmd in Windows, it fails to start.
> We get the following messages.
> {noformat}
> C:\>bin\pyspark.cmd
> Running C:\\python.exe with 
> PYTHONPATH=C:\\bin\..\python\lib\py4j-0.8.2.1-src.zip;C:\\bin\..\python;
> Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on 
> win32
> Type "help", "copyright", "credits" or "license" for more information.
> ="x" was unexpected at this time.
> Traceback (most recent call last):
>   File "C:\\bin\..\python\pyspark\shell.py", line 45, in 
> sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
>   File "C:\\python\pyspark\context.py", line 103, in __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
>   File "C:\\python\pyspark\context.py", line 212, in _ensure_initialized
> SparkContext._gateway = gateway or launch_gateway()
>   File "C:\\python\pyspark\java_gateway.py", line 71, in launch_gateway
> raise Exception(error_msg)
> Exception: Launching GatewayServer failed with exit code 255!
> Warning: Expected GatewayServer to output a port, but found no output.
> >>>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4

2014-10-07 Thread Igor Tkachenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Tkachenko closed SPARK-3761.
-
Resolution: Fixed

> Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
> -
>
> Key: SPARK-3761
> URL: https://issues.apache.org/jira/browse/SPARK-3761
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Igor Tkachenko
>Priority: Critical
>
> I have Scala code:
> val master = "spark://:7077"
> val sc = new SparkContext(new SparkConf()
>   .setMaster(master)
>   .setAppName("SparkQueryDemo 01")
>   .set("spark.executor.memory", "512m"))
> val count2 = sc .textFile("hdfs:// address>:8020/tmp/data/risk/account.txt")
>   .filter(line  => line.contains("Word"))
>   .count()
> I've got such an error:
> [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to 
> stage failure: Task 0.0:0 failed 4 times, most
> recent failure: Exception failure in TID 6 on host : 
> java.lang.ClassNotFoundException: SimpleApp$$anonfun$1
> My dependencies :
> object Version {
>   val spark= "1.0.0-cdh5.1.0"
> }
> object Library {
>   val sparkCore  = "org.apache.spark"  % "spark-assembly_2.10"  % 
> Version.spark
> }
> My OS is Win 7, sbt 13.5, Scala 2.10.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3808) PySpark fails to start in Windows

2014-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3808.

  Resolution: Fixed
   Fix Version/s: 1.2.0
Assignee: Masayoshi TSUZUKI
Target Version/s: 1.2.0

> PySpark fails to start in Windows
> -
>
> Key: SPARK-3808
> URL: https://issues.apache.org/jira/browse/SPARK-3808
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 1.2.0
> Environment: Windows
>Reporter: Masayoshi TSUZUKI
>Assignee: Masayoshi TSUZUKI
>Priority: Blocker
> Fix For: 1.2.0
>
>
> When we execute bin\pyspark.cmd in Windows, it fails to start.
> We get the following messages.
> {noformat}
> C:\>bin\pyspark.cmd
> Running C:\\python.exe with 
> PYTHONPATH=C:\\bin\..\python\lib\py4j-0.8.2.1-src.zip;C:\\bin\..\python;
> Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on 
> win32
> Type "help", "copyright", "credits" or "license" for more information.
> ="x" was unexpected at this time.
> Traceback (most recent call last):
>   File "C:\\bin\..\python\pyspark\shell.py", line 45, in 
> sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
>   File "C:\\python\pyspark\context.py", line 103, in __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
>   File "C:\\python\pyspark\context.py", line 212, in _ensure_initialized
> SparkContext._gateway = gateway or launch_gateway()
>   File "C:\\python\pyspark\java_gateway.py", line 71, in launch_gateway
> raise Exception(error_msg)
> Exception: Launching GatewayServer failed with exit code 255!
> Warning: Expected GatewayServer to output a port, but found no output.
> >>>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4

2014-10-07 Thread Igor Tkachenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162307#comment-14162307
 ] 

Igor Tkachenko commented on SPARK-3761:
---

After I added the line sc.addJar("") it works for sbt (but not for the Maven 
project; happily, that is enough for me). We can close the issue.
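
For other readers hitting the same ClassNotFoundException: the compiled closure 
classes (SimpleApp$$anonfun$1 here) have to be shipped to the executors, either at 
construction time or after the context is created; a sketch, with the jar path as a 
placeholder:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val master = "spark://<master host>:7077"  // placeholder
val conf = new SparkConf()
  .setMaster(master)
  .setAppName("SparkQueryDemo 01")
  .set("spark.executor.memory", "512m")
  // Ship the jar containing SimpleApp (and its anonymous function classes) to the
  // executors; the path is a placeholder for the application's assembled jar.
  .setJars(Seq("target/scala-2.10/simpleapp_2.10-1.0.jar"))
val sc = new SparkContext(conf)

// Equivalent alternative after the context exists:
// sc.addJar("target/scala-2.10/simpleapp_2.10-1.0.jar")
{code}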

> Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
> -
>
> Key: SPARK-3761
> URL: https://issues.apache.org/jira/browse/SPARK-3761
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Igor Tkachenko
>Priority: Critical
>
> I have Scala code:
> val master = "spark://:7077"
> val sc = new SparkContext(new SparkConf()
>   .setMaster(master)
>   .setAppName("SparkQueryDemo 01")
>   .set("spark.executor.memory", "512m"))
> val count2 = sc .textFile("hdfs:// address>:8020/tmp/data/risk/account.txt")
>   .filter(line  => line.contains("Word"))
>   .count()
> I've got such an error:
> [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to 
> stage failure: Task 0.0:0 failed 4 times, most
> recent failure: Exception failure in TID 6 on host : 
> java.lang.ClassNotFoundException: SimpleApp$$anonfun$1
> My dependencies :
> object Version {
>   val spark= "1.0.0-cdh5.1.0"
> }
> object Library {
>   val sparkCore  = "org.apache.spark"  % "spark-assembly_2.10"  % 
> Version.spark
> }
> My OS is Win 7, sbt 13.5, Scala 2.10.4



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3808) PySpark fails to start in Windows

2014-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3808:
-
Affects Version/s: (was: 1.1.0)
   1.2.0

> PySpark fails to start in Windows
> -
>
> Key: SPARK-3808
> URL: https://issues.apache.org/jira/browse/SPARK-3808
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 1.2.0
> Environment: Windows
>Reporter: Masayoshi TSUZUKI
>Priority: Blocker
>
> When we execute bin\pyspark.cmd in Windows, it fails to start.
> We get the following messages.
> {noformat}
> C:\>bin\pyspark.cmd
> Running C:\\python.exe with 
> PYTHONPATH=C:\\bin\..\python\lib\py4j-0.8.2.1-src.zip;C:\\bin\..\python;
> Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on 
> win32
> Type "help", "copyright", "credits" or "license" for more information.
> ="x" was unexpected at this time.
> Traceback (most recent call last):
>   File "C:\\bin\..\python\pyspark\shell.py", line 45, in 
> sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
>   File "C:\\python\pyspark\context.py", line 103, in __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
>   File "C:\\python\pyspark\context.py", line 212, in _ensure_initialized
> SparkContext._gateway = gateway or launch_gateway()
>   File "C:\\python\pyspark\java_gateway.py", line 71, in launch_gateway
> raise Exception(error_msg)
> Exception: Launching GatewayServer failed with exit code 255!
> Warning: Expected GatewayServer to output a port, but found no output.
> >>>
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3297) [Spark SQL][UI] SchemaRDD toString with many columns messes up Storage tab display

2014-10-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3297.
---
   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1
 Assignee: Hossein Falaki

Fixed by https://github.com/apache/spark/pull/2687

> [Spark SQL][UI] SchemaRDD toString with many columns messes up Storage tab 
> display
> --
>
> Key: SPARK-3297
> URL: https://issues.apache.org/jira/browse/SPARK-3297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 1.0.2
>Reporter: Evan Chan
>Assignee: Hossein Falaki
>Priority: Minor
>  Labels: newbie
> Fix For: 1.1.1, 1.2.0
>
>
> When a SchemaRDD with many columns (for example, 57 columns in this example) 
> is cached using sqlContext.cacheTable, the Storage tab of the driver Web UI 
> display gets messed up, because the long string of the SchemaRDD causes the 
> first column to be much much wider than the others, and in fact much wider 
> than the width of the browser.  It would be nice to have the first column be 
> restricted to, say, 50% of the width of the browser window, with some minimum.
> For example this is the SchemaRDD text for my table:
> RDD Storage Info for ExistingRdd 
> [ActionGeo_ADM1Code#198,ActionGeo_CountryCode#199,ActionGeo_FeatureID#200,ActionGeo_FullName#201,ActionGeo_Lat#202,ActionGeo_Long#203,ActionGeo_Type#204,Actor1Code#205,Actor1CountryCode#206,Actor1EthnicCode#207,Actor1Geo_ADM1Code#208,Actor1Geo_CountryCode#209,Actor1Geo_FeatureID#210,Actor1Geo_FullName#211,Actor1Geo_Lat#212,Actor1Geo_Long#213,Actor1Geo_Type#214,Actor1KnownGroupCode#215,Actor1Name#216,Actor1Religion1Code#217,Actor1Religion2Code#218,Actor1Type1Code#219,Actor1Type2Code#220,Actor1Type3Code#221,Actor2Code#222,Actor2CountryCode#223,Actor2EthnicCode#224,Actor2Geo_ADM1Code#225,Actor2Geo_CountryCode#226,Actor2Geo_FeatureID#227,Actor2Geo_FullName#228,Actor2Geo_Lat#229,Actor2Geo_Long#230,Actor2Geo_Type#231,Actor2KnownGroupCode#232,Actor2Name#233,Actor2Religion1Code#234,Actor2Religion2Code#235,Actor2Type1Code#236,Actor2Type2Code#237,Actor2Type3Code#238,AvgTone#239,DATEADDED#240,Day#241,EventBaseCode#242,EventCode#243,EventId#244,EventRootCode#245,FractionDate#246,GoldsteinScale#247,IsRootEvent#248,MonthYear#249,NumArticles#250,NumMentions#251,NumSources#252,QuadClass#253,Year#254],
>  MappedRDD[200]
> I would personally love to fix the toString method to not necessarily print 
> every column, but to cut it off after a while.  This would aid the printout 
> in the Spark Shell as well.  For example:
> [ActionGeo_ADM1Code#198,ActionGeo_CountryCode#199,ActionGeo_FeatureID#200,ActionGeo_FullName#201,ActionGeo_Lat#202
>   and 52 more columns]
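
The "cut it off after a while" idea could be as simple as the following sketch (not 
the fix that was merged; the column names are stand-ins):

{code}
// Print the first few column names and summarize the rest, as suggested above.
def truncatedColumns(columns: Seq[String], keep: Int = 5): String = {
  if (columns.length <= keep) columns.mkString("[", ",", "]")
  else columns.take(keep).mkString("[", ",", s" and ${columns.length - keep} more columns]")
}

val cols = (1 to 57).map(i => s"col$i#${100 + i}")
println(truncatedColumns(cols))
// [col1#101,col2#102,col3#103,col4#104,col5#105 and 52 more columns]
{code}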



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2915) Storage summary table UI glitch when using sparkSQL

2014-10-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2915.
---
   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1
 Assignee: Hossein Falaki

Fixed by https://github.com/apache/spark/pull/2687

> Storage summary table UI glitch when using sparkSQL
> ---
>
> Key: SPARK-2915
> URL: https://issues.apache.org/jira/browse/SPARK-2915
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.0.2
> Environment: Standalone
>Reporter: Hossein Falaki
>Assignee: Hossein Falaki
>Priority: Minor
>  Labels: WebUI
> Fix For: 1.1.1, 1.2.0
>
>
> When using sqlContext.cacheTable() on a registered table, the name of the RDD 
> becomes a very large string (related to the query that created the SQL RDD). 
> As a result the first column of the Storage tab in the Spark UI becomes very long 
> and the other columns become squashed.
> Since the name of the RDD is not human readable, we can simply use an ellipsis 
> in the first cell (which will hide the rest of the string). Alternatively we can 
> change the RDD name to a shorter, more readable one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3827) Very long RDD names are not rendered properly in web UI

2014-10-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3827.
---
   Resolution: Fixed
Fix Version/s: 1.1.1

Issue resolved by pull request 2687
[https://github.com/apache/spark/pull/2687]

> Very long RDD names are not rendered properly in web UI
> ---
>
> Key: SPARK-3827
> URL: https://issues.apache.org/jira/browse/SPARK-3827
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.1.0
>Reporter: Hossein Falaki
>Priority: Minor
> Fix For: 1.2.0, 1.1.1
>
>
> With Spark SQL we generate very long RDD names. These names are not properly 
> rendered in the web UI. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3797) Run the shuffle service inside the YARN NodeManager as an AuxiliaryService

2014-10-07 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162254#comment-14162254
 ] 

Andrew Or commented on SPARK-3797:
--

Thanks for detailing the considerations Sandy. I agree with every single one of 
the drawbacks you listed.

The alternative of launching the shuffle service inside containers has been 
given much thought. However, it will be overkill if we allocate one such 
service for each executor or even application. In general, these services are 
intended to be long-running local resource managers that are really more suited 
to be run per-node. As you suggested, these services tend to have low memory 
requirements and would be forced to take up more than what is needed.

For the rolling upgrades point, we can add some logic as in MR to handle short 
outages as Tom suggested. The dependency and deployment stories are a little 
harder to work around. I think the point here is that either way we need to 
offer an alternative of running it independently of the NM in case the cluster 
has conflicting dependencies. Perhaps we'll need some 
`start-shuffle-service.sh` script to launch these containers on all nodes 
before running any actual Spark application. I should note that our shuffle 
service is intended to be fairly lightweight and will have very limited 
dependencies (e.g. we are considering building it with Java because we don't 
want to bundle Scala). Hopefully that mitigates the issue.
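
For reference, the NodeManager-hosted option boils down to implementing Hadoop's 
AuxiliaryService hook; a rough skeleton assuming the Hadoop 2.x API (the class name, 
service name and the yarn-site.xml wiring shown in comments are illustrative, not the 
actual Spark implementation):

{code}
import java.nio.ByteBuffer
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.server.api.{ApplicationInitializationContext,
  ApplicationTerminationContext, AuxiliaryService}

class ExternalShuffleService extends AuxiliaryService("spark_shuffle") {

  override protected def serviceInit(conf: Configuration): Unit = {
    // Start the (Netty-based) shuffle server here, e.g. on a port read from conf.
  }

  override def initializeApplication(ctx: ApplicationInitializationContext): Unit = {
    // Record per-application state (e.g. the app's shuffle secret) when it starts.
  }

  override def stopApplication(ctx: ApplicationTerminationContext): Unit = {
    // Clean up shuffle state for the finished application.
  }

  override def getMetaData(): ByteBuffer = ByteBuffer.allocate(0)
}

// The NodeManager would be pointed at it in yarn-site.xml, roughly:
//   yarn.nodemanager.aux-services                     = spark_shuffle
//   yarn.nodemanager.aux-services.spark_shuffle.class = <fully qualified class name>
{code}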

> Run the shuffle service inside the YARN NodeManager as an AuxiliaryService
> --
>
> Key: SPARK-3797
> URL: https://issues.apache.org/jira/browse/SPARK-3797
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> It's also worth considering running the shuffle service in a YARN container 
> beside the executor(s) on each node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3819) Jenkins should compile Spark against multiple versions of Hadoop

2014-10-07 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162195#comment-14162195
 ] 

Matt Cheah commented on SPARK-3819:
---

Can you elaborate as to why it is not feasible to build against multiple Hadoop 
versions? Is it simply because it is too slow?

I still strongly stand by the idea of making the need to test building against 
multiple versions explicit to the contributor. We need to minimize the risk of 
breaking the build.

> Jenkins should compile Spark against multiple versions of Hadoop
> 
>
> Key: SPARK-3819
> URL: https://issues.apache.org/jira/browse/SPARK-3819
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.0
>Reporter: Matt Cheah
>Priority: Minor
>  Labels: Jenkins
> Fix For: 1.1.1
>
>
> The build broke because of PR 
> https://github.com/apache/spark/pull/2609#issuecomment-57962393 - however the 
> build failure was not caught by Jenkins. From what I understand the build 
> failure occurs when Spark is built manually against certain versions of 
> Hadoop.
> It seems intuitive that Jenkins should catch this sort of thing. The code 
> should be compiled against multiple Hadoop versions. It seems like overkill 
> to run the full test suite against all Hadoop versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-10-07 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-1297:
--
Attachment: spark-1297-v6.txt

Patch v6 uses the HBase 0.98.5 release.

> Upgrade HBase dependency to 0.98.0
> --
>
> Key: SPARK-1297
> URL: https://issues.apache.org/jira/browse/SPARK-1297
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Attachments: pom.xml, spark-1297-v2.txt, spark-1297-v4.txt, 
> spark-1297-v5.txt, spark-1297-v6.txt
>
>
> HBase 0.94.6 was released 11 months ago.
> Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3434) Distributed block matrix

2014-10-07 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162156#comment-14162156
 ] 

Xiangrui Meng commented on SPARK-3434:
--

[~shivaram] and [~ConcreteVitamin] Any updates on the design doc and prototype?

> Distributed block matrix
> 
>
> Key: SPARK-3434
> URL: https://issues.apache.org/jira/browse/SPARK-3434
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>
> This JIRA is for discussing distributed matrices stored in block 
> sub-matrices. The main challenge is the partitioning scheme to allow adding 
> linear algebra operations in the future, e.g.:
> 1. matrix multiplication
> 2. matrix factorization (QR, LU, ...)
> Let's discuss the partitioning and storage and how they fit into the above 
> use cases.
> Questions:
> 1. Should it be backed by a single RDD that contains all of the sub-matrices, 
> or by many RDDs, each containing only one sub-matrix? (A rough sketch of the 
> single-RDD option follows below.)
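
On question 1, a rough sketch of the single-RDD-of-blocks option, using Breeze for the 
local blocks (names and layout are made up for illustration; this is not a design 
proposal):

{code}
import breeze.linalg.DenseMatrix
import org.apache.spark.SparkContext._   // pair-RDD operations (join, reduceByKey)
import org.apache.spark.rdd.RDD

// One RDD holding every sub-matrix, keyed by (blockRow, blockCol).
case class BlockMatrix(blocks: RDD[((Int, Int), DenseMatrix[Double])]) {

  // C(i, k) = sum over j of A(i, j) * B(j, k): join the two block sets on the
  // shared index j, multiply the local blocks, then reduce on the output coordinate.
  def multiply(other: BlockMatrix): BlockMatrix = {
    val left  = blocks.map { case ((i, j), m) => (j, (i, m)) }
    val right = other.blocks.map { case ((j, k), m) => (j, (k, m)) }
    val products = left.join(right).map { case (_, ((i, a), (k, b))) => ((i, k), a * b) }
    BlockMatrix(products.reduceByKey(_ + _))
  }
}
{code}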



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


