[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163125#comment-14163125 ] Sean Owen commented on SPARK-3785: -- Sure, but a GPU isn't going to be good at general map, filter, reduce, groupBy operations. It can't run arbitrary functions like the JVM. I wonder how many use cases actually consist of enough computation that can be specialized for the GPU, chained together, that makes the GPU worth it. My suspicion is still that there are really a few wins for this use case, but that they are achievable by just calling to the GPU from Java code. I'd love to see that this is in fact a way to transparently speed up a non-trivial slice of mainstream Spark use cases though. > Support off-loading computations to a GPU > - > > Key: SPARK-3785 > URL: https://issues.apache.org/jira/browse/SPARK-3785 > Project: Spark > Issue Type: Brainstorming > Components: MLlib >Reporter: Thomas Darimont >Priority: Minor > > Are there any plans to add support for off-loading computations to the > GPU, e.g. via an OpenCL binding? > http://www.jocl.org/ > https://code.google.com/p/javacl/ > http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3836) Spark REPL optionally propagate internal exceptions
[ https://issues.apache.org/jira/browse/SPARK-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3836. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Ahir Reddy https://github.com/apache/spark/pull/2695 > Spark REPL optionally propagate internal exceptions > > > Key: SPARK-3836 > URL: https://issues.apache.org/jira/browse/SPARK-3836 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ahir Reddy >Assignee: Ahir Reddy >Priority: Minor > Fix For: 1.2.0 > > > Optionally have the repl throw exceptions generated by interpreted code, > instead of swallowing the exception and returning it as text output. This is > useful when embedding the repl, otherwise it's not possible to know when user > code threw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
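The behavior requested in SPARK-3836 can be sketched in miniature. The snippet below is a hedged illustration in plain Python, not Spark's actual Scala REPL code; the function name and flag are hypothetical. By default the interpreter loop swallows an exception from user code and returns its text as output; with the flag set, it re-raises so the embedding host sees the real exception.

```python
# Hypothetical sketch of the SPARK-3836 behavior (not Spark's REPL code):
# an embedded interpreter loop that either returns exception text (the
# default) or propagates the exception to the host that embeds it.
import traceback

def interpret(code, env, propagate_exceptions=False):
    """Run a snippet in env; return 'ok', the exception text, or re-raise."""
    try:
        exec(code, env)
        return "ok"
    except Exception:
        if propagate_exceptions:
            raise                          # embedding host sees the real error
        return traceback.format_exc()      # default: exception as text output
```

With the flag off, the host only ever sees strings and cannot tell user-code failure apart from ordinary output, which is exactly the problem the issue describes.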
[jira] [Updated] (SPARK-3807) SparkSql does not work for tables created using custom serde
[ https://issues.apache.org/jira/browse/SPARK-3807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chirag aggarwal updated SPARK-3807: --- Description: SparkSql crashes on selecting tables using custom serde. Example: CREATE EXTERNAL TABLE table_name PARTITIONED BY ( a int) ROW FORMAT 'SERDE "org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer" with serdeproperties("serialization.format"="org.apache.thrift.protocol.TBinaryProtocol","serialization.class"="ser_class") STORED AS SEQUENCEFILE; The following exception is seen on running a query like 'select * from table_name limit 1': ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68) at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80) at org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86) at org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:100) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188) at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:280) at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406) at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.NullPointerException After fixing this issue, when some columns in the table were referred in the query, sparksql could not resolve those references. was: SparkSql crashes on selecting tables using custom serde. 
[jira] [Commented] (SPARK-2811) update algebird to 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163034#comment-14163034 ] Dan Di Spaltro commented on SPARK-2811: --- Looks like algebird_2.11 artifacts are on maven central. http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22algebird_2.11%22 > update algebird to 0.8.1 > > > Key: SPARK-2811 > URL: https://issues.apache.org/jira/browse/SPARK-2811 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Reporter: Anand Avati > > First algebird_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163027#comment-14163027 ] Apache Spark commented on SPARK-2426: - User 'debasish83' has created a pull request for this issue: https://github.com/apache/spark/pull/2705 > Quadratic Minimization for MLlib ALS > > > Key: SPARK-2426 > URL: https://issues.apache.org/jira/browse/SPARK-2426 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Debasish Das >Assignee: Debasish Das > Original Estimate: 504h > Remaining Estimate: 504h > > Current ALS supports least squares and nonnegative least squares. > I presented ADMM and IPM based Quadratic Minimization solvers to be used for > the following ALS problems: > 1. ALS with bounds > 2. ALS with L1 regularization > 3. ALS with Equality constraint and bounds > Initial runtime comparisons are presented at Spark Summit. > http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark > Based on Xiangrui's feedback I am currently comparing the ADMM based > Quadratic Minimization solvers with IPM based QpSolvers and the default > ALS/NNLS. I will keep updating the runtime comparison results. > For integration the detailed plan is as follows: > 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization > 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
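For readers unfamiliar with the solver family discussed above, here is a minimal sketch of the simplest listed variant, ALS subproblem 1 reduced to nonnegative least squares, solved by projected gradient. This is an illustration in plain Python under simplifying assumptions (dense lists, a fixed step size), not MLlib's NNLS or the proposed QuadraticMinimizer; general bounds and box constraints generalize the clipping step.

```python
# Illustrative only: projected gradient for min ||Ax - b||^2 s.t. x >= 0.
# The "projection" onto the feasible set is just clipping at the bound;
# ALS-with-bounds would clip against per-coordinate lower/upper bounds.

def nnls_projected_gradient(A, b, step=0.1, iters=2000):
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # residual r = A x - b
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        # gradient g = A^T r
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # gradient step, then project onto x >= 0
        x = [max(0.0, x[j] - step * g[j]) for j in range(n)]
    return x
```

ADMM and interior-point methods, the approaches actually compared in the ticket, solve the same constrained quadratic objectives with much better convergence behavior; this sketch only fixes the problem shape.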
[jira] [Commented] (SPARK-3781) code style format
[ https://issues.apache.org/jira/browse/SPARK-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163018#comment-14163018 ] Apache Spark commented on SPARK-3781: - User 'shijinkui' has created a pull request for this issue: https://github.com/apache/spark/pull/2704 > code style format > - > > Key: SPARK-3781 > URL: https://issues.apache.org/jira/browse/SPARK-3781 > Project: Spark > Issue Type: Improvement >Reporter: sjk > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1656) Potential resource leak in HttpBroadcast, SparkSubmitArguments, FileSystemPersistenceEngine and DiskStore
[ https://issues.apache.org/jira/browse/SPARK-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-1656. - Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Already merged into master and branch-1.1 > Potential resource leak in HttpBroadcast, SparkSubmitArguments, > FileSystemPersistenceEngine and DiskStore > - > > Key: SPARK-1656 > URL: https://issues.apache.org/jira/browse/SPARK-1656 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Labels: easyfix > Fix For: 1.1.1, 1.2.0 > > > Again... I'm trying to review all `close` statements to find such issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
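The class of fix reviewed in SPARK-1656 usually amounts to guaranteeing that `close()` runs on every path, including the error path. A minimal sketch of the pattern, with illustrative names rather than Spark's actual code:

```python
# Illustrative resource-leak fix: the stream is closed in a finally
# block, so an exception thrown mid-write can no longer leak the handle.

def write_all(path, chunks):
    out = open(path, "wb")
    try:
        for chunk in chunks:
            out.write(chunk)   # may raise; close() must still run
    finally:
        out.close()            # no leak on the error path
```

In Scala the same shape appears as `try { ... } finally { out.close() }`; the bug being fixed is simply that the `close()` call sat on the happy path only.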
[jira] [Commented] (SPARK-3808) PySpark fails to start in Windows
[ https://issues.apache.org/jira/browse/SPARK-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162935#comment-14162935 ] Masayoshi TSUZUKI commented on SPARK-3808: -- I verified and it works well! Thank you [~andrewor14] > PySpark fails to start in Windows > - > > Key: SPARK-3808 > URL: https://issues.apache.org/jira/browse/SPARK-3808 > Project: Spark > Issue Type: Bug > Components: PySpark, Windows >Affects Versions: 1.2.0 > Environment: Windows >Reporter: Masayoshi TSUZUKI >Assignee: Masayoshi TSUZUKI >Priority: Blocker > Fix For: 1.2.0 > > > When we execute bin\pyspark.cmd in Windows, it fails to start. > We get following messages. > {noformat} > C:\>bin\pyspark.cmd > Running C:\\python.exe with > PYTHONPATH=C:\\bin\..\python\lib\py4j-0.8.2.1-src.zip;C:\\bin\..\python; > Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on > win32 > Type "help", "copyright", "credits" or "license" for more information. > ="x" was unexpected at this time. > Traceback (most recent call last): > File "C:\\bin\..\python\pyspark\shell.py", line 45, in > sc = SparkContext(appName="PySparkShell", pyFiles=add_files) > File "C:\\python\pyspark\context.py", line 103, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "C:\\python\pyspark\context.py", line 212, in _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File "C:\\python\pyspark\java_gateway.py", line 71, in launch_gateway > raise Exception(error_msg) > Exception: Launching GatewayServer failed with exit code 255! > Warning: Expected GatewayServer to output a port, but found no output. > >>> > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()
[ https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162913#comment-14162913 ] Xu Zhongxing commented on SPARK-3005: - Resolved in https://github.com/apache/spark/pull/2453 > Spark with Mesos fine-grained mode throws UnsupportedOperationException in > MesosSchedulerBackend.killTask() > --- > > Key: SPARK-3005 > URL: https://issues.apache.org/jira/browse/SPARK-3005 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.2 > Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector >Reporter: Xu Zhongxing > Attachments: SPARK-3005_1.diff > > > I am using Spark, Mesos, and spark-cassandra-connector to do some work on a > Cassandra cluster. > While the job was running, I killed the Cassandra daemon to simulate some > failure cases. This resulted in task failures. > If I run the job in Mesos coarse-grained mode, the spark driver program > throws an exception and shuts down cleanly. > But when I run the job in Mesos fine-grained mode, the spark driver program > hangs. 
> The spark log is: > {code} > INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 > Logging.scala (line 58) Cancelling stage 1 > INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 > Logging.scala (line 79) Could not cancel tasks for stage 1 > java.lang.UnsupportedOperationException > at > org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) > at > org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scal
[jira] [Closed] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()
[ https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Zhongxing closed SPARK-3005. --- Resolution: Fixed > Spark with Mesos fine-grained mode throws UnsupportedOperationException in > MesosSchedulerBackend.killTask() > --- > > Key: SPARK-3005 > URL: https://issues.apache.org/jira/browse/SPARK-3005 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.2 > Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector >Reporter: Xu Zhongxing > Attachments: SPARK-3005_1.diff > > > I am using Spark, Mesos, spark-cassandra-connector to do some work on a > cassandra cluster. > During the job running, I killed the Cassandra daemon to simulate some > failure cases. This results in task failures. > If I run the job in Mesos coarse-grained mode, the spark driver program > throws an exception and shutdown cleanly. > But when I run the job in Mesos fine-grained mode, the spark driver program > hangs. 
[jira] [Issue Comment Deleted] (SPARK-3820) Specialize columnSimilarity() without any threshold
[ https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh updated SPARK-3820: -- Comment: was deleted (was: See previous comment on resolution.) > Specialize columnSimilarity() without any threshold > --- > > Key: SPARK-3820 > URL: https://issues.apache.org/jira/browse/SPARK-3820 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Reza Zadeh > > `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to > compute the exact cosine similarities. It still requires sampling, which is > unnecessary for this case. We should have a specialized version for it, in > order to have a fair comparison with DIMSUM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3820) Specialize columnSimilarity() without any threshold
[ https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Zadeh resolved SPARK-3820. --- Resolution: Fixed See previous comment on resolution. > Specialize columnSimilarity() without any threshold > --- > > Key: SPARK-3820 > URL: https://issues.apache.org/jira/browse/SPARK-3820 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Reza Zadeh > > `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to > compute the exact cosine similarities. It still requires sampling, which is > unnecessary for this case. We should have a specialized version for it, in > order to have a fair comparison with DIMSUM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3820) Specialize columnSimilarity() without any threshold
[ https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162899#comment-14162899 ] Reza Zadeh commented on SPARK-3820: --- I ran columnSimilarities(0.0) with the random number generation commented, and uncommented, and didn't observe any difference in timing for completion of stage mapPartitionsWithIndex. > Specialize columnSimilarity() without any threshold > --- > > Key: SPARK-3820 > URL: https://issues.apache.org/jira/browse/SPARK-3820 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Reza Zadeh > > `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to > compute the exact cosine similarities. It still requires sampling, which is > unnecessary for this case. We should have a specialized version for it, in > order to have a fair comparison with DIMSUM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
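For context, the specialization requested above amounts to accumulating the column Gram matrix in a single pass over the rows and normalizing, with no per-entry sampling or random number generation at all. A small plain-Python sketch of that exact computation (illustrative only; MLlib's `RowMatrix` is Scala and distributed):

```python
# Illustrative exact columnSimilarities(): one pass over rows to build
# the upper-triangular Gram matrix G[i][j] = col_i . col_j, then cosine
# similarity G[i][j] / (||col_i|| * ||col_j||). No sampling anywhere.
import math

def exact_column_similarities(rows):
    """rows: list of equal-length numeric lists; returns {(i, j): cosine}."""
    n = len(rows[0])
    gram = [[0.0] * n for _ in range(n)]
    for row in rows:
        for i in range(n):
            if row[i] == 0.0:
                continue                   # skip zero entries, as DIMSUM does
            for j in range(i, n):
                gram[i][j] += row[i] * row[j]
    sims = {}
    for i in range(n):
        for j in range(i + 1, n):
            denom = math.sqrt(gram[i][i]) * math.sqrt(gram[j][j])
            sims[(i, j)] = gram[i][j] / denom if denom else 0.0
    return sims
```

Reza's timing experiment above suggests the random number generation is not the bottleneck in practice, but a specialized path like this still removes it entirely, which makes the comparison with DIMSUM-with-threshold cleanly fair.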
[jira] [Resolved] (SPARK-3412) Add Missing Types for Row API
[ https://issues.apache.org/jira/browse/SPARK-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3412. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2689 [https://github.com/apache/spark/pull/2689] > Add Missing Types for Row API > - > > Key: SPARK-3412 > URL: https://issues.apache.org/jira/browse/SPARK-3412 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Minor > Fix For: 1.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3843) Cleanup scalastyle.txt at the end of running dev/scalastyle
[ https://issues.apache.org/jira/browse/SPARK-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162867#comment-14162867 ] Apache Spark commented on SPARK-3843: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2702 > Cleanup scalastyle.txt at the end of running dev/scalastyle > --- > > Key: SPARK-3843 > URL: https://issues.apache.org/jira/browse/SPARK-3843 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.2.0 >Reporter: Kousuke Saruta >Priority: Trivial > > dev/scalastyle creates a log file, 'scalastyle.txt'. It is overwritten on > each run but never deleted, even though dev/mima and dev/lint-python delete > their log files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3843) Cleanup scalastyle.txt at the end of running dev/scalastyle
Kousuke Saruta created SPARK-3843: - Summary: Cleanup scalastyle.txt at the end of running dev/scalastyle Key: SPARK-3843 URL: https://issues.apache.org/jira/browse/SPARK-3843 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Trivial dev/scalastyle creates a log file, 'scalastyle.txt'. It is overwritten on each run but never deleted, even though dev/mima and dev/lint-python delete their log files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
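The requested cleanup is simply "delete the log when the check finishes". A hedged sketch in Python for illustration only (dev/scalastyle itself is a shell script, and the function name here is hypothetical):

```python
# Hypothetical sketch of the SPARK-3843 cleanup: capture the style-check
# output in scalastyle.txt, then remove the log unconditionally on exit,
# mirroring what dev/mima and dev/lint-python do for their log files.
import os

LOG = "scalastyle.txt"

def run_style_check(check):
    """Run check(), log its output to LOG, return it, then delete LOG."""
    try:
        with open(LOG, "w") as f:      # overwritten on every run
            f.write(check())
        with open(LOG) as f:
            return f.read()
    finally:
        if os.path.exists(LOG):        # cleanup happens even on failure
            os.remove(LOG)
```

In the actual shell script this would be a `rm scalastyle.txt` at the end (or in a trap), matching the pattern the other dev scripts already use.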
[jira] [Commented] (SPARK-3569) Add metadata field to StructField
[ https://issues.apache.org/jira/browse/SPARK-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162853#comment-14162853 ] Apache Spark commented on SPARK-3569: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/2701 > Add metadata field to StructField > - > > Key: SPARK-3569 > URL: https://issues.apache.org/jira/browse/SPARK-3569 > Project: Spark > Issue Type: New Feature > Components: MLlib, SQL >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Want to add a metadata field to StructField that can be used by other > applications like ML to embed more information about the column. > {code} > case class case class StructField(name: String, dataType: DataType, nullable: > Boolean, metadata: Map[String, Any] = Map.empty) > {code} > For ML, we can store feature information like categorical/continuous, number > categories, category-to-index map, etc. > One question is how to carry over the metadata in query execution. For > example: > {code} > val features = schemaRDD.select('features) > val featuresDesc = features.schema('features).metadata > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
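The proposal above can be mimicked in a few lines. The sketch below is plain Python with illustrative names (Spark SQL's `StructField` is Scala, and this is not its API): a schema field carrying a free-form metadata map, plus a projection that carries the metadata through, answering the "how to carry over the metadata in query execution" question in the simplest possible setting.

```python
# Illustrative only, not Spark SQL's API: a schema field with a metadata
# map (e.g. ML column info such as categorical indices), and a select()
# that preserves each field's metadata in the projected schema.
from dataclasses import dataclass, field

@dataclass
class StructField:
    name: str
    data_type: str
    nullable: bool = True
    metadata: dict = field(default_factory=dict)

def select(schema, *names):
    """Project a schema by column name; metadata survives the projection."""
    by_name = {f.name: f for f in schema}
    return [by_name[n] for n in names]
```

The real design question in the ticket is exactly this carry-over: every operator that produces a new schema must copy the field metadata forward rather than rebuild fields from scratch.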
[jira] [Created] (SPARK-3842) Remove the hacks for Python callback server in py4j
Davies Liu created SPARK-3842: - Summary: Remove the hacks for Python callback server in py4j Key: SPARK-3842 URL: https://issues.apache.org/jira/browse/SPARK-3842 Project: Spark Issue Type: Improvement Components: PySpark, Streaming Reporter: Davies Liu Priority: Minor There are three hacks used while creating the Python API for Streaming (https://github.com/apache/spark/pull/2538): 1. daemonize the callback server thread, by setting 'thread.daemon = True' before starting it https://github.com/bartdag/py4j/issues/147 2. let the callback server bind to a random port, then update the Java callback client with the real port. https://github.com/bartdag/py4j/issues/148 3. start the callback server later. https://github.com/bartdag/py4j/issues/149 These hacks should be removed once py4j has fixed these issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
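Hacks 1 and 2 are easy to demonstrate with plain sockets. The sketch below is standalone Python, not py4j's actual CallbackServer: a server thread marked daemon (so it cannot keep the interpreter alive on exit) bound to port 0 (so the OS assigns a free port), with the real port read back so it could be handed to the client side, as PySpark hands it to the Java callback client.

```python
# Standalone illustration of hacks 1 and 2 (plain sockets, not py4j):
# daemonized server thread + bind-to-port-0 with the real port read back.
import socket
import threading

def start_callback_server():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))         # hack 2: OS picks a free port
    srv.listen(1)
    port = srv.getsockname()[1]        # real port to report to the client

    def serve():
        conn, _ = srv.accept()
        conn.sendall(conn.recv(1024))  # trivial echo standing in for callbacks
        conn.close()

    t = threading.Thread(target=serve)
    t.daemon = True                    # hack 1: don't block interpreter exit
    t.start()
    return port
```

The linked py4j issues ask for these behaviors (daemon threads, ephemeral-port binding, deferred start) to be supported natively so PySpark no longer has to patch around them.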
[jira] [Closed] (SPARK-3829) Make Spark logo image on the header of HistoryPage as a link to HistoryPage's page #1
[ https://issues.apache.org/jira/browse/SPARK-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3829. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Assignee: Kousuke Saruta > Make Spark logo image on the header of HistoryPage as a link to HistoryPage's > page #1 > - > > Key: SPARK-3829 > URL: https://issues.apache.org/jira/browse/SPARK-3829 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 1.1.1, 1.2.0 > > > There is a Spark logo on the header of HistoryPage. > We can have too many HistoryPages if we run 20+ applications. So I think > it's useful if the logo is a link to the HistoryPage's page #1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3820) Specialize columnSimilarity() without any threshold
[ https://issues.apache.org/jira/browse/SPARK-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162814#comment-14162814 ] Reza Zadeh commented on SPARK-3820: --- I will do an experiment to see if the random number generation is adding significant overhead, and if it is, then add a flag to avoid it when threshold zero is given. > Specialize columnSimilarity() without any threshold > --- > > Key: SPARK-3820 > URL: https://issues.apache.org/jira/browse/SPARK-3820 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Reza Zadeh > > `RowMatrix.columnSimilarities` calls `RowMatrix.columnSimilarity(0.0)` to > compute the exact cosine similarities. It still requires sampling, which is > unnecessary for this case. We should have a specialized version for it, in > order to have a fair comparison with DIMSUM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
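For reference, the exact (no-sampling, threshold-zero) computation the specialized version would perform can be sketched in plain Python; `exact_column_similarities` is an illustrative name, not Spark's API:

```python
import math
from itertools import combinations

def exact_column_similarities(rows):
    # Exact pairwise cosine similarity between columns of a row matrix:
    # one pass accumulating column dot products and squared norms.
    n = len(rows[0])
    dot = [[0.0] * n for _ in range(n)]
    norm = [0.0] * n
    for row in rows:
        for j, v in enumerate(row):
            norm[j] += v * v
        for i, j in combinations(range(n), 2):
            dot[i][j] += row[i] * row[j]
    return {
        (i, j): dot[i][j] / (math.sqrt(norm[i]) * math.sqrt(norm[j]))
        for i, j in combinations(range(n), 2)
        if norm[i] > 0 and norm[j] > 0
    }
```

DIMSUM's sampling approximates exactly these quantities, which is why a direct version is the natural baseline for comparison.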
[jira] [Resolved] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3398. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2339 [https://github.com/apache/spark/pull/2339] > Have spark-ec2 intelligently wait for specific cluster states > - > > Key: SPARK-3398 > URL: https://issues.apache.org/jira/browse/SPARK-3398 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 1.2.0 > > > {{spark-ec2}} currently has retry logic for when it tries to install stuff on > a cluster and for when it tries to destroy security groups. > It would be better to have some logic that allows {{spark-ec2}} to explicitly > wait for when all the nodes in a cluster it is working on have reached a > specific state. > Examples: > * Wait for all nodes to be up > * Wait for all nodes to be up and accepting SSH connections (then start > installing stuff) > * Wait for all nodes to be down > * Wait for all nodes to be terminated (then delete the security groups) > Having a function in the {{spark_ec2.py}} script that blocks until the > desired cluster state is reached would reduce the need for various retry > logic. It would probably also eliminate the need for the {{--wait}} parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
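The blocking helper described above could look roughly like this sketch; `get_instance_states` is a hypothetical callable standing in for whatever EC2 query `spark_ec2.py` would actually issue:

```python
import time

def wait_for_cluster_state(get_instance_states, desired_state,
                           poll_interval_s=5, timeout_s=600):
    # Poll until every node reports the desired state ("running",
    # "terminated", ...), replacing ad-hoc retry loops and --wait.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        states = get_instance_states()
        if states and all(s == desired_state for s in states):
            return True
        time.sleep(poll_interval_s)
    return False
```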
[jira] [Resolved] (SPARK-3832) Upgrade Breeze dependency to 0.10
[ https://issues.apache.org/jira/browse/SPARK-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3832. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2693 [https://github.com/apache/spark/pull/2693] > Upgrade Breeze dependency to 0.10 > - > > Key: SPARK-3832 > URL: https://issues.apache.org/jira/browse/SPARK-3832 > Project: Spark > Issue Type: Task > Components: MLlib >Reporter: DB Tsai >Assignee: DB Tsai > Fix For: 1.2.0 > > > In Breeze 0.10, the L1regParam can be configured through an anonymous > function in OWLQN, and each component can be penalized differently. This is > required for GLMNET in MLlib with L1/L2 regularization. > https://github.com/scalanlp/breeze/commit/2570911026aa05aa1908ccf7370bc19cd8808a4c -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3832) Upgrade Breeze dependency to 0.10
[ https://issues.apache.org/jira/browse/SPARK-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3832: - Assignee: DB Tsai > Upgrade Breeze dependency to 0.10 > - > > Key: SPARK-3832 > URL: https://issues.apache.org/jira/browse/SPARK-3832 > Project: Spark > Issue Type: Task > Components: MLlib >Reporter: DB Tsai >Assignee: DB Tsai > Fix For: 1.2.0 > > > In Breeze 0.10, the L1regParam can be configured through an anonymous > function in OWLQN, and each component can be penalized differently. This is > required for GLMNET in MLlib with L1/L2 regularization. > https://github.com/scalanlp/breeze/commit/2570911026aa05aa1908ccf7370bc19cd8808a4c -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3841) Pretty-print Params case classes for tests
[ https://issues.apache.org/jira/browse/SPARK-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162795#comment-14162795 ] Apache Spark commented on SPARK-3841: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/2700 > Pretty-print Params case classes for tests > -- > > Key: SPARK-3841 > URL: https://issues.apache.org/jira/browse/SPARK-3841 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Provide a parent class for the Params case classes used in many MLlib > examples, where the parent class pretty-prints the case class fields: > Param1Name Param1Value > Param2Name Param2Value > ... > Using this class will make it easier to print test settings to logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3838) Python code example for Word2Vec in user guide
[ https://issues.apache.org/jira/browse/SPARK-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3838: - Assignee: (was: Liquan Pei) > Python code example for Word2Vec in user guide > -- > > Key: SPARK-3838 > URL: https://issues.apache.org/jira/browse/SPARK-3838 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Reporter: Xiangrui Meng >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3486) Add PySpark support for Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-3486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3486. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2356 [https://github.com/apache/spark/pull/2356] > Add PySpark support for Word2Vec > > > Key: SPARK-3486 > URL: https://issues.apache.org/jira/browse/SPARK-3486 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Affects Versions: 1.2.0 >Reporter: Liquan Pei >Assignee: Liquan Pei > Fix For: 1.2.0 > > Original Estimate: 336h > Remaining Estimate: 336h > > Add PySpark support for Word2Vec -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3841) Pretty-print Params case classes for tests
Joseph K. Bradley created SPARK-3841: Summary: Pretty-print Params case classes for tests Key: SPARK-3841 URL: https://issues.apache.org/jira/browse/SPARK-3841 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Provide a parent class for the Params case classes used in many MLlib examples, where the parent class pretty-prints the case class fields: Param1Name Param1Type Param1Value Param2Name Param2Type Param2Value ... Using this class will make it easier to print test settings to logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3841) Pretty-print Params case classes for tests
[ https://issues.apache.org/jira/browse/SPARK-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3841: - Description: Provide a parent class for the Params case classes used in many MLlib examples, where the parent class pretty-prints the case class fields: Param1Name Param1Value Param2Name Param2Value ... Using this class will make it easier to print test settings to logs. was: Provide a parent class for the Params case classes used in many MLlib examples, where the parent class pretty-prints the case class fields: Param1Name Param1Type Param1Value Param2Name Param2Type Param2Value ... Using this class will make it easier to print test settings to logs. > Pretty-print Params case classes for tests > -- > > Key: SPARK-3841 > URL: https://issues.apache.org/jira/browse/SPARK-3841 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Provide a parent class for the Params case classes used in many MLlib > examples, where the parent class pretty-prints the case class fields: > Param1NameParam1Value > Param2NameParam2Value > ... > Using this class will make it easier to print test settings to logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
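A minimal Python analog of the idea (Spark's actual version would be a Scala parent of the example case classes; `PrettyParams` and `ExampleParams` are illustrative names):

```python
from dataclasses import dataclass, fields

class PrettyParams:
    # Parent class: render each field on its own line, name and value
    # separated by a tab, so test settings are easy to grep in logs.
    def pretty(self):
        return "\n".join(
            "{}\t{}".format(f.name, getattr(self, f.name))
            for f in fields(self)
        )

@dataclass
class ExampleParams(PrettyParams):
    numIterations: int = 10
    stepSize: float = 0.1
```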
[jira] [Resolved] (SPARK-3790) CosineSimilarity via DIMSUM example
[ https://issues.apache.org/jira/browse/SPARK-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3790. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2622 [https://github.com/apache/spark/pull/2622] > CosineSimilarity via DIMSUM example > --- > > Key: SPARK-3790 > URL: https://issues.apache.org/jira/browse/SPARK-3790 > Project: Spark > Issue Type: Improvement >Reporter: Reza Zadeh >Assignee: Reza Zadeh > Fix For: 1.2.0 > > > Create an example that gives approximation error for DIMSUM using arbitrary > RowMatrix given via commandline. > PR tracking this: > https://github.com/apache/spark/pull/2622 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3840) Spark EC2 templates fail when variables are missing
[ https://issues.apache.org/jira/browse/SPARK-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162783#comment-14162783 ] Allan Douglas R. de Oliveira commented on SPARK-3840: - PR: https://github.com/mesos/spark-ec2/pull/74 > Spark EC2 templates fail when variables are missing > --- > > Key: SPARK-3840 > URL: https://issues.apache.org/jira/browse/SPARK-3840 > Project: Spark > Issue Type: Bug > Components: EC2 >Reporter: Allan Douglas R. de Oliveira > > For instance https://github.com/mesos/spark-ec2/pull/58 introduced this > problem when AWS_ACCESS_KEY_ID isn't set: > Configuring /root/shark/conf/shark-env.sh > Traceback (most recent call last): > File "./deploy_templates.py", line 91, in > text = text.replace("{{" + key + "}}", template_vars[key]) > TypeError: expected a character buffer object > This makes all the cluster configuration fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3654) Implement all extended HiveQL statements/commands with a separate parser combinator
[ https://issues.apache.org/jira/browse/SPARK-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162757#comment-14162757 ] Apache Spark commented on SPARK-3654: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/2698 > Implement all extended HiveQL statements/commands with a separate parser > combinator > --- > > Key: SPARK-3654 > URL: https://issues.apache.org/jira/browse/SPARK-3654 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian > Fix For: 1.2.0 > > > Statements and commands like {{SET}}, {{CACHE TABLE}} and {{ADD JAR}} etc. > are currently parsed in a quite hacky way, like this: > {code} > if (sql.trim.toLowerCase.startsWith("cache table")) { > sql.trim.toLowerCase.startsWith("cache table") match { > ... > } > } > {code} > It would be much better to add an extra parser combinator that parses these > syntax extensions first, and then fallback to the normal Hive parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
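The proposed dispatch — try the extended-syntax parsers first, then fall back to the normal Hive parser — can be sketched in Python (a toy stand-in for a Scala parser combinator; the names are illustrative):

```python
def parse_cache_table(sql):
    # One extension parser: recognizes "CACHE TABLE <name>", else None.
    tokens = sql.strip().split()
    if len(tokens) == 3 and [t.lower() for t in tokens[:2]] == ["cache", "table"]:
        return ("CacheTableCommand", tokens[2])
    return None

def parse_sql(sql, extension_parsers, fallback_parse):
    # Try each extension parser in order; the first one that recognizes
    # the statement wins, otherwise defer to the normal Hive parser.
    for parser in extension_parsers:
        plan = parser(sql)
        if plan is not None:
            return plan
    return fallback_parse(sql)
```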
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162755#comment-14162755 ] Shivaram Venkataraman commented on SPARK-3821: -- 1. Yes - the same stuff is installed on master and slaves. In fact they have the same AMI. 2. The base Spark AMI is created using `create_image.sh` (from a base Amazon AMI) -- After that we pass in the AMI-ID to `spark_ec2.py` which calls `setup.sh` on the master. > Develop an automated way of creating Spark images (AMI, Docker, and others) > --- > > Key: SPARK-3821 > URL: https://issues.apache.org/jira/browse/SPARK-3821 > Project: Spark > Issue Type: Improvement > Components: Build, EC2 >Reporter: Nicholas Chammas > > Right now the creation of Spark AMIs or Docker containers is done manually. > With tools like [Packer|http://www.packer.io/], we should be able to automate > this work, and do so in such a way that multiple types of machine images can > be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3777) Display "Executor ID" for Tasks in Stage page
[ https://issues.apache.org/jira/browse/SPARK-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3777. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Target Version/s: 1.1.1, 1.2.0 > Display "Executor ID" for Tasks in Stage page > - > > Key: SPARK-3777 > URL: https://issues.apache.org/jira/browse/SPARK-3777 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.0.0, 1.0.2, 1.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Labels: easy > Fix For: 1.1.1, 1.2.0 > > > Now the Stage page only displays "Executor"(host) for tasks. However, there > may be more than one Executors running in the same host. Currently, when some > task is hung, I only know the host of the faulty executor. Therefore I have > to check all executors in the host. > Adding "Executor ID" would be helpful to locate the faulty executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3777) Display "Executor ID" for Tasks in Stage page
[ https://issues.apache.org/jira/browse/SPARK-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3777: - Assignee: Shixiong Zhu > Display "Executor ID" for Tasks in Stage page > - > > Key: SPARK-3777 > URL: https://issues.apache.org/jira/browse/SPARK-3777 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.0.0, 1.0.2, 1.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Labels: easy > Fix For: 1.1.1, 1.2.0 > > > Now the Stage page only displays "Executor"(host) for tasks. However, there > may be more than one Executors running in the same host. Currently, when some > task is hung, I only know the host of the faulty executor. Therefore I have > to check all executors in the host. > Adding "Executor ID" would be helpful to locate the faulty executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3840) Spark EC2 templates fail when variables are missing
Allan Douglas R. de Oliveira created SPARK-3840: --- Summary: Spark EC2 templates fail when variables are missing Key: SPARK-3840 URL: https://issues.apache.org/jira/browse/SPARK-3840 Project: Spark Issue Type: Bug Components: EC2 Reporter: Allan Douglas R. de Oliveira For instance https://github.com/mesos/spark-ec2/pull/58 introduced this problem when AWS_ACCESS_KEY_ID isn't set: Configuring /root/shark/conf/shark-env.sh Traceback (most recent call last): File "./deploy_templates.py", line 91, in text = text.replace("{{" + key + "}}", template_vars[key]) TypeError: expected a character buffer object This makes all the cluster configuration fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
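A sketch of the failure mode and a defensive fix: the replace loop mirrors the traceback above, and on Python 2 passing `None` (an unset environment variable) where `str.replace` expects a string raises exactly this `TypeError`:

```python
def render(text, template_vars):
    # Substitute {{key}} placeholders; fall back to an empty string when a
    # variable (e.g. an unset AWS_ACCESS_KEY_ID) is None instead of crashing.
    for key, value in template_vars.items():
        text = text.replace("{{" + key + "}}", value if value is not None else "")
    return text
```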
[jira] [Commented] (SPARK-3661) spark.*.memory is ignored in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162721#comment-14162721 ] Apache Spark commented on SPARK-3661: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/2697 > spark.*.memory is ignored in cluster mode > - > > Key: SPARK-3661 > URL: https://issues.apache.org/jira/browse/SPARK-3661 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > > This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for > the config. Note that `spark.executor.memory` is fine only in standalone and > mesos mode because we pass the Spark system properties to the driver after it > has started. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3661) spark.*.memory is ignored in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3661: - Description: This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for the config. Note that `spark.executor.memory` is fine only in standalone and mesos mode because we pass the Spark system properties to the driver after it has started. (was: This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for the config. Note that `spark.executor.memory` is fine only in standalone mode because we pass the Spark system properties to the driver after it has started.) > spark.*.memory is ignored in cluster mode > - > > Key: SPARK-3661 > URL: https://issues.apache.org/jira/browse/SPARK-3661 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > > This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for > the config. Note that `spark.executor.memory` is fine only in standalone and > mesos mode because we pass the Spark system properties to the driver after it > has started. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3661) spark.*.memory is ignored in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3661: - Description: This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for the config. Note that `spark.executor.memory` is fine only in standalone mode because we pass the Spark system properties to the driver after it has started. (was: This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for the config. Note that `spark.executor.memory` is fine because we pass the Spark system properties to the driver after it has started.) > spark.*.memory is ignored in cluster mode > - > > Key: SPARK-3661 > URL: https://issues.apache.org/jira/browse/SPARK-3661 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > > This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for > the config. Note that `spark.executor.memory` is fine only in standalone mode > because we pass the Spark system properties to the driver after it has > started. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3661) spark.*.memory is ignored in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3661: - Summary: spark.*.memory is ignored in cluster mode (was: spark.driver.memory is ignored in cluster mode) > spark.*.memory is ignored in cluster mode > - > > Key: SPARK-3661 > URL: https://issues.apache.org/jira/browse/SPARK-3661 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > > This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for > the config. Note that `spark.executor.memory` is fine because we pass the > Spark system properties to the driver after it has started. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3839) Reimplement HashOuterJoin to construct hash table of only one relation
Liquan Pei created SPARK-3839: - Summary: Reimplement HashOuterJoin to construct hash table of only one relation Key: SPARK-3839 URL: https://issues.apache.org/jira/browse/SPARK-3839 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Liquan Pei Currently, in HashOuterJoin, we build hash tables for both relations; however, for a left/right outer join we only need to build a hash table for one relation. For example, for a left outer join, we build a hash table for the right relation and stream the left relation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
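The proposed plan for a LEFT OUTER join can be sketched as follows — build a hash table only for the right (build-side) relation and stream the left one:

```python
from collections import defaultdict

def left_outer_hash_join(left_rows, right_rows, key):
    table = defaultdict(list)
    for row in right_rows:          # build side: right relation only
        table[key(row)].append(row)
    for row in left_rows:           # stream side: never materialized
        matches = table.get(key(row))
        if matches:
            for m in matches:
                yield (row, m)
        else:
            yield (row, None)       # outer side: emit with a null match
```

A right outer join is symmetric (build left, stream right); only a full outer join genuinely needs to track both sides.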
[jira] [Created] (SPARK-3838) Python code example for Word2Vec in user guide
Xiangrui Meng created SPARK-3838: Summary: Python code example for Word2Vec in user guide Key: SPARK-3838 URL: https://issues.apache.org/jira/browse/SPARK-3838 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Liquan Pei Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162685#comment-14162685 ] Sandy Ryza commented on SPARK-3174: --- bq. Maybe it makes sense to just call it `spark.dynamicAllocation.*` That sounds good to me. bq. I think in general we should limit the number of things that will affect adding/removing executors. Otherwise an application might get/lose many executors all of a sudden without a good understanding of why. Also anticipating what's needed in a future stage is usually fairly difficult, because you don't know a priori how long each stage is running. I don't see a good metric to decide how far in the future to anticipate for. Consider the (common) case of a user keeping a Hive session open and setting a low number of minimum executors in order to not sit on cluster resources when idle. Goal number 1 should be making queries return as fast as possible. A policy that, upon receiving a job, simply requested executors with enough slots to handle all the tasks required by the first stage would be a vast latency and user experience improvement over the exponential increase policy. Given that resource managers like YARN will mediate fairness between users and that Spark will be able to give executors back, there's not much advantage to being conservative or ramping up slowly in this case. Accurately anticipating resource needs is difficult, but not necessary. > Provide elastic scaling within a Spark application > -- > > Key: SPARK-3174 > URL: https://issues.apache.org/jira/browse/SPARK-3174 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 1.0.2 >Reporter: Sandy Ryza >Assignee: Andrew Or > Attachments: SPARK-3174design.pdf, > dynamic-scaling-executors-10-6-14.pdf > > > A common complaint with Spark in a multi-tenant environment is that > applications have a fixed allocation that doesn't grow and shrink with their > resource needs. 
We're blocked on YARN-1197 for dynamically changing the > resources within executors, but we can still allocate and discard whole > executors. > It would be useful to have some heuristics that > * Request more executors when many pending tasks are building up > * Discard executors when they are idle > See the latest design doc for more information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
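The "request enough slots for the first stage" policy discussed in the comment amounts to a one-line sizing rule; a hedged sketch, where `executors_for_stage` and its parameters are illustrative rather than actual Spark configuration:

```python
import math

def executors_for_stage(num_tasks, cores_per_executor, max_executors):
    # Request enough executors to run every task of the incoming stage
    # at once, capped by a configured maximum; the resource manager
    # (e.g. YARN) still mediates fairness between users.
    wanted = math.ceil(num_tasks / cores_per_executor)
    return min(wanted, max_executors)
```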
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162663#comment-14162663 ] Nicholas Chammas commented on SPARK-3821: - [~shivaram] / [~pwendell]: # In a Spark cluster, what's the difference between what's installed on the master and what's installed on the slaves? Is it basically the same stuff, just with minor configuration changes? # Starting from a base AMI, is the rough procedure for creating a fully built Spark instance simply running [{{create_image.sh}}|https://github.com/mesos/spark-ec2/blob/v3/create_image.sh] followed by [{{setup.sh}}|https://github.com/mesos/spark-ec2/blob/v3/setup.sh] (minus the stuff that connects to other instances)? > Develop an automated way of creating Spark images (AMI, Docker, and others) > --- > > Key: SPARK-3821 > URL: https://issues.apache.org/jira/browse/SPARK-3821 > Project: Spark > Issue Type: Improvement > Components: Build, EC2 >Reporter: Nicholas Chammas > > Right now the creation of Spark AMIs or Docker containers is done manually. > With tools like [Packer|http://www.packer.io/], we should be able to automate > this work, and do so in such a way that multiple types of machine images can > be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162662#comment-14162662 ] Reza Farivar edited comment on SPARK-3785 at 10/7/14 10:12 PM: --- Note: project Sumatra might become a part of Java 9, so we might get official GPU support in Java some time in the future. Sean, I agree that the memory copying is an overhead, but for the right application it can become small enough to ignore. Also, you can apply a series of operations on an RDD before moving it back to the CPU land. Think rdd.map(x => sine(x)*x).filter( _ < 100).map(x=> 1/x)... The distributed nature of the RDD could mean we can run a whole stage in the GPU land, with each task would run on different GPU in the cluster not needing to get back in the CPU land until we get to a collect() or groupBy(), etc. I imagine we can have a subclass of ShuffleMapTask that lives in the GPU land and would call a GPU kernel when the runtask() is called. In fact, given that we have a good number of specialized RDDs, I think we could have specialized GPU versions of them easily (say, the CartesianRDD for instance). Where it gets tougher is in the mappedRDD function, where you would want to pass the arbitrary function to the GPU and hope that it runs. was (Author: rfarivar): I thought to add that the project Sumatra might become a part of Java 9, so we might get official GPU support in Java some time in the future. Sean, I agree that the memory copying is an overhead, but for the right application it can become small enough to ignore. Also, you can apply a series of operations on an RDD before moving it back to the CPU land. Think rdd.map(x => sine(x)*x).filter( _ < 100).map(x=> 1/x)... The distributed nature of the RDD could mean we can run a whole stage in the GPU land, with each task would run on different GPU in the cluster not needing to get back in the CPU land until we get to a collect() or groupBy(), etc. 
I imagine we can have a subclass of ShuffleMapTask that lives in the GPU land and would call a GPU kernel when the runtask() is called. In fact, given that we have a good number of specialized RDDs, I think we could have specialized GPU versions of them easily (say, the CartesianRDD for instance). Where it gets tougher is in the mappedRDD function, where you would want to pass the arbitrary function to the GPU and hope that it runs. > Support off-loading computations to a GPU > - > > Key: SPARK-3785 > URL: https://issues.apache.org/jira/browse/SPARK-3785 > Project: Spark > Issue Type: Brainstorming > Components: MLlib >Reporter: Thomas Darimont >Priority: Minor > > Are there any plans to adding support for off-loading computations to the > GPU, e.g. via an open-cl binding? > http://www.jocl.org/ > https://code.google.com/p/javacl/ > http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162662#comment-14162662 ] Reza Farivar edited comment on SPARK-3785 at 10/7/14 10:13 PM: --- Note: project Sumatra might become part of Java 9, so we might get official GPU support in Java some time in the future. Sean, I agree that the memory copying is an overhead, but for the right application it can become small enough to ignore. Also, you can apply a series of operations on an RDD before moving it back to CPU land. Think rdd.map(x => math.sin(x) * x).filter(_ < 100).map(x => 1 / x). The distributed nature of the RDD could mean we can run a whole stage on the GPU, with each task running on a different GPU in the cluster and not needing to get back to CPU land until we reach a collect() or groupBy(), etc.
I imagine we could have a subclass of ShuffleMapTask that lives in GPU land and calls a GPU kernel when runTask() is called. In fact, given that we already have a good number of specialized RDDs, I think we could have specialized GPU versions of them fairly easily (say, CartesianRDD for instance). Where it gets tougher is MappedRDD, where you would want to pass an arbitrary function to the GPU and hope that it runs. > Support off-loading computations to a GPU > - > > Key: SPARK-3785 > URL: https://issues.apache.org/jira/browse/SPARK-3785 > Project: Spark > Issue Type: Brainstorming > Components: MLlib >Reporter: Thomas Darimont >Priority: Minor > > Are there any plans to add support for off-loading computations to the > GPU, e.g. via an OpenCL binding? > http://www.jocl.org/ > https://code.google.com/p/javacl/ > http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
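The operation chain in the comment above can be sketched in plain Python, with a list standing in for one RDD partition, to show what a single fused "GPU stage" would have to evaluate before returning to CPU land. No Spark or GPU API is involved; `fused_stage` is an illustrative name.

```python
import math

# Sketch of the fused map/filter/map pipeline from the comment. Generators
# keep the steps lazy, mirroring how a fused stage would stream elements
# through all three operations without materializing intermediates.
def fused_stage(partition):
    step1 = (math.sin(x) * x for x in partition)   # map: sine(x) * x
    step2 = (x for x in step1 if x < 100)          # filter: keep values < 100
    return [1 / x for x in step2]                  # map: reciprocal

out = fused_stage([0.5, 1.0, 2.0])
```

A real GPU offload would compile these three lambdas into one kernel; the sketch only shows the dataflow a fused stage must preserve.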
[jira] [Updated] (SPARK-3682) Add helpful warnings to the UI
[ https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-3682: -- Attachment: SPARK-3682Design.pdf Posting an initial design > Add helpful warnings to the UI > -- > > Key: SPARK-3682 > URL: https://issues.apache.org/jira/browse/SPARK-3682 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 1.1.0 >Reporter: Sandy Ryza > Attachments: SPARK-3682Design.pdf > > > Spark has a zillion configuration options and a zillion different things that > can go wrong with a job. Improvements like incremental and better metrics > and the proposed spark replay debugger provide more insight into what's going > on under the covers. However, it's difficult for non-advanced users to > synthesize this information and understand where to direct their attention. > It would be helpful to have some sort of central location on the UI users > could go to that would provide indications about why an app/job is failing or > performing poorly. > Some helpful messages that we could provide: > * Warn that the tasks in a particular stage are spending a long time in GC. > * Warn that spark.shuffle.memoryFraction does not fit inside the young > generation. > * Warn that tasks in a particular stage are very short, and that the number > of partitions should probably be decreased. > * Warn that tasks in a particular stage are spilling a lot, and that the > number of partitions should probably be increased. > * Warn that a cached RDD that gets a lot of use does not fit in memory, and a > lot of time is being spent recomputing it. > To start, probably two kinds of warnings would be most helpful. > * Warnings at the app level that report on misconfigurations, issues with the > general health of executors. > * Warnings at the job level that indicate why a job might be performing > slowly. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
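One of the warnings proposed above (tasks in a stage spending a long time in GC) can be sketched as a simple heuristic. The function name, input shape, and 10% threshold are assumptions for illustration, not Spark's actual design.

```python
# Hypothetical sketch of a stage-level GC warning. `tasks` is a list of
# (run_time_ms, gc_time_ms) pairs; a warning fires when the aggregate GC
# fraction across the stage exceeds the threshold.
def gc_warning(stage, tasks, threshold=0.10):
    total_run = sum(run for run, _ in tasks)
    total_gc = sum(gc for _, gc in tasks)
    if total_run > 0 and total_gc / total_run > threshold:
        pct = round(100 * total_gc / total_run)
        return f"Stage {stage}: tasks spent {pct}% of their time in GC"
    return None  # stage looks healthy; no warning
```

The other proposed warnings (spilling, short tasks, recomputed cached RDDs) would follow the same pattern: aggregate a per-task metric, compare against a threshold, emit a message.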
[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162615#comment-14162615 ] Josh Rosen commented on SPARK-2321: --- I've opened a WIP pull request in order to discuss the design / implementation of a pull-based progress / status API: https://github.com/apache/spark/pull/2696. I'd like to focus on discussing the most high-level interface / API design decisions now; once we're happy with those decisions, we can focus on the details of which pieces of data to expose, etc. > Design a proper progress reporting & event listener API > --- > > Key: SPARK-2321 > URL: https://issues.apache.org/jira/browse/SPARK-2321 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Josh Rosen >Priority: Critical > > This is a ticket to track progress on redesigning the SparkListener and > JobProgressListener API. > There are multiple problems with the current design, including: > 0. I'm not sure if the API is usable in Java (there are at least some enums > we used in Scala and a bunch of case classes that might complicate things). > 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of > attention to it yet. Something as important as progress reporting deserves a > more stable API. > 2. There is no easy way to connect jobs with stages. Similarly, there is no > easy way to connect job groups with jobs / stages. > 3. JobProgressListener itself has no encapsulation at all. States can be > arbitrarily mutated by external programs. Variable names are sort of randomly > decided and inconsistent. > We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
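A pull-based status API of the kind discussed above might look like the following sketch: callers poll immutable snapshots instead of registering listeners. The class and field names here are hypothetical, not the interface in the linked pull request.

```python
from dataclasses import dataclass
from typing import List

# Immutable snapshots: the caller polls these; nothing mutates under them,
# addressing the encapsulation problems noted in the ticket.
@dataclass(frozen=True)
class StageStatus:
    stage_id: int
    num_tasks: int
    num_completed_tasks: int

    def progress(self):
        return 1.0 if self.num_tasks == 0 else self.num_completed_tasks / self.num_tasks

# A job snapshot links jobs to their stages directly, which the current
# listener API makes hard to do.
@dataclass(frozen=True)
class JobStatus:
    job_id: int
    stages: List[StageStatus]

    def is_complete(self):
        return all(s.num_completed_tasks == s.num_tasks for s in self.stages)
```

Plain immutable data classes like these also translate cleanly to Java callers, one of the stated design concerns.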
[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162614#comment-14162614 ] Apache Spark commented on SPARK-2321: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/2696 > Design a proper progress reporting & event listener API > --- > > Key: SPARK-2321 > URL: https://issues.apache.org/jira/browse/SPARK-2321 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Josh Rosen >Priority: Critical > > This is a ticket to track progress on redesigning the SparkListener and > JobProgressListener API. > There are multiple problems with the current design, including: > 0. I'm not sure if the API is usable in Java (there are at least some enums > we used in Scala and a bunch of case classes that might complicate things). > 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of > attention to it yet. Something as important as progress reporting deserves a > more stable API. > 2. There is no easy way to connect jobs with stages. Similarly, there is no > easy way to connect job groups with jobs / stages. > 3. JobProgressListener itself has no encapsulation at all. States can be > arbitrarily mutated by external programs. Variable names are sort of randomly > decided and inconsistent. > We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3637) NPE in ShuffleMapTask
[ https://issues.apache.org/jira/browse/SPARK-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162600#comment-14162600 ] Steven Lewis commented on SPARK-3637: - I see the same thing running a Java map-reduce program locally. It is a blocking issue for my development, especially since I have no clue how to address it. > NPE in ShuffleMapTask > - > > Key: SPARK-3637 > URL: https://issues.apache.org/jira/browse/SPARK-3637 > Project: Spark > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Przemyslaw Pastuszka > > When trying to execute spark.jobserver.WordCountExample using spark-jobserver > (https://github.com/ooyala/spark-jobserver) we observed that it often fails > with a NullPointerException in ShuffleMapTask.scala. Here are the full details: > {code} > Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most > recent failure: Lost task 0.3 in stage 1.0 (TID 6, > hadoop-simple-768-worker-with-zookeeper-0): java.lang.NullPointerException: > \njava.nio.ByteBuffer.wrap(ByteBuffer.java:392)\n > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:61)\n > > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)\n > org.apache.spark.scheduler.Task.run(Task.scala:54)\n > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)\n > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)\n > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n > java.lang.Thread.run(Thread.java:745)\nDriver stacktrace:", > "errorClass": "org.apache.spark.SparkException", > "stack": > ["org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153)", > > "org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142)", > > "org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141)", > 
> "scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)", > "scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)", > "org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1141)", > > "org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)", > > "org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)", > "scala.Option.foreach(Option.scala:236)", > "org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:682)", > > "org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1359)", > "akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)", > "akka.actor.ActorCell.invoke(ActorCell.scala:456)", > "akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)", > "akka.dispatch.Mailbox.run(Mailbox.scala:219)", > "akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)", > "scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)", > "scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)", > "scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)", > "scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)" > {code} > I am aware, that this failure may be due to the job being ill-defined by > spark-jobserver (I don't know if that's the case), but if so, then it should > be handled more gratefully on spark side. > What's also important, that this issue doesn't happen always, which may > indicate some type of race condition in the code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3837) Warn when YARN is killing containers for exceeding memory limits
Sandy Ryza created SPARK-3837: - Summary: Warn when YARN is killing containers for exceeding memory limits Key: SPARK-3837 URL: https://issues.apache.org/jira/browse/SPARK-3837 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.1.0 Reporter: Sandy Ryza YARN now lets application masters know when it kills their containers for exceeding memory limits. Spark should log something when this happens. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162547#comment-14162547 ] Davies Liu commented on SPARK-3461: --- [~pwendell] I will start to work on this after merging https://github.com/apache/spark/pull/1977 > Support external groupByKey using repartitionAndSortWithinPartitions > > > Key: SPARK-3461 > URL: https://issues.apache.org/jira/browse/SPARK-3461 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Davies Liu >Priority: Critical > > Given that we have SPARK-2978, it seems like we could support an external > group by operator pretty easily. We'd just have to wrap the existing iterator > exposed by SPARK-2978 with a lookahead iterator that detects the group > boundaries. Also, we'd have to override the cache() operator to cache the > parent RDD so that if this object is cached it doesn't wind through the > iterator. > I haven't totally followed all the sort-shuffle internals, but just given the > stated semantics of SPARK-2978 it seems like this would be possible. > It would be really nice to externalize this because many beginner users write > jobs in terms of groupByKey. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
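The lookahead-iterator idea in the description above can be sketched as follows: given a partition iterator already sorted by key (the guarantee repartitionAndSortWithinPartitions provides), detect group boundaries and yield one (key, values) group at a time, never materializing the whole partition. This is an illustrative sketch, not Spark's implementation.

```python
# Streaming groupByKey over a key-sorted iterator: only the current group's
# values are held in memory, which is what makes the operation "external".
def group_sorted(pairs):
    it = iter(pairs)
    try:
        key, value = next(it)
    except StopIteration:
        return  # empty partition: yield nothing
    group = [value]
    for k, v in it:
        if k == key:
            group.append(v)        # still inside the current group
        else:
            yield key, group       # group boundary detected
            key, group = k, [v]
    yield key, group               # flush the final group
```

Python's stdlib `itertools.groupby` implements the same consecutive-key grouping; the explicit version above shows the lookahead logic the comment describes.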
[jira] [Commented] (SPARK-3174) Provide elastic scaling within a Spark application
[ https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162514#comment-14162514 ] Marcelo Vanzin commented on SPARK-3174: --- Hi Andrew, thanks for writing this up. My first question, I think, is similar to Tom's. It was not clear to me how the app will behave when it starts up. I'd expect the first job to be the one that has to process the largest amount of data, so it would benefit from having as many executors as possible available as quickly as possible - something that seems to conflict with the idea of a slow start. Are you proposing a change to the current semantics, where YARN will request "--num-executors" up front? If you keep that, I think it would cover my concerns above. But switching to a slow start with no option to pre-allocate a certain number seems like it might harm certain jobs. My second question is about the shuffle service you're proposing. Have you investigated whether it would be possible to make Hadoop's shuffle service more generic, so that Spark can benefit from it? It does mean that this feature might be constrained to certain versions of Hadoop, but maybe that's not necessarily a bad thing if it means more infrastructure is shared. > Provide elastic scaling within a Spark application > -- > > Key: SPARK-3174 > URL: https://issues.apache.org/jira/browse/SPARK-3174 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 1.0.2 >Reporter: Sandy Ryza >Assignee: Andrew Or > Attachments: SPARK-3174design.pdf, > dynamic-scaling-executors-10-6-14.pdf > > > A common complaint with Spark in a multi-tenant environment is that > applications have a fixed allocation that doesn't grow and shrink with their > resource needs. We're blocked on YARN-1197 for dynamically changing the > resources within executors, but we can still allocate and discard whole > executors. 
> It would be useful to have some heuristics that > * Request more executors when many pending tasks are building up > * Discard executors when they are idle > See the latest design doc for more information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
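The two heuristics listed in the description above can be sketched as pure functions: grow the executor target when pending tasks pile up, and release executors idle past a timeout. The names, signatures, and thresholds are assumptions for illustration, not the proposed design.

```python
import math

# Heuristic 1: request enough executors to cover the pending-task backlog,
# never shrinking below the current count here and never exceeding a cap.
def target_executors(current, pending_tasks, tasks_per_executor, max_executors):
    needed = math.ceil(pending_tasks / tasks_per_executor)
    return min(max_executors, max(current, needed))

# Heuristic 2: discard executors whose idle time has reached the timeout.
# `idle_ms_by_executor` maps executor id -> milliseconds spent idle.
def executors_to_release(idle_ms_by_executor, idle_timeout_ms):
    return {e for e, idle in idle_ms_by_executor.items() if idle >= idle_timeout_ms}
```

Keeping the two decisions separate (scale-up driven by backlog, scale-down driven by idleness) matches the split in the description; a real policy would also rate-limit requests, which the slow-start discussion in the comments is about.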
[jira] [Commented] (SPARK-3836) Spark REPL optionally propagate internal exceptions
[ https://issues.apache.org/jira/browse/SPARK-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162470#comment-14162470 ] Apache Spark commented on SPARK-3836: - User 'ahirreddy' has created a pull request for this issue: https://github.com/apache/spark/pull/2695 > Spark REPL optionally propagate internal exceptions > > > Key: SPARK-3836 > URL: https://issues.apache.org/jira/browse/SPARK-3836 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ahir Reddy >Priority: Minor > > Optionally have the repl throw exceptions generated by interpreted code, > instead of swallowing the exception and returning it as text output. This is > useful when embedding the repl, otherwise it's not possible to know when user > code threw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3836) Spark REPL optionally propagate internal exceptions
Ahir Reddy created SPARK-3836: - Summary: Spark REPL optionally propagate internal exceptions Key: SPARK-3836 URL: https://issues.apache.org/jira/browse/SPARK-3836 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ahir Reddy Priority: Minor Optionally have the repl throw exceptions generated by interpreted code, instead of swallowing the exception and returning it as text output. This is useful when embedding the repl, otherwise it's not possible to know when user code threw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3731) RDD caching stops working in pyspark after some time
[ https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3731. --- Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Fixed by Davies' PR, which I backported to 1.1. > RDD caching stops working in pyspark after some time > > > Key: SPARK-3731 > URL: https://issues.apache.org/jira/browse/SPARK-3731 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.0.2, 1.1.0, 1.2.0 > Environment: Linux, 32bit, both in local mode or in standalone > cluster mode >Reporter: Milan Straka >Assignee: Davies Liu >Priority: Critical > Fix For: 1.1.1, 1.2.0 > > Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, > worker.log > > > Consider a file F which when loaded with sc.textFile and cached takes up > slightly more than half of free memory for RDD cache. > When in PySpark the following is executed: > 1) a = sc.textFile(F) > 2) a.cache().count() > 3) b = sc.textFile(F) > 4) b.cache().count() > and then the following is repeated (for example 10 times): > a) a.unpersist().cache().count() > b) b.unpersist().cache().count() > after some time, there are no RDDs cached in memory. > Also, since that time, no other RDD ever gets cached (the worker always > reports something like "WARN CacheManager: Not enough space to cache > partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if > rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that > all executors have 0MB memory used (which is consistent with the CacheManager > warning). > When doing the same in Scala, everything works perfectly. > I understand that this is a vague description, but I do not know how to > describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3828) Spark returns inconsistent results when building with different Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162439#comment-14162439 ] Xiangrui Meng commented on SPARK-3828: -- I re-opened this because it may be a serious problem. Usually, the line reader skips the first record if the start pos is not 0 and always reads one extra record after the end pos. In the case [~liquanpei] found, the reader for the second partition doesn't skip the first record. So there exists duplicate and incorrect content in the resulting RDD. This could be a bug in Hadoop 2.4.0. But since Spark heavily depends on `sc.textFile`, it is worth figuring out why. > Spark returns inconsistent results when building with different Hadoop > version > --- > > Key: SPARK-3828 > URL: https://issues.apache.org/jira/browse/SPARK-3828 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: OSX 10.9, Spark master branch >Reporter: Liquan Pei > > For text8 data at http://mattmahoney.net/dc/text8.zip. To reproduce, please > unzip first. > Spark build with different Hadoop version returns different result. > {code} > val data = sc.textFile("text8") > data.count() > {code} > returns 1 when built with SPARK_HADOOP_VERSION=1.0.4 and return 2 when built > with SPARK_HADOOP_VERSION=2.4.0. > Looking through the rdd code, it seems that textFile uses hadoopFile which > creates HadoopRDD, we should probably create newHadoopRDD when building spark > with SPARK_HADOOP_VERSION >= 2.0.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
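The split-boundary convention described in the comment above - a reader whose split does not start at byte 0 skips the (possibly partial) first record, and every reader consumes one extra record past its end offset - can be simulated over an in-memory string. This is a simplified sketch of the convention, not Hadoop's LineRecordReader.

```python
# Simulate a line-oriented split reader over `text` covering bytes
# [start, end). Together, the skip-first-line and read-one-extra-line rules
# guarantee each line is read by exactly one split's reader.
def read_split(text, start, end):
    pos = start
    if start > 0:
        # The partial first line belongs to the previous split's reader.
        nl = text.find("\n", pos)
        pos = len(text) if nl == -1 else nl + 1
    lines = []
    # Keep reading while the line still starts at or before `end`; the line
    # that crosses `end` is read in full (the "one extra record").
    while pos < len(text) and pos <= end:
        nl = text.find("\n", pos)
        stop = len(text) if nl == -1 else nl
        lines.append(text[pos:stop])
        pos = stop + 1
    return lines
```

The bug described above corresponds to the second reader failing to perform the skip step, so the crossing line is produced twice (once whole, once truncated).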
[jira] [Reopened] (SPARK-3828) Spark returns inconsistent results when building with different Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-3828: -- > Spark returns inconsistent results when building with different Hadoop > version > --- > > Key: SPARK-3828 > URL: https://issues.apache.org/jira/browse/SPARK-3828 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: OSX 10.9, Spark master branch >Reporter: Liquan Pei > > For text8 data at http://mattmahoney.net/dc/text8.zip. To reproduce, please > unzip first. > Spark build with different Hadoop version returns different result. > {code} > val data = sc.textFile("text8") > data.count() > {code} > returns 1 when built with SPARK_HADOOP_VERSION=1.0.4 and return 2 when built > with SPARK_HADOOP_VERSION=2.4.0. > Looking through the rdd code, it seems that textFile uses hadoopFile which > creates HadoopRDD, we should probably create newHadoopRDD when building spark > with SPARK_HADOOP_VERSION >= 2.0.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3825) Log more information when unrolling a block fails
[ https://issues.apache.org/jira/browse/SPARK-3825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3825. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 > Log more information when unrolling a block fails > - > > Key: SPARK-3825 > URL: https://issues.apache.org/jira/browse/SPARK-3825 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.1.1, 1.2.0 > > > We currently log only the following: > {code} > 14/10/06 16:45:42 WARN CacheManager: Not enough space to cache partition > rdd_0_2 in memory! Free memory is 481861527 bytes. > {code} > This is confusing, however, because "free memory" here means the amount of > memory not occupied by blocks. It does not include the amount of memory > reserved for unrolling. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3816) Add configureOutputJobPropertiesForStorageHandler to JobConf in SparkHadoopWriter
[ https://issues.apache.org/jira/browse/SPARK-3816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162420#comment-14162420 ] Apache Spark commented on SPARK-3816: - User 'alexliu68' has created a pull request for this issue: https://github.com/apache/spark/pull/2677 > Add configureOutputJobPropertiesForStorageHandler to JobConf in > SparkHadoopWriter > - > > Key: SPARK-3816 > URL: https://issues.apache.org/jira/browse/SPARK-3816 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Alex Liu > > It's similar to SPARK-2846. We should add > PlanUtils.configureInputJobPropertiesForStorageHandler to SparkHadoopWriter, > so that writer can add configuration from customized StorageHandler to JobConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3790) CosineSimilarity via DIMSUM example
[ https://issues.apache.org/jira/browse/SPARK-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162415#comment-14162415 ] Apache Spark commented on SPARK-3790: - User 'rezazadeh' has created a pull request for this issue: https://github.com/apache/spark/pull/2622 > CosineSimilarity via DIMSUM example > --- > > Key: SPARK-3790 > URL: https://issues.apache.org/jira/browse/SPARK-3790 > Project: Spark > Issue Type: Improvement >Reporter: Reza Zadeh >Assignee: Reza Zadeh > > Create an example that gives approximation error for DIMSUM using arbitrary > RowMatrix given via commandline. > PR tracking this: > https://github.com/apache/spark/pull/2622 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2759) The ability to read binary files into Spark
[ https://issues.apache.org/jira/browse/SPARK-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162403#comment-14162403 ] Apache Spark commented on SPARK-2759: - User 'kmader' has created a pull request for this issue: https://github.com/apache/spark/pull/1658 > The ability to read binary files into Spark > --- > > Key: SPARK-2759 > URL: https://issues.apache.org/jira/browse/SPARK-2759 > Project: Spark > Issue Type: Improvement > Components: Input/Output, Java API, Spark Core >Reporter: Kevin Mader > > For reading images, compressed files, or other custom formats it would be > useful to have methods that could read the files in as a byte array or > DataInputStream so other functions could then process the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3166) Custom serialisers can't be shipped in application jars
[ https://issues.apache.org/jira/browse/SPARK-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162409#comment-14162409 ] Apache Spark commented on SPARK-3166: - User 'GrahamDennis' has created a pull request for this issue: https://github.com/apache/spark/pull/1890 > Custom serialisers can't be shipped in application jars > --- > > Key: SPARK-3166 > URL: https://issues.apache.org/jira/browse/SPARK-3166 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.2 >Reporter: Graham Dennis > > Spark cannot currently use a custom serialiser that is shipped with the > application jar. Trying to do this causes a java.lang.ClassNotFoundException > when trying to instantiate the custom serialiser in the Executor processes. > This occurs because Spark attempts to instantiate the custom serialiser > before the application jar has been shipped to the Executor process. A > reproduction of the problem is available here: > https://github.com/GrahamDennis/spark-custom-serialiser > I've verified this problem in Spark 1.0.2, and Spark master and 1.1 branches > as of August 21, 2014. This issue is related to SPARK-2878, and my fix for > that issue (https://github.com/apache/spark/pull/1890) also solves this. My > pull request was not merged because it adds the user jar to the Executor > processes' class path at launch time. Such a significant change was thought > by [~rxin] to require more QA, and should be considered for inclusion in 1.2 > at the earliest. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2489) Unsupported parquet datatype optional fixed_len_byte_array
[ https://issues.apache.org/jira/browse/SPARK-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162407#comment-14162407 ] Apache Spark commented on SPARK-2489: - User 'joesu' has created a pull request for this issue: https://github.com/apache/spark/pull/1737 > Unsupported parquet datatype optional fixed_len_byte_array > -- > > Key: SPARK-2489 > URL: https://issues.apache.org/jira/browse/SPARK-2489 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Pei-Lun Lee > > tested against commit 9fe693b5 > {noformat} > scala> sqlContext.parquetFile("/tmp/foo") > java.lang.RuntimeException: Unsupported parquet datatype optional > fixed_len_byte_array(4) b > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279) > {noformat} > example avro schema > {noformat} > protocol Test { > fixed Bytes4(4); > record Foo { > union {null, Bytes4} b; > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3580) Add Consistent Method To Get Number of RDD Partitions Across Different Languages
[ https://issues.apache.org/jira/browse/SPARK-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162413#comment-14162413 ] Apache Spark commented on SPARK-3580: - User 'patmcdonough' has created a pull request for this issue: https://github.com/apache/spark/pull/2447 > Add Consistent Method To Get Number of RDD Partitions Across Different > Languages > > > Key: SPARK-3580 > URL: https://issues.apache.org/jira/browse/SPARK-3580 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 1.1.0 >Reporter: Pat McDonough > Labels: starter > > Programmatically retrieving the number of partitions is not consistent > between python and scala. A consistent method should be defined and made > public across both languages. > RDD.partitions.size is also used quite frequently throughout the internal > code, so that might be worth refactoring as well once the new method is > available. > What we have today is below. > In Scala: > {code} > scala> someRDD.partitions.size > res0: Int = 30 > {code} > In Python: > {code} > In [2]: someRDD.getNumPartitions() > Out[2]: 30 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2016) rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large
[ https://issues.apache.org/jira/browse/SPARK-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162405#comment-14162405 ] Apache Spark commented on SPARK-2016: - User 'carlosfuertes' has created a pull request for this issue: https://github.com/apache/spark/pull/1682 > rdd in-memory storage UI becomes unresponsive when the number of RDD > partitions is large > > > Key: SPARK-2016 > URL: https://issues.apache.org/jira/browse/SPARK-2016 > Project: Spark > Issue Type: Sub-task >Reporter: Reynold Xin > Labels: starter > > Try run > {code} > sc.parallelize(1 to 100, 100).cache().count() > {code} > And open the storage UI for this RDD. It takes forever to load the page. > When the number of partitions is very large, I think there are a few > alternatives: > 0. Only show the top 1000. > 1. Pagination > 2. Instead of grouping by RDD blocks, group by executors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
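Of the alternatives listed in the ticket, pagination is the least invasive to sketch. Below is a minimal, hypothetical Python illustration (names are invented for this example, not taken from the Spark codebase) of slicing a large list of RDD block rows into fixed-size pages so only one page is ever rendered:

```python
def paginate(rows, page, page_size=1000):
    """Return the slice of `rows` for 1-indexed `page`, plus the total page count."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    total_pages = max(1, -(-len(rows) // page_size))  # ceiling division
    start = (page - 1) * page_size
    return rows[start:start + page_size], total_pages

# Example: 100,000 simulated RDD block rows; render only the first page.
blocks = [("rdd_0_%d" % i, "MEMORY") for i in range(100_000)]
page, total = paginate(blocks, page=1)
```

The UI would then render `page` (1,000 rows) instead of all 100,000, with `total` driving the page links.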
[jira] [Commented] (SPARK-3812) Adapt maven build to publish effective pom.
[ https://issues.apache.org/jira/browse/SPARK-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162417#comment-14162417 ] Apache Spark commented on SPARK-3812: - User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/2673 > Adapt maven build to publish effective pom. > --- > > Key: SPARK-3812 > URL: https://issues.apache.org/jira/browse/SPARK-3812 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Reporter: Prashant Sharma > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3338) Respect user setting of spark.submit.pyFiles
[ https://issues.apache.org/jira/browse/SPARK-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162411#comment-14162411 ] Apache Spark commented on SPARK-3338: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/2232 > Respect user setting of spark.submit.pyFiles > > > Key: SPARK-3338 > URL: https://issues.apache.org/jira/browse/SPARK-3338 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or > > We currently override any setting of spark.submit.pyFiles. Even though this > is not documented, we should still respect this if the user explicitly sets > this in his/her default properties file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3251) Clarify learning interfaces
[ https://issues.apache.org/jira/browse/SPARK-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162410#comment-14162410 ] Apache Spark commented on SPARK-3251: - User 'BigCrunsh' has created a pull request for this issue: https://github.com/apache/spark/pull/2137 > Clarify learning interfaces > > > Key: SPARK-3251 > URL: https://issues.apache.org/jira/browse/SPARK-3251 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0, 1.1.1 >Reporter: Christoph Sawade > > *Make threshold mandatory* > Currently, the output of predict for an example is either the score > or the class. This side effect is caused by clearThreshold. To > clarify that behaviour, three different types of predict (predictScore, > predictClass, predictProbability) were introduced; the threshold is no > longer optional. > *Clarify classification interfaces* > Currently, some functionality is spread over multiple models. > In order to clarify the structure and simplify the implementation of > more complex models (like multinomial logistic regression), two new > classes are introduced: > - BinaryClassificationModel: for all models that derive a binary > classification from a single weight vector. Comprises the thresholding > functionality to derive a prediction from a score. It basically captures > SVMModel and LogisticRegressionModel. > - ProbabilisticClassificationModel: This trait defines the interface for > models that return a calibrated confidence score (aka probability). > *Misc* > - some renaming > - add test for probabilistic output -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
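As a rough illustration of the split the ticket proposes, here is a hypothetical Python sketch (the real interfaces are Scala traits in MLlib; the class and method names below are invented for this example): score, probability, and class predictions are separate methods, and the threshold is mandatory rather than clearable.

```python
import math

class ProbabilisticBinaryModel:
    """Hypothetical sketch of the proposed interface: predictScore,
    predictProbability, and predictClass are distinct, and thresholding
    is always applied when a class is requested."""

    def __init__(self, weights, threshold=0.5):
        self.weights = weights
        self.threshold = threshold  # no clearThreshold(): always required

    def predict_score(self, features):
        # raw margin w . x
        return sum(w * x for w, x in zip(self.weights, features))

    def predict_probability(self, features):
        # calibrated confidence via the logistic function
        return 1.0 / (1.0 + math.exp(-self.predict_score(features)))

    def predict_class(self, features):
        # thresholding derives the class from the probability, never skipped
        return 1 if self.predict_probability(features) >= self.threshold else 0
```

The point of the separation is that a caller can no longer receive a score where it expected a class (or vice versa) depending on hidden model state.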
[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162404#comment-14162404 ] Apache Spark commented on SPARK-2017: - User 'carlosfuertes' has created a pull request for this issue: https://github.com/apache/spark/pull/1682 > web ui stage page becomes unresponsive when the number of tasks is large > > > Key: SPARK-2017 > URL: https://issues.apache.org/jira/browse/SPARK-2017 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Reynold Xin > Labels: starter > > {code} > sc.parallelize(1 to 100, 100).count() > {code} > The above code creates one million tasks to be executed. The stage detail web > ui page takes forever to load (if it ever completes). > There are again a few different alternatives: > 0. Limit the number of tasks we show. > 1. Pagination > 2. By default only show the aggregate metrics and failed tasks, and hide the > successful ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2803) add Kafka stream feature for fetch messages from specified starting offset position
[ https://issues.apache.org/jira/browse/SPARK-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162402#comment-14162402 ] Apache Spark commented on SPARK-2803: - User 'pengyanhong' has created a pull request for this issue: https://github.com/apache/spark/pull/1602 > add Kafka stream feature for fetch messages from specified starting offset > position > --- > > Key: SPARK-2803 > URL: https://issues.apache.org/jira/browse/SPARK-2803 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: pengyanhong > > There are some use cases where we want to fetch messages from a specified offset > position, as below: > * replay messages > * deal with transactions > * skip bulk incorrect messages > * randomly fetch messages according to an index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2805) update akka to version 2.3
[ https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162406#comment-14162406 ] Apache Spark commented on SPARK-2805: - User 'avati' has created a pull request for this issue: https://github.com/apache/spark/pull/1685 > update akka to version 2.3 > -- > > Key: SPARK-2805 > URL: https://issues.apache.org/jira/browse/SPARK-2805 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Reporter: Anand Avati > > akka-2.3 is the lowest version available in Scala 2.11 > akka-2.3 depends on protobuf 2.5. Hadoop-1 requires protobuf 2.4.1. In order > to reconcile the conflicting dependencies, need to release > akka-2.3.x-shaded-protobuf artifact which has protobuf 2.5 within. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3809) make HiveThriftServer2Suite work correctly
[ https://issues.apache.org/jira/browse/SPARK-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162418#comment-14162418 ] Apache Spark commented on SPARK-3809: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/2675 > make HiveThriftServer2Suite work correctly > -- > > Key: SPARK-3809 > URL: https://issues.apache.org/jira/browse/SPARK-3809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangfei > Fix For: 1.2.0 > > > Currently HiveThriftServer2Suite is effectively a fake test; the HiveThriftServer is not actually started there -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3505) Augmenting SparkStreaming updateStateByKey API with timestamp
[ https://issues.apache.org/jira/browse/SPARK-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162412#comment-14162412 ] Apache Spark commented on SPARK-3505: - User 'xiliu82' has created a pull request for this issue: https://github.com/apache/spark/pull/2267 > Augmenting SparkStreaming updateStateByKey API with timestamp > - > > Key: SPARK-3505 > URL: https://issues.apache.org/jira/browse/SPARK-3505 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Xi Liu >Priority: Minor > Fix For: 1.2.0 > > > The current updateStateByKey API in Spark Streaming does not expose the timestamp > to the application. > In our use case, the application needs to know the batch timestamp to decide > whether to keep the state or not. We do not want to use real system time > because we want to decouple the two (the same code base is used for > streaming and offline processing). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
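A minimal sketch of what the ticket asks for, assuming a hypothetical update-function shape that receives the batch timestamp (Python for illustration; the actual API would be Scala/Java, and the function and field names here are invented): state expiry is driven by batch time, not wall-clock time, so the same code behaves identically in streaming and offline replay.

```python
def update_with_time(batch_time, new_values, state, ttl=60):
    """Hypothetical timestamp-aware update function: the batch time is
    passed in, so deciding whether to keep the state never consults
    the real system clock."""
    if state is None:
        state = {"count": 0, "last_seen": batch_time}
    if new_values:
        state = {"count": state["count"] + sum(new_values), "last_seen": batch_time}
    # Drop the state once it has been idle longer than ttl, measured in batch time.
    if batch_time - state["last_seen"] > ttl:
        return None
    return state

# Simulate three batches for one key.
s = update_with_time(100, [1, 2], None)  # created at t=100, count becomes 3
s = update_with_time(130, [], s)         # idle for 30s, within ttl: kept
s = update_with_time(200, [], s)         # idle for 100s, past ttl: dropped
```

Because only `batch_time` is consulted, replaying historical batches with their original timestamps expires state exactly as the live run would.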
[jira] [Commented] (SPARK-2960) Spark executables fail to start via symlinks
[ https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162408#comment-14162408 ] Apache Spark commented on SPARK-2960: - User 'roji' has created a pull request for this issue: https://github.com/apache/spark/pull/1875 > Spark executables fail to start via symlinks > > > Key: SPARK-2960 > URL: https://issues.apache.org/jira/browse/SPARK-2960 > Project: Spark > Issue Type: Bug >Reporter: Shay Rojansky >Priority: Minor > > The current scripts (e.g. pyspark) fail to run when they are executed via > symlinks. A common Linux scenario would be to have Spark installed somewhere > (e.g. /opt) and have a symlink to it in /usr/bin. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3835) Spark applications that are killed should show up as "KILLED" or "CANCELLED" in the Spark UI
Matt Cheah created SPARK-3835: - Summary: Spark applications that are killed should show up as "KILLED" or "CANCELLED" in the Spark UI Key: SPARK-3835 URL: https://issues.apache.org/jira/browse/SPARK-3835 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0 Reporter: Matt Cheah Spark applications that crash or are killed are listed as FINISHED in the Spark UI. It looks like the Master only passes back a list of "Running" applications and a list of "Completed" applications. All of the applications under "Completed" have status "FINISHED"; however, applications that were killed manually should show "CANCELLED", and applications that failed should read "FAILED". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3834) Backticks not correctly handled in subquery aliases
Michael Armbrust created SPARK-3834: --- Summary: Backticks not correctly handled in subquery aliases Key: SPARK-3834 URL: https://issues.apache.org/jira/browse/SPARK-3834 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Ravindra Pesala Priority: Blocker [~ravi.pesala] assigning to you since you fixed the last problem here. Let me know if you don't have time to work on this or if you have any questions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2582) Make Block Manager Master pluggable
[ https://issues.apache.org/jira/browse/SPARK-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2582. Resolution: Won't Fix I closed this PR a long time ago - but didn't close the associated JIRA. > Make Block Manager Master pluggable > --- > > Key: SPARK-2582 > URL: https://issues.apache.org/jira/browse/SPARK-2582 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 1.0.0 >Reporter: Hari Shreedharan > > Today, there is no way to make the BMM pluggable. So if we want an HA BMM, > that needs to replace the current one. Making this pluggable and selected > based on a config makes it easy to select an HA or non-HA one based on the > application's preference. Streaming applications would be better off with an > HA one, while a normal application would not care (since the RDDs can be > regenerated). > Since communication from the Block Managers to the BMM is via akka, we can > keep that the same and just have the implementation of the BMM implement the > actual methods which do the real work - this would not affect the Block > Managers either. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3828) Spark returns inconsistent results when building with different Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162374#comment-14162374 ] Liquan Pei edited comment on SPARK-3828 at 10/7/14 7:33 PM: It seems that this is a bug in LineRecordReader. For Spark built with 1.0.4, when running {code} sc.textFile("text8").map(_.size).collect() {code} it returns {code} Array[Int] = Array(1) {code} which is consistent with the text8 file size. However, for Spark built with 2.4.0, the above code returns {code} Array[Int] = Array(1, 32891136) {code} Note that the second entry in the 2.4.0 result equals the 2nd partition size of text8, which means that the first record of that partition is not correctly skipped. Will try to reproduce it with Hadoop. was (Author: liquanpei): It seems that this is a bug in LineRecordReader. For Spark built with 1.0.4, when running {code} sc.textFile("text8").map(_.size).collect() {code} it returns {code} Array[Int] = Array(1) {code} which is consistent with the text8 file size. However, for Spark built with 2.4.0, the above code returns {code} Array[Int] = Array(1, 32891136) {code} Note that the second entry in the 2.4.0 result equals the second partition size of text8, which means that the first record of that partition is not correctly skipped. Will try to reproduce it with Hadoop. > Spark returns inconsistent results when building with different Hadoop > version > --- > > Key: SPARK-3828 > URL: https://issues.apache.org/jira/browse/SPARK-3828 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: OSX 10.9, Spark master branch >Reporter: Liquan Pei > > For text8 data at http://mattmahoney.net/dc/text8.zip. To reproduce, please > unzip first. > Spark built with different Hadoop versions returns different results.
> {code} > val data = sc.textFile("text8") > data.count() > {code} > returns 1 when built with SPARK_HADOOP_VERSION=1.0.4 and returns 2 when built > with SPARK_HADOOP_VERSION=2.4.0. > Looking through the rdd code, it seems that textFile uses hadoopFile, which > creates a HadoopRDD; we should probably create a NewHadoopRDD when building Spark > with SPARK_HADOOP_VERSION >= 2.0.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3828) Spark returns inconsistent results when building with different Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162374#comment-14162374 ] Liquan Pei commented on SPARK-3828: --- It seems that this is a bug in LineRecordReader. For Spark built with 1.0.4, when running {code} sc.textFile("text8").map(_.size).collect() {code} it returns {code} Array[Int] = Array(1) {code} which is consistent with the text8 file size. However, for Spark built with 2.4.0, the above code returns {code} Array[Int] = Array(1, 32891136) {code} Note that the second entry in the 2.4.0 result equals the second partition size of text8, which means that the first record of that partition is not correctly skipped. Will try to reproduce it with Hadoop. > Spark returns inconsistent results when building with different Hadoop > version > --- > > Key: SPARK-3828 > URL: https://issues.apache.org/jira/browse/SPARK-3828 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: OSX 10.9, Spark master branch >Reporter: Liquan Pei > > For text8 data at http://mattmahoney.net/dc/text8.zip. To reproduce, please > unzip first. > Spark built with different Hadoop versions returns different results. > {code} > val data = sc.textFile("text8") > data.count() > {code} > returns 1 when built with SPARK_HADOOP_VERSION=1.0.4 and returns 2 when built > with SPARK_HADOOP_VERSION=2.4.0. > Looking through the rdd code, it seems that textFile uses hadoopFile, which > creates a HadoopRDD; we should probably create a NewHadoopRDD when building Spark > with SPARK_HADOOP_VERSION >= 2.0.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
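For context on why a record could fail to be skipped: Hadoop's LineRecordReader assigns a line that crosses a split boundary to the split where the line starts. It does this by having every split except the first discard everything up to and including its first newline, while every split keeps reading one line past its end. Below is a small self-contained Python simulation of that rule (not Hadoop code; the function name is invented). When the skip works, concatenating the per-split records reproduces the file exactly for any split point; if the skip were missing, the second split would re-emit a partial copy of the boundary line, which matches the duplicated-record symptom described above.

```python
def read_split(data, start, length):
    """Simulate LineRecordReader for one input split of `data` (bytes):
    a split that does not begin at byte 0 discards everything up to and
    including the first newline, and every split reads one line past its
    end so a boundary-crossing line is owned by exactly one split."""
    end = start + length
    pos = start
    if start != 0:
        nl = data.find(b"\n", start)
        if nl == -1:
            return []  # the remainder belongs entirely to the previous split
        pos = nl + 1
    records = []
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            records.append(data[pos:])
            pos = len(data)
        else:
            records.append(data[pos:nl])
            pos = nl + 1
    return records
```

With two splits covering a file, `read_split(data, 0, cut) + read_split(data, cut, len(data) - cut)` yields the same records as reading the whole file in one split, for every cut point.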
[jira] [Resolved] (SPARK-3765) Add test information to sbt build docs
[ https://issues.apache.org/jira/browse/SPARK-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3765. Resolution: Fixed Assignee: wangfei This was resolved by: https://github.com/apache/spark/pull/2629 > Add test information to sbt build docs > -- > > Key: SPARK-3765 > URL: https://issues.apache.org/jira/browse/SPARK-3765 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.1.0 >Reporter: wangfei >Assignee: wangfei > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3765) Add test information to sbt build docs
[ https://issues.apache.org/jira/browse/SPARK-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3765: --- Fix Version/s: 1.2.0 > Add test information to sbt build docs > -- > > Key: SPARK-3765 > URL: https://issues.apache.org/jira/browse/SPARK-3765 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.1.0 >Reporter: wangfei >Assignee: wangfei > Fix For: 1.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3797) Run the shuffle service inside the YARN NodeManager as an AuxiliaryService
[ https://issues.apache.org/jira/browse/SPARK-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162351#comment-14162351 ] Patrick Wendell commented on SPARK-3797: For the dependencies issue - the plan is to create a separate build module that only contains the jar for the shuffle service so we can produce a jar with only this service and not the rest of Spark's dependency graph. This won't have any dependencies except for netty which is already a dependency of YARN and we are using the same version, and potentially the scala library jar (though we've even discussed writing this particular component in Java). I think that fully solves the issues Sandy has mentioned. BTW in general I don't think we are going to require this to run Spark-on-YARN in the future - it will just be a mode that people can run in if they want to have better elasticity. > Run the shuffle service inside the YARN NodeManager as an AuxiliaryService > -- > > Key: SPARK-3797 > URL: https://issues.apache.org/jira/browse/SPARK-3797 > Project: Spark > Issue Type: Sub-task > Components: YARN >Reporter: Patrick Wendell >Assignee: Andrew Or > > It's also worth considering running the shuffle service in a YARN container > beside the executor(s) on each node. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time
[ https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3731: -- Affects Version/s: 1.0.2 > RDD caching stops working in pyspark after some time > > > Key: SPARK-3731 > URL: https://issues.apache.org/jira/browse/SPARK-3731 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.0.2, 1.1.0, 1.2.0 > Environment: Linux, 32bit, both in local mode or in standalone > cluster mode >Reporter: Milan Straka >Assignee: Davies Liu >Priority: Critical > Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, > worker.log > > > Consider a file F which when loaded with sc.textFile and cached takes up > slightly more than half of free memory for RDD cache. > When in PySpark the following is executed: > 1) a = sc.textFile(F) > 2) a.cache().count() > 3) b = sc.textFile(F) > 4) b.cache().count() > and then the following is repeated (for example 10 times): > a) a.unpersist().cache().count() > b) b.unpersist().cache().count() > after some time, there are no RDDs cached in memory. > Also, since that time, no other RDD ever gets cached (the worker always > reports something like "WARN CacheManager: Not enough space to cache > partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if > rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that > all executors have 0MB memory used (which is consistent with the CacheManager > warning). > When doing the same in Scala, everything works perfectly. > I understand that this is a vague description, but I do not know how to > describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time
[ https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3731: -- Target Version/s: 1.1.1, 1.2.0 (was: 1.1.1, 1.2.0, 1.0.3) > RDD caching stops working in pyspark after some time > > > Key: SPARK-3731 > URL: https://issues.apache.org/jira/browse/SPARK-3731 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.1.0, 1.2.0 > Environment: Linux, 32bit, both in local mode or in standalone > cluster mode >Reporter: Milan Straka >Assignee: Davies Liu >Priority: Critical > Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, > worker.log > > > Consider a file F which when loaded with sc.textFile and cached takes up > slightly more than half of free memory for RDD cache. > When in PySpark the following is executed: > 1) a = sc.textFile(F) > 2) a.cache().count() > 3) b = sc.textFile(F) > 4) b.cache().count() > and then the following is repeated (for example 10 times): > a) a.unpersist().cache().count() > b) b.unpersist().cache().count() > after some time, there are no RDDs cached in memory. > Also, since that time, no other RDD ever gets cached (the worker always > reports something like "WARN CacheManager: Not enough space to cache > partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if > rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that > all executors have 0MB memory used (which is consistent with the CacheManager > warning). > When doing the same in Scala, everything works perfectly. > I understand that this is a vague description, but I do not know how to > describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time
[ https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3731: -- Affects Version/s: (was: 1.0.2) 1.2.0 > RDD caching stops working in pyspark after some time > > > Key: SPARK-3731 > URL: https://issues.apache.org/jira/browse/SPARK-3731 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.1.0, 1.2.0 > Environment: Linux, 32bit, both in local mode or in standalone > cluster mode >Reporter: Milan Straka >Assignee: Davies Liu >Priority: Critical > Attachments: spark-3731.log, spark-3731.py, spark-3731.txt.bz2, > worker.log > > > Consider a file F which when loaded with sc.textFile and cached takes up > slightly more than half of free memory for RDD cache. > When in PySpark the following is executed: > 1) a = sc.textFile(F) > 2) a.cache().count() > 3) b = sc.textFile(F) > 4) b.cache().count() > and then the following is repeated (for example 10 times): > a) a.unpersist().cache().count() > b) b.unpersist().cache().count() > after some time, there are no RDDs cached in memory. > Also, since that time, no other RDD ever gets cached (the worker always > reports something like "WARN CacheManager: Not enough space to cache > partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if > rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that > all executors have 0MB memory used (which is consistent with the CacheManager > warning). > When doing the same in Scala, everything works perfectly. > I understand that this is a vague description, but I do not know how to > describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3762) clear all SparkEnv references after stop
[ https://issues.apache.org/jira/browse/SPARK-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3762. -- Resolution: Fixed Fix Version/s: 1.2.0 > clear all SparkEnv references after stop > > > Key: SPARK-3762 > URL: https://issues.apache.org/jira/browse/SPARK-3762 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.2.0 > > > SparkEnv is cached in ThreadLocal object, so after stop and create a new > SparkContext, old SparkEnv is still used by some threads, it will trigger > many problems. > We should clear all the references after stop a SparkEnv. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sotos Matzanas closed SPARK-3732. - Resolution: Won't Fix > Yarn Client: Add option to NOT System.exit() at end of main() > - > > Key: SPARK-3732 > URL: https://issues.apache.org/jira/browse/SPARK-3732 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.1.0 >Reporter: Sotos Matzanas > Original Estimate: 1h > Remaining Estimate: 1h > > We would like to add the ability to create and submit Spark jobs > programmatically via Scala/Java. We have found a way to hack this and submit > jobs via Yarn, but since > org.apache.spark.deploy.yarn.Client.main() > exits with either 0 or 1 at the end, this will mean the exit of our own program. > We would like to add an optional spark conf param to NOT exit at the end of > main. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
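The requested change amounts to threading the status code out of main() instead of calling System.exit() unconditionally, so an embedding application can submit a job and keep running. A hypothetical Python sketch of that shape (all names here are invented for illustration; the real code is the Scala Client.main):

```python
import sys

def run_client(args):
    """Stand-in for the submission logic; returns 0 on success, 1 on failure."""
    return 0 if args and args[0] == "ok" else 1

def main(args, exit_on_completion=True):
    """Hypothetical shape of the requested option: embedders call
    main(args, exit_on_completion=False) and receive the status back,
    instead of having their own process terminated by an unconditional exit."""
    status = run_client(args)
    if exit_on_completion:
        sys.exit(status)
    return status
```

A command-line launcher keeps the default exit behavior, while an embedder opts out and inspects the returned status itself.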
[jira] [Commented] (SPARK-3808) PySpark fails to start in Windows
[ https://issues.apache.org/jira/browse/SPARK-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162310#comment-14162310 ] Andrew Or commented on SPARK-3808: -- Hey [~tsudukim] can you verify that pyspark, spark-shell and spark-submit work as expected in Windows now? > PySpark fails to start in Windows > - > > Key: SPARK-3808 > URL: https://issues.apache.org/jira/browse/SPARK-3808 > Project: Spark > Issue Type: Bug > Components: PySpark, Windows >Affects Versions: 1.2.0 > Environment: Windows >Reporter: Masayoshi TSUZUKI >Assignee: Masayoshi TSUZUKI >Priority: Blocker > Fix For: 1.2.0 > > > When we execute bin\pyspark.cmd in Windows, it fails to start. > We get following messages. > {noformat} > C:\>bin\pyspark.cmd > Running C:\\python.exe with > PYTHONPATH=C:\\bin\..\python\lib\py4j-0.8.2.1-src.zip;C:\\bin\..\python; > Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on > win32 > Type "help", "copyright", "credits" or "license" for more information. > ="x" was unexpected at this time. > Traceback (most recent call last): > File "C:\\bin\..\python\pyspark\shell.py", line 45, in > sc = SparkContext(appName="PySparkShell", pyFiles=add_files) > File "C:\\python\pyspark\context.py", line 103, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "C:\\python\pyspark\context.py", line 212, in _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File "C:\\python\pyspark\java_gateway.py", line 71, in launch_gateway > raise Exception(error_msg) > Exception: Launching GatewayServer failed with exit code 255! > Warning: Expected GatewayServer to output a port, but found no output. > >>> > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
[ https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Tkachenko closed SPARK-3761. - Resolution: Fixed > Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4 > - > > Key: SPARK-3761 > URL: https://issues.apache.org/jira/browse/SPARK-3761 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Igor Tkachenko >Priority: Critical > > I have Scala code: > val master = "spark://:7077" > val sc = new SparkContext(new SparkConf() > .setMaster(master) > .setAppName("SparkQueryDemo 01") > .set("spark.executor.memory", "512m")) > val count2 = sc .textFile("hdfs:// address>:8020/tmp/data/risk/account.txt") > .filter(line => line.contains("Word")) > .count() > I've got such an error: > [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to > stage failure: Task 0.0:0 failed 4 times, most > recent failure: Exception failure in TID 6 on host : > java.lang.ClassNotFoundExcept > ion: SimpleApp$$anonfun$1 > My dependencies : > object Version { > val spark= "1.0.0-cdh5.1.0" > } > object Library { > val sparkCore = "org.apache.spark" % "spark-assembly_2.10" % > Version.spark > } > My OS is Win 7, sbt 13.5, Scala 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3808) PySpark fails to start in Windows
[ https://issues.apache.org/jira/browse/SPARK-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3808. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Masayoshi TSUZUKI Target Version/s: 1.2.0 > PySpark fails to start in Windows > - > > Key: SPARK-3808 > URL: https://issues.apache.org/jira/browse/SPARK-3808 > Project: Spark > Issue Type: Bug > Components: PySpark, Windows >Affects Versions: 1.2.0 > Environment: Windows >Reporter: Masayoshi TSUZUKI >Assignee: Masayoshi TSUZUKI >Priority: Blocker > Fix For: 1.2.0 > > > When we execute bin\pyspark.cmd in Windows, it fails to start. > We get following messages. > {noformat} > C:\>bin\pyspark.cmd > Running C:\\python.exe with > PYTHONPATH=C:\\bin\..\python\lib\py4j-0.8.2.1-src.zip;C:\\bin\..\python; > Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on > win32 > Type "help", "copyright", "credits" or "license" for more information. > ="x" was unexpected at this time. > Traceback (most recent call last): > File "C:\\bin\..\python\pyspark\shell.py", line 45, in > sc = SparkContext(appName="PySparkShell", pyFiles=add_files) > File "C:\\python\pyspark\context.py", line 103, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "C:\\python\pyspark\context.py", line 212, in _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File "C:\\python\pyspark\java_gateway.py", line 71, in launch_gateway > raise Exception(error_msg) > Exception: Launching GatewayServer failed with exit code 255! > Warning: Expected GatewayServer to output a port, but found no output. > >>> > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3761) Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4
[ https://issues.apache.org/jira/browse/SPARK-3761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162307#comment-14162307 ] Igor Tkachenko commented on SPARK-3761: --- After I added the line sc.addJar("") it works for sbt (though not for a maven project; luckily that is enough for me). We can close the issue. > Class anonfun$1 not found exception / sbt 13.x / Scala 2.10.4 > - > > Key: SPARK-3761 > URL: https://issues.apache.org/jira/browse/SPARK-3761 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Igor Tkachenko >Priority: Critical > > I have Scala code: > val master = "spark://:7077" > val sc = new SparkContext(new SparkConf() > .setMaster(master) > .setAppName("SparkQueryDemo 01") > .set("spark.executor.memory", "512m")) > val count2 = sc .textFile("hdfs:// address>:8020/tmp/data/risk/account.txt") > .filter(line => line.contains("Word")) > .count() > I've got such an error: > [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to > stage failure: Task 0.0:0 failed 4 times, most > recent failure: Exception failure in TID 6 on host : > java.lang.ClassNotFoundExcept > ion: SimpleApp$$anonfun$1 > My dependencies : > object Version { > val spark= "1.0.0-cdh5.1.0" > } > object Library { > val sparkCore = "org.apache.spark" % "spark-assembly_2.10" % > Version.spark > } > My OS is Win 7, sbt 13.5, Scala 2.10.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3808) PySpark fails to start in Windows
[ https://issues.apache.org/jira/browse/SPARK-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3808: - Affects Version/s: (was: 1.1.0) 1.2.0 > PySpark fails to start in Windows > - > > Key: SPARK-3808 > URL: https://issues.apache.org/jira/browse/SPARK-3808 > Project: Spark > Issue Type: Bug > Components: PySpark, Windows >Affects Versions: 1.2.0 > Environment: Windows >Reporter: Masayoshi TSUZUKI >Priority: Blocker > > When we execute bin\pyspark.cmd in Windows, it fails to start. > We get following messages. > {noformat} > C:\>bin\pyspark.cmd > Running C:\\python.exe with > PYTHONPATH=C:\\bin\..\python\lib\py4j-0.8.2.1-src.zip;C:\\bin\..\python; > Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on > win32 > Type "help", "copyright", "credits" or "license" for more information. > ="x" was unexpected at this time. > Traceback (most recent call last): > File "C:\\bin\..\python\pyspark\shell.py", line 45, in > sc = SparkContext(appName="PySparkShell", pyFiles=add_files) > File "C:\\python\pyspark\context.py", line 103, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "C:\\python\pyspark\context.py", line 212, in _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File "C:\\python\pyspark\java_gateway.py", line 71, in launch_gateway > raise Exception(error_msg) > Exception: Launching GatewayServer failed with exit code 255! > Warning: Expected GatewayServer to output a port, but found no output. > >>> > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3297) [Spark SQL][UI] SchemaRDD toString with many columns messes up Storage tab display
[ https://issues.apache.org/jira/browse/SPARK-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3297. --- Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Assignee: Hossein Falaki Fixed by https://github.com/apache/spark/pull/2687 > [Spark SQL][UI] SchemaRDD toString with many columns messes up Storage tab > display > -- > > Key: SPARK-3297 > URL: https://issues.apache.org/jira/browse/SPARK-3297 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 1.0.2 >Reporter: Evan Chan >Assignee: Hossein Falaki >Priority: Minor > Labels: newbie > Fix For: 1.1.1, 1.2.0 > > > When a SchemaRDD with many columns (for example, 57 columns in this example) > is cached using sqlContext.cacheTable, the Storage tab of the driver Web UI > display gets messed up, because the long string of the SchemaRDD causes the > first column to be much much wider than the others, and in fact much wider > than the width of the browser. It would be nice to have the first column be > restricted to, say, 50% of the width of the browser window, with some minimum. 
> For example this is the SchemaRDD text for my table: > RDD Storage Info for ExistingRdd > [ActionGeo_ADM1Code#198,ActionGeo_CountryCode#199,ActionGeo_FeatureID#200,ActionGeo_FullName#201,ActionGeo_Lat#202,ActionGeo_Long#203,ActionGeo_Type#204,Actor1Code#205,Actor1CountryCode#206,Actor1EthnicCode#207,Actor1Geo_ADM1Code#208,Actor1Geo_CountryCode#209,Actor1Geo_FeatureID#210,Actor1Geo_FullName#211,Actor1Geo_Lat#212,Actor1Geo_Long#213,Actor1Geo_Type#214,Actor1KnownGroupCode#215,Actor1Name#216,Actor1Religion1Code#217,Actor1Religion2Code#218,Actor1Type1Code#219,Actor1Type2Code#220,Actor1Type3Code#221,Actor2Code#222,Actor2CountryCode#223,Actor2EthnicCode#224,Actor2Geo_ADM1Code#225,Actor2Geo_CountryCode#226,Actor2Geo_FeatureID#227,Actor2Geo_FullName#228,Actor2Geo_Lat#229,Actor2Geo_Long#230,Actor2Geo_Type#231,Actor2KnownGroupCode#232,Actor2Name#233,Actor2Religion1Code#234,Actor2Religion2Code#235,Actor2Type1Code#236,Actor2Type2Code#237,Actor2Type3Code#238,AvgTone#239,DATEADDED#240,Day#241,EventBaseCode#242,EventCode#243,EventId#244,EventRootCode#245,FractionDate#246,GoldsteinScale#247,IsRootEvent#248,MonthYear#249,NumArticles#250,NumMentions#251,NumSources#252,QuadClass#253,Year#254], > MappedRDD[200] > I would personally love to fix the toString method to not necessarily print > every column, but to cut it off after a while. This would aid the printout > in the Spark Shell as well. For example: > [ActionGeo_ADM1Code#198,ActionGeo_CountryCode#199,ActionGeo_FeatureID#200,ActionGeo_FullName#201,ActionGeo_Lat#202 > and 52 more columns] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
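[Editor's note] The truncation proposed at the end of the comment above ("cut it off after a while") is a small string-formatting change. A sketch of that idea in Python — the function name and the cutoff of 5 columns are illustrative choices, not the actual Spark fix:

```python
def truncated_schema_string(columns, max_shown=5):
    """Render a column list for toString, but cut it off after max_shown
    columns and summarize the rest, as the comment above suggests."""
    if len(columns) <= max_shown:
        return "[" + ",".join(columns) + "]"
    shown = ",".join(columns[:max_shown])
    remaining = len(columns) - max_shown
    return "[%s and %d more columns]" % (shown, remaining)

# Synthetic 57-column schema, mirroring the example in the issue.
cols = ["c%d#%d" % (i, 198 + i) for i in range(57)]
print(truncated_schema_string(cols))
# → [c0#198,c1#199,c2#200,c3#201,c4#202 and 52 more columns]
```

This keeps the RDD name short enough for both the Storage tab and shell printouts, while still hinting at the full width of the schema.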
[jira] [Resolved] (SPARK-2915) Storage summary table UI glitch when using sparkSQL
[ https://issues.apache.org/jira/browse/SPARK-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2915. --- Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Assignee: Hossein Falaki Fixed by https://github.com/apache/spark/pull/2687 > Storage summary table UI glitch when using sparkSQL > --- > > Key: SPARK-2915 > URL: https://issues.apache.org/jira/browse/SPARK-2915 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.0.2 > Environment: Standalone >Reporter: Hossein Falaki >Assignee: Hossein Falaki >Priority: Minor > Labels: WebUI > Fix For: 1.1.1, 1.2.0 > > > When using sqlContext.cacheTable() a registered table. the name of the RDD > becomes a very large string (related to the query that created the sqlRDD). > As a result the first columns of the storage tab in SparkUI becomes very long > and the other columns become squashed. > Since the name of the RDD is not human readable, we can simply set ellipsis > in the first cell (which will hide the rest of string). Alternatively we can > fix the RDD name to a more readable and shorter name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3827) Very long RDD names are not rendered properly in web UI
[ https://issues.apache.org/jira/browse/SPARK-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3827. --- Resolution: Fixed Fix Version/s: 1.1.1 Issue resolved by pull request 2687 [https://github.com/apache/spark/pull/2687] > Very long RDD names are not rendered properly in web UI > --- > > Key: SPARK-3827 > URL: https://issues.apache.org/jira/browse/SPARK-3827 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.1.0 >Reporter: Hossein Falaki >Priority: Minor > Fix For: 1.2.0, 1.1.1 > > > With Spark SQL we generate very long RDD names. These names are not properly > rendered in the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3797) Run the shuffle service inside the YARN NodeManager as an AuxiliaryService
[ https://issues.apache.org/jira/browse/SPARK-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162254#comment-14162254 ] Andrew Or commented on SPARK-3797: -- Thanks for detailing the considerations Sandy. I agree with every single one of the drawbacks you listed. The alternative of launching the shuffle service inside containers has been given much thought. However, it will be overkill if we allocate one such service for each executor or even application. In general, these services are intended to be long-running local resource managers that are really more suited to be run per-node. As you suggested, these services tend to have low memory requirements and would be forced to take up more than what is needed. For the rolling upgrades point, we can add some logic as in MR to handle short outages as Tom suggested. The dependency and deployment stories are a little harder to work around. I think the point here is that either way we need to offer an alternative of running it independently of the NM in case the cluster has conflicting dependencies. Perhaps we'll need some `start-shuffle-service.sh` script to launch these containers on all nodes before running any actual Spark application. I should note that our shuffle service is intended to be fairly lightweight and will have very limited dependencies (e.g. we are considering building it with Java because we don't want to bundle Scala). Hopefully that mitigates the issue. > Run the shuffle service inside the YARN NodeManager as an AuxiliaryService > -- > > Key: SPARK-3797 > URL: https://issues.apache.org/jira/browse/SPARK-3797 > Project: Spark > Issue Type: Sub-task > Components: YARN >Reporter: Patrick Wendell >Assignee: Andrew Or > > It's also worth considering running the shuffle service in a YARN container > beside the executor(s) on each node. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
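[Editor's note] The "handle short outages" idea mentioned above for rolling upgrades amounts to client-side retry with backoff: if the shuffle service is briefly down while its NodeManager restarts, the fetcher waits and retries instead of failing the task. A hedged sketch — fetch_block and its ConnectionError failure mode are hypothetical stand-ins, not Spark's actual fetch API:

```python
import time

def fetch_with_retry(fetch_block, block_id, max_retries=3, base_delay=0.1):
    """Retry a shuffle fetch with exponential backoff so a short
    shuffle-service outage (e.g. a rolling NodeManager upgrade) does not
    immediately fail the task. fetch_block is a hypothetical callable that
    raises ConnectionError while the service is unreachable."""
    attempt = 0
    while True:
        try:
            return fetch_block(block_id)
        except ConnectionError:
            attempt += 1
            if attempt > max_retries:
                raise  # outage outlasted the retry budget; let the task fail
            time.sleep(base_delay * (2 ** (attempt - 1)))
```

The retry budget bounds how long an upgrade window the fetcher can ride out; beyond it the normal task-failure path takes over.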
[jira] [Commented] (SPARK-3819) Jenkins should compile Spark against multiple versions of Hadoop
[ https://issues.apache.org/jira/browse/SPARK-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162195#comment-14162195 ] Matt Cheah commented on SPARK-3819: --- Can you elaborate as to why it is not feasible to build against multiple Hadoop versions? Is it simply because it is too slow? I still strongly stand by the idea of making the need to test building against multiple versions explicit to the contributor. We need to minimize the risk of breaking the build. > Jenkins should compile Spark against multiple versions of Hadoop > > > Key: SPARK-3819 > URL: https://issues.apache.org/jira/browse/SPARK-3819 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.1.0 >Reporter: Matt Cheah >Priority: Minor > Labels: Jenkins > Fix For: 1.1.1 > > > The build broke because of PR > https://github.com/apache/spark/pull/2609#issuecomment-57962393 - however the > build failure was not caught by Jenkins. From what I understand the build > failure occurs when Spark is built manually against certain versions of > Hadoop. > It seems intuitive that Jenkins should catch this sort of thing. The code > should be compiled against multiple Hadoop versions. It seems like overkill > to run the full test suite against all Hadoop versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1297) Upgrade HBase dependency to 0.98.0
[ https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-1297: -- Attachment: spark-1297-v6.txt Patch v6 uses the HBase 0.98.5 release. > Upgrade HBase dependency to 0.98.0 > -- > > Key: SPARK-1297 > URL: https://issues.apache.org/jira/browse/SPARK-1297 > Project: Spark > Issue Type: Task >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Minor > Attachments: pom.xml, spark-1297-v2.txt, spark-1297-v4.txt, > spark-1297-v5.txt, spark-1297-v6.txt > > > HBase 0.94.6 was released 11 months ago. > Upgrade HBase dependency to 0.98.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162156#comment-14162156 ] Xiangrui Meng commented on SPARK-3434: -- [~shivaram] and [~ConcreteVitamin] Any updates on the design doc and prototype? > Distributed block matrix > > > Key: SPARK-3434 > URL: https://issues.apache.org/jira/browse/SPARK-3434 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng > > This JIRA is for discussing distributed matrices stored in block > sub-matrices. The main challenge is the partitioning scheme to allow adding > linear algebra operations in the future, e.g.: > 1. matrix multiplication > 2. matrix factorization (QR, LU, ...) > Let's discuss the partitioning and storage and how they fit into the above > use cases. > Questions: > 1. Should it be backed by a single RDD that contains all of the sub-matrices > or many RDDs with each contains only one sub-matrix? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
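[Editor's note] The partitioning question in SPARK-3434 — a single collection of sub-matrices keyed by block coordinates — can be illustrated locally. A pure-Python sketch (no Spark; plain dicts stand in for an RDD of ((blockRow, blockCol), sub-matrix) pairs) showing why that keying supports multiplication:

```python
def mat_mult(a, b):
    """Multiply two dense sub-matrices stored as lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def mat_add(a, b):
    """Element-wise sum of two equally sized sub-matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def block_multiply(A, B):
    """Multiply block matrices stored as {(blockRow, blockCol): sub-matrix}.
    C[i,j] = sum over k of A[i,k] * B[k,j]. In a distributed setting this
    accumulation would become a reduceByKey over ((i, j), partial-product)
    pairs, which is why keying blocks by their coordinates is convenient."""
    C = {}
    for (i, k), a_blk in A.items():
        for (k2, j), b_blk in B.items():
            if k == k2:
                p = mat_mult(a_blk, b_blk)
                C[(i, j)] = mat_add(C[(i, j)], p) if (i, j) in C else p
    return C
```

The same coordinate keying also answers the single-RDD-vs-many-RDDs question for multiplication: one keyed collection lets a join on the shared block index produce all partial products in a single shuffle.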