[jira] [Commented] (SPARK-6397) Override QueryPlan.missingInput when necessary and rely on CheckAnalysis
[ https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375477#comment-14375477 ] Apache Spark commented on SPARK-6397: - User 'watermen' has created a pull request for this issue: https://github.com/apache/spark/pull/5132 Override QueryPlan.missingInput when necessary and rely on CheckAnalysis Key: SPARK-6397 URL: https://issues.apache.org/jira/browse/SPARK-6397 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yadong Qi Priority: Minor Currently, some LogicalPlans do not override missingInput, but they should. Then, the lack of proper missingInput implementations leaks to CheckAnalysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
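For illustration, a minimal sketch (the node, its fields, and the imports are assumptions for illustration, not taken from the linked pull request) of what overriding missingInput can look like for a logical plan that produces some of its own output attributes, so that CheckAnalysis does not report them as missing:
{code}
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}

// Hypothetical node: it emits its own attributes in addition to the child's,
// so the default missingInput (references -- inputSet) would wrongly flag them.
case class MyGeneratingNode(generated: Seq[Attribute], child: LogicalPlan) extends UnaryNode {

  override def output: Seq[Attribute] = child.output ++ generated

  // Subtract the self-produced attributes so CheckAnalysis only sees
  // genuinely unresolved references.
  override def missingInput: AttributeSet = super.missingInput -- generated
}
{code}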
[jira] [Commented] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
[ https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375508#comment-14375508 ] Littlestar commented on SPARK-6461: --- http://spark.apache.org/docs/latest/running-on-mesos.html I can run OK with ./bin/spark-shell --master mesos://host:5050, but it fails with run-example SparkPi. Has SparkPi been tested on Mesos with Spark 1.3.0? Thanks. spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos -- Key: SPARK-6461 URL: https://issues.apache.org/jira/browse/SPARK-6461 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.3.0 Reporter: Littlestar I use Mesos to run Spark 1.3.0 ./run-example SparkPi, but it failed. spark.executorEnv.PATH in spark-defaults.conf is not passed to Mesos:
spark.executorEnv.PATH
spark.executorEnv.HADOOP_HOME
spark.executorEnv.JAVA_HOME
{noformat}
E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop fs -copyToLocal 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz'
sh: hadoop: command not found
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
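For reference, a minimal sketch (all paths and values are illustrative assumptions) of setting the same executor environment variables programmatically through SparkConf.setExecutorEnv, which corresponds to the spark.executorEnv.* properties the reporter is trying to pass via spark-defaults.conf:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent to spark.executorEnv.* entries in spark-defaults.conf;
// the paths below are placeholders, not recommendations.
val conf = new SparkConf()
  .setMaster("mesos://host:5050")
  .setAppName("SparkPi")
  .setExecutorEnv("PATH", "/usr/local/bin:/usr/bin:/bin:/opt/hadoop/bin")
  .setExecutorEnv("HADOOP_HOME", "/opt/hadoop")
  .setExecutorEnv("JAVA_HOME", "/usr/java/default")
val sc = new SparkContext(conf)
{code}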
[jira] [Commented] (SPARK-6213) sql.catalyst.expressions.Expression is not friendly to java
[ https://issues.apache.org/jira/browse/SPARK-6213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375612#comment-14375612 ] Littlestar commented on SPARK-6213: --- Maybe changing protected[sql] def selectFilters(filters: Seq[Expression]) into a public static Java-friendly class would be an easy way. sql.catalyst.expressions.Expression is not friendly to java --- Key: SPARK-6213 URL: https://issues.apache.org/jira/browse/SPARK-6213 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Littlestar Priority: Minor
sql.sources.CatalystScan#
public RDD<Row> buildScan(Seq<Attribute> requiredColumns, Seq<Expression> filters)
I use Java to extend BaseRelation, but sql.catalyst.expressions.Expression is not friendly to Java; it can't be iterated from Java code (e.g. NodeName, NodeType, FuncName, FuncArgs).
DataSourceStrategy.scala#selectFilters
{noformat}
/**
 * Selects Catalyst predicate [[Expression]]s which are convertible into data source [[Filter]]s,
 * and convert them.
 */
protected[sql] def selectFilters(filters: Seq[Expression]) = {
  def translate(predicate: Expression): Option[Filter] = predicate match {
    case expressions.EqualTo(a: Attribute, Literal(v, _)) =>
      Some(sources.EqualTo(a.name, v))
    case expressions.EqualTo(Literal(v, _), a: Attribute) =>
      Some(sources.EqualTo(a.name, v))
    case expressions.GreaterThan(a: Attribute, Literal(v, _)) =>
      Some(sources.GreaterThan(a.name, v))
    case expressions.GreaterThan(Literal(v, _), a: Attribute) =>
      Some(sources.LessThan(a.name, v))
    case expressions.LessThan(a: Attribute, Literal(v, _)) =>
      Some(sources.LessThan(a.name, v))
    case expressions.LessThan(Literal(v, _), a: Attribute) =>
      Some(sources.GreaterThan(a.name, v))
    case expressions.GreaterThanOrEqual(a: Attribute, Literal(v, _)) =>
      Some(sources.GreaterThanOrEqual(a.name, v))
    case expressions.GreaterThanOrEqual(Literal(v, _), a: Attribute) =>
      Some(sources.LessThanOrEqual(a.name, v))
    case expressions.LessThanOrEqual(a: Attribute, Literal(v, _)) =>
      Some(sources.LessThanOrEqual(a.name, v))
    case expressions.LessThanOrEqual(Literal(v, _), a: Attribute) =>
      Some(sources.GreaterThanOrEqual(a.name, v))
    case expressions.InSet(a: Attribute, set) =>
      Some(sources.In(a.name, set.toArray))
    case expressions.IsNull(a: Attribute) =>
      Some(sources.IsNull(a.name))
    case expressions.IsNotNull(a: Attribute) =>
      Some(sources.IsNotNull(a.name))
    case expressions.And(left, right) =>
      (translate(left) ++ translate(right)).reduceOption(sources.And)
    case expressions.Or(left, right) =>
      for {
        leftFilter <- translate(left)
        rightFilter <- translate(right)
      } yield sources.Or(leftFilter, rightFilter)
    case expressions.Not(child) =>
      translate(child).map(sources.Not)
    case _ => None
  }

  filters.flatMap(translate)
}
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
[ https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Littlestar updated SPARK-6461: -- Comment: was deleted (was: In spark/bin, some shell scripts use /usr/bin/env bash. I changed #!/usr/bin/env bash to #!/bin/bash and that worked.) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos -- Key: SPARK-6461 URL: https://issues.apache.org/jira/browse/SPARK-6461 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.3.0 Reporter: Littlestar I use Mesos to run Spark 1.3.0 ./run-example SparkPi, but it failed. spark.executorEnv.PATH in spark-defaults.conf is not passed to Mesos:
spark.executorEnv.PATH
spark.executorEnv.HADOOP_HOME
spark.executorEnv.JAVA_HOME
{noformat}
E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop fs -copyToLocal 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz'
sh: hadoop: command not found
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376579#comment-14376579 ] Andrew Drozdov commented on SPARK-6474: --- Great, and thanks. Taking a look now. Replace image.run with connection.run_instances in spark_ec2.py --- Key: SPARK-6474 URL: https://issues.apache.org/jira/browse/SPARK-6474 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Andrew Drozdov Priority: Minor After looking at an issue in Boto [1], ec2.image.Image.run and ec2.connection.EC2Connection.run_instances are similar calls, but run_instances appears to have more features and is more up to date. For example, run_instances has the capability to launch ebs_optimized instances while run does not. The run call is being used in only a couple places in spark_ec2.py, so let's replace it with run_instances. [1] https://github.com/boto/boto/issues/3054 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6474: Issue Type: Improvement (was: Bug) Replace image.run with connection.run_instances in spark_ec2.py --- Key: SPARK-6474 URL: https://issues.apache.org/jira/browse/SPARK-6474 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Andrew Drozdov Priority: Minor After looking at an issue in Boto [1], ec2.image.Image.run and ec2.connection.EC2Connection.run_instances are similar calls, but run_instances appears to have more features and is more up to date. For example, run_instances has the capability to launch ebs_optimized instances while run does not. The run call is being used in only a couple places in spark_ec2.py, so let's replace it with run_instances. [1] https://github.com/boto/boto/issues/3054 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6477) Run MIMA tests before the Spark test suite
[ https://issues.apache.org/jira/browse/SPARK-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brennon York updated SPARK-6477: Issue Type: Improvement (was: Bug) Run MIMA tests before the Spark test suite -- Key: SPARK-6477 URL: https://issues.apache.org/jira/browse/SPARK-6477 Project: Spark Issue Type: Improvement Components: Build Reporter: Brennon York Priority: Minor Right now the MIMA tests are the last thing to run, yet they run very quickly and, if they fail, there was no need for the entire Spark test suite to have completed first. I propose we move the MIMA tests to run before the full Spark suite so that builds that fail the MIMA checks return much faster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376584#comment-14376584 ] Nicholas Chammas commented on SPARK-6474: - This change also fits the pattern of [{{request_spot_instances()}}|https://github.com/apache/spark/blob/474d1320c9b93c501710ad1cfa836b8284562a2c/ec2/spark_ec2.py#L542], which is called on the connection like {{run_instances()}} as opposed to on an {{Image}}. Replace image.run with connection.run_instances in spark_ec2.py --- Key: SPARK-6474 URL: https://issues.apache.org/jira/browse/SPARK-6474 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Andrew Drozdov Priority: Minor After looking at an issue in Boto [1], ec2.image.Image.run and ec2.connection.EC2Connection.run_instances are similar calls, but run_instances appears to have more features and is more up to date. For example, run_instances has the capability to launch ebs_optimized instances while run does not. The run call is being used in only a couple places in spark_ec2.py, so let's replace it with run_instances. [1] https://github.com/boto/boto/issues/3054 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6476) Spark fileserver not started on same IP as using spark.driver.host
Rares Vernica created SPARK-6476: Summary: Spark fileserver not started on same IP as using spark.driver.host Key: SPARK-6476 URL: https://issues.apache.org/jira/browse/SPARK-6476 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Rares Vernica I initially inquired about this here: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3ccalq9kxcn2mwfnd4r4k0q+qh1ypwn3p8rgud1v6yrx9_05lv...@mail.gmail.com%3E If the Spark driver host has multiple IPs and spark.driver.host is set to one of them, I would expect the fileserver to start on the same IP. I checked HttpServer, and the jetty Server is started on the default IP of the machine: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala#L75 Something like this might work instead:
{code:title=HttpServer.scala#L75}
val server = new Server(new InetSocketAddress(conf.get("spark.driver.host"), 0))
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6477) Run MIMA tests before the Spark test suite
Brennon York created SPARK-6477: --- Summary: Run MIMA tests before the Spark test suite Key: SPARK-6477 URL: https://issues.apache.org/jira/browse/SPARK-6477 Project: Spark Issue Type: Bug Components: Build Reporter: Brennon York Priority: Minor Right now the MIMA tests are the last thing to run, yet they run very quickly and, if they fail, there was no need for the entire Spark test suite to have completed first. I propose we move the MIMA tests to run before the full Spark suite so that builds that fail the MIMA checks return much faster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-6122) Upgrade Tachyon dependency to 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-6122: I reverted this because it looks like it was responsible for some testing failures due to the dependency changes. Upgrade Tachyon dependency to 0.6.0 --- Key: SPARK-6122 URL: https://issues.apache.org/jira/browse/SPARK-6122 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Haoyuan Li Assignee: Calvin Jia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6308) VectorUDT is displayed as `vecto` in dtypes
[ https://issues.apache.org/jira/browse/SPARK-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6308. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5118 [https://github.com/apache/spark/pull/5118] VectorUDT is displayed as `vecto` in dtypes --- Key: SPARK-6308 URL: https://issues.apache.org/jira/browse/SPARK-6308 Project: Spark Issue Type: Bug Components: MLlib, SQL Reporter: Xiangrui Meng Assignee: Manoj Kumar Fix For: 1.4.0 VectorUDT should override simpleString instead of relying on the default implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Drozdov updated SPARK-6474: -- Summary: Replace image.run with connection.run_instances in spark_ec2.py (was: Replace image.run with connection.run_instances) Replace image.run with connection.run_instances in spark_ec2.py --- Key: SPARK-6474 URL: https://issues.apache.org/jira/browse/SPARK-6474 Project: Spark Issue Type: Bug Components: EC2 Reporter: Andrew Drozdov After looking at an issue in Boto [1], ec2.image.Image.run and ec2.connection.EC2Connection.run_instances are similar calls, but run_instances appears to have more features and is more up to date. For example, run_instances has the capability to launch ebs_optimized instances while run does not. The run call is being used in only a couple places in spark_ec2.py, so let's replace it with run_instances. [1] https://github.com/boto/boto/issues/3054 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6474) Replace image.run with connection.run_instances
Andrew Drozdov created SPARK-6474: - Summary: Replace image.run with connection.run_instances Key: SPARK-6474 URL: https://issues.apache.org/jira/browse/SPARK-6474 Project: Spark Issue Type: Bug Components: EC2 Reporter: Andrew Drozdov After looking at an issue in Boto [1], ec2.image.Image.run and ec2.connection.EC2Connection.run_instances are similar calls, but run_instances appears to have more features and is more up to date. For example, run_instances has the capability to launch ebs_optimized instances while run does not. The run call is being used in only a couple places in spark_ec2.py, so let's replace it with run_instances. [1] https://github.com/boto/boto/issues/3054 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6474: Priority: Minor (was: Major) Replace image.run with connection.run_instances in spark_ec2.py --- Key: SPARK-6474 URL: https://issues.apache.org/jira/browse/SPARK-6474 Project: Spark Issue Type: Bug Components: EC2 Reporter: Andrew Drozdov Priority: Minor After looking at an issue in Boto [1], ec2.image.Image.run and ec2.connection.EC2Connection.run_instances are similar calls, but run_instances appears to have more features and is more up to date. For example, run_instances has the capability to launch ebs_optimized instances while run does not. The run call is being used in only a couple places in spark_ec2.py, so let's replace it with run_instances. [1] https://github.com/boto/boto/issues/3054 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376572#comment-14376572 ] Nicholas Chammas edited comment on SPARK-6474 at 3/23/15 8:29 PM: -- LGTM. Just setting the Priority to Minor since this doesn't cause any major problems, though it should be fixed. was (Author: nchammas): LGTM. Replace image.run with connection.run_instances in spark_ec2.py --- Key: SPARK-6474 URL: https://issues.apache.org/jira/browse/SPARK-6474 Project: Spark Issue Type: Bug Components: EC2 Reporter: Andrew Drozdov Priority: Minor After looking at an issue in Boto [1], ec2.image.Image.run and ec2.connection.EC2Connection.run_instances are similar calls, but run_instances appears to have more features and is more up to date. For example, run_instances has the capability to launch ebs_optimized instances while run does not. The run call is being used in only a couple places in spark_ec2.py, so let's replace it with run_instances. [1] https://github.com/boto/boto/issues/3054 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6473) Launcher lib shouldn't try to figure out Scala version when not in dev mode
[ https://issues.apache.org/jira/browse/SPARK-6473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376569#comment-14376569 ] Apache Spark commented on SPARK-6473: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/5143 Launcher lib shouldn't try to figure out Scala version when not in dev mode --- Key: SPARK-6473 URL: https://issues.apache.org/jira/browse/SPARK-6473 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Fix For: 1.4.0 Thanks to [~nravi] for pointing this out. The launcher library currently always tries to figure out what's the build's scala version, even when it's not needed. That code is only used when setting some dev options, and relies on the layout of the build directories, so it doesn't work with the directory layout created by make-distribution.sh. Right now this works on Linux because bin/load-spark-env.sh sets the Scala version explicitly, but it would break the distribution on Windows, for example. Fix is pretty straight-forward. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376572#comment-14376572 ] Nicholas Chammas commented on SPARK-6474: - LGTM. Replace image.run with connection.run_instances in spark_ec2.py --- Key: SPARK-6474 URL: https://issues.apache.org/jira/browse/SPARK-6474 Project: Spark Issue Type: Bug Components: EC2 Reporter: Andrew Drozdov After looking at an issue in Boto [1], ec2.image.Image.run and ec2.connection.EC2Connection.run_instances are similar calls, but run_instances appears to have more features and is more up to date. For example, run_instances has the capability to launch ebs_optimized instances while run does not. The run call is being used in only a couple places in spark_ec2.py, so let's replace it with run_instances. [1] https://github.com/boto/boto/issues/3054 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6475) DataFrame should support array types when creating DFs from JavaBeans.
Xiangrui Meng created SPARK-6475: Summary: DataFrame should support array types when creating DFs from JavaBeans. Key: SPARK-6475 URL: https://issues.apache.org/jira/browse/SPARK-6475 Project: Spark Issue Type: Improvement Components: DataFrame, SQL Reporter: Xiangrui Meng If we have a JavaBean class with array fields, SQL throws an exception in `createDataFrame` because arrays are not matched in `getSchema` from a JavaBean class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6475) DataFrame should support array types when creating DFs from JavaBeans.
[ https://issues.apache.org/jira/browse/SPARK-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-6475: Assignee: Xiangrui Meng DataFrame should support array types when creating DFs from JavaBeans. -- Key: SPARK-6475 URL: https://issues.apache.org/jira/browse/SPARK-6475 Project: Spark Issue Type: Improvement Components: DataFrame, SQL Reporter: Xiangrui Meng Assignee: Xiangrui Meng If we have a JavaBean class with array fields, SQL throws an exception in `createDataFrame` because arrays are not matched in `getSchema` from a JavaBean class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6475) DataFrame should support array types when creating DFs from JavaBeans.
[ https://issues.apache.org/jira/browse/SPARK-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376780#comment-14376780 ] Apache Spark commented on SPARK-6475: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5146 DataFrame should support array types when creating DFs from JavaBeans. -- Key: SPARK-6475 URL: https://issues.apache.org/jira/browse/SPARK-6475 Project: Spark Issue Type: Improvement Components: DataFrame, SQL Reporter: Xiangrui Meng Assignee: Xiangrui Meng If we have a JavaBean class with array fields, SQL throws an exception in `createDataFrame` because arrays are not matched in `getSchema` from a JavaBean class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
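To make the failure mode concrete, a minimal sketch (class and field names are made up, written as a bean-style Scala class for brevity) of the pattern the description refers to: a JavaBean with an array-typed property passed to createDataFrame, which per the description fails because getSchema does not handle array fields:
{code}
import scala.beans.BeanProperty
import org.apache.spark.sql.SQLContext

// Bean-style class with an array-typed property; per the description,
// getSchema cannot map the array field to a Catalyst type before the fix.
class LabeledVector extends Serializable {
  @BeanProperty var label: Double = 0.0
  @BeanProperty var features: Array[Double] = Array.empty
}

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`
val beans = sc.parallelize(Seq(new LabeledVector, new LabeledVector))
val df = sqlContext.createDataFrame(beans, classOf[LabeledVector])
{code}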
[jira] [Commented] (SPARK-6369) InsertIntoHiveTable should use logic from SparkHadoopWriter
[ https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376234#comment-14376234 ] Apache Spark commented on SPARK-6369: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/5139 InsertIntoHiveTable should use logic from SparkHadoopWriter --- Key: SPARK-6369 URL: https://issues.apache.org/jira/browse/SPARK-6369 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Blocker Right now it is possible that we will corrupt the output if there is a race between competing speculative tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4848) On a stand-alone cluster, several worker-specific variables are read only on the master
[ https://issues.apache.org/jira/browse/SPARK-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376239#comment-14376239 ] Apache Spark commented on SPARK-4848: - User 'nkronenfeld' has created a pull request for this issue: https://github.com/apache/spark/pull/5140 On a stand-alone cluster, several worker-specific variables are read only on the master --- Key: SPARK-4848 URL: https://issues.apache.org/jira/browse/SPARK-4848 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Environment: stand-alone spark cluster Reporter: Nathan Kronenfeld Original Estimate: 24h Remaining Estimate: 24h On a stand-alone spark cluster, much of the determination of worker specifics, especially when one has multiple instances per node, is done only on the master. The master loops over instances, and starts a worker per instance on each node. This means that if your workers have different values of SPARK_WORKER_INSTANCES or SPARK_WORKER_WEBUI_PORT from each other (or from the master), all values are ignored except the one on the master. SPARK_WORKER_PORT looks like it is unread in scripts, but read in code - I'm not sure how it will behave, since all instances will read the same value from the environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5954) Add topByKey to pair RDDs
[ https://issues.apache.org/jira/browse/SPARK-5954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376215#comment-14376215 ] Xiangrui Meng commented on SPARK-5954: -- Note: We added topByKey in mllib.rdd.MLPairRDDFunctions instead of in Core. Add topByKey to pair RDDs - Key: SPARK-5954 URL: https://issues.apache.org/jira/browse/SPARK-5954 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Xiangrui Meng Assignee: Shuo Xiang Fix For: 1.4.0 `topByKey(num: Int): RDD[(K, V)]` finds the top-k values for each key in a pair RDD. This is used, e.g., in computing top recommendations. We can use the Guava implementation of finding top-k from an iterator. See also https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Utils.scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
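A small usage sketch (the data is made up) of the method as it landed in mllib.rdd.MLPairRDDFunctions; note that in the merged form it appears to return the per-key values as an array rather than the RDD[(K, V)] shown in the description:
{code}
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

// Keep the top 2 scores per key (top-k per key via the implicit conversion).
val scores = sc.parallelize(Seq(
  ("user1", 0.9), ("user1", 0.3), ("user1", 0.7),
  ("user2", 0.2), ("user2", 0.8)))
val top2 = scores.topByKey(2)  // RDD[(String, Array[Double])]
top2.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(", ")}") }
{code}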
[jira] [Assigned] (SPARK-6470) Allow Spark apps to put YARN node labels in their requests
[ https://issues.apache.org/jira/browse/SPARK-6470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza reassigned SPARK-6470: - Assignee: Sandy Ryza Allow Spark apps to put YARN node labels in their requests -- Key: SPARK-6470 URL: https://issues.apache.org/jira/browse/SPARK-6470 Project: Spark Issue Type: Improvement Components: YARN Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
[ https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376311#comment-14376311 ] Apache Spark commented on SPARK-6471: - User 'saucam' has created a pull request for this issue: https://github.com/apache/spark/pull/5141 Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns --- Key: SPARK-6471 URL: https://issues.apache.org/jira/browse/SPARK-6471 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yash Datta Fix For: 1.4.0 Currently in the parquet relation 2 implementation, error is thrown in case merged schema is not exactly the same as metastore schema. But to support cases like deletion of column using replace column command, we can relax the restriction so that even if metastore schema is a subset of merged parquet schema, the query will work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3720) support ORC in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376361#comment-14376361 ] Zhan Zhang commented on SPARK-3720: --- [~iward] Since this jiar is duplicated to Spark-2883, we can move the discussion to that one. In short, Sparkl-2883 wants to support saveAsOrcFile and orcFile similar to parquet file support in spark. But due to the underlying api change, the patch is delayed. support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: Fei Wang Attachments: orc.diff The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on hdfs.ORC file format has many advantages such as: 1 a single file as the output of each task, which reduces the NameNode's load 2 Hive type support including datetime, decimal, and the complex types (struct, list, map, and union) 3 light-weight indexes stored within the file skip row groups that don't pass predicate filtering seek to a given row 4 block-mode compression based on data type run-length encoding for integer columns dictionary encoding for string columns 5 concurrent reads of the same file using separate RecordReaders 6 ability to split files without scanning for markers 7 bound the amount of memory needed for reading or writing 8 metadata stored using Protocol Buffers, which allows addition and removal of fields Now spark sql support Parquet, support ORC provide people more opts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2331: - Priority: Minor (was: Major) Target Version/s: 2+ SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T] -- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Ian Hummel Priority: Minor The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder):
{code}
val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒
  rdd.union(sc.textFile(path))
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-2331: -- Reopening to catch this for version 2.x SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T] -- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder):
{code}
val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒
  rdd.union(sc.textFile(path))
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
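For comparison, a sketch of how the same fold reads once emptyRDD is declared to return RDD[T]: the explicit foldLeft[RDD[String]] annotation is no longer needed. (Paths are placeholders; this shows the desired post-change behaviour, not code that compiles against 1.0.0 as-is.)
{code}
// With emptyRDD typed as RDD[String], the accumulator type is inferred correctly.
val rdds = Seq("a", "b", "c").foldLeft(sc.emptyRDD[String]) { (rdd, path) =>
  rdd.union(sc.textFile(path))
}
{code}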
[jira] [Commented] (SPARK-4830) Spark Streaming Java Application : java.lang.ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376335#comment-14376335 ] sam commented on SPARK-4830: I've been seeing a similar problem, but for a regular Spark job (no streaming). The executor logs also contain a storm of other exceptions (lots of IO and connection exceptions), all seemingly unrelated to each other. My suspicion is that this is caused by https://issues.apache.org/jira/browse/SPARK-3768 (also read the linked JIRAs), and that the storm of nonsense exceptions occurs because YARN is killing off the executors using some heavy kill signals - the JVMs are mid-process and cannot cope gracefully. A workaround appears to be to manually set spark.yarn.executor.memoryOverhead to something much larger than the default (~400 MB). I'm pretty sure I've seen a correspondence between the number of partitions I use and the chances of seeing this kind of thing. Anyway, I've had to set my overhead to 12 GB, which seems a little large, right? Anything lower and I see a mess of exceptions in my worker logs.
{noformat}
15/03/23 18:08:13 ERROR BlockManagerWorker: Exception handling buffer message
java.lang.ClassNotFoundException: com.xxx.Main$XxxXxx$3
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:270)
	at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
	at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at org.apache.spark.storage.MemoryStore.putBytes(MemoryStore.scala:59)
	at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:437)
	at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:359)
	at org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
	at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
	at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
	at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
	at org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
	at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
	at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
	at
{noformat}
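For reference, a sketch (the values are illustrative, not recommendations) of the workaround described in the comment above: raising the YARN executor memory overhead so the container is not killed for exceeding its memory limit.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  // Per-container headroom for off-heap/JVM overhead, in MB;
  // the default in Spark 1.x is only a few hundred MB.
  .set("spark.yarn.executor.memoryOverhead", "2048")
{code}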
[jira] [Updated] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
[ https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yash Datta updated SPARK-6471: -- Summary: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns (was: Metastoreschema should only be a subset of parquetSchema to support dropping of columns using replace columns) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns --- Key: SPARK-6471 URL: https://issues.apache.org/jira/browse/SPARK-6471 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yash Datta Fix For: 1.4.0 Currently in the parquet relation 2 implementation, error is thrown in case merged schema is not exactly the same as metastore schema. But to support cases like deletion of column using replace column command, we can relax the restriction so that even if metastore schema is a subset of merged parquet schema, the query will work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6471) Metastoreschema should only be a subset of parquetSchema to support dropping of columns using replace columns
Yash Datta created SPARK-6471: - Summary: Metastoreschema should only be a subset of parquetSchema to support dropping of columns using replace columns Key: SPARK-6471 URL: https://issues.apache.org/jira/browse/SPARK-6471 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yash Datta Fix For: 1.4.0 Currently in the parquet relation 2 implementation, error is thrown in case merged schema is not exactly the same as metastore schema. But to support cases like deletion of column using replace column command, we can relax the restriction so that even if metastore schema is a subset of merged parquet schema, the query will work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4086) Fold-style aggregation for VertexRDD
[ https://issues.apache.org/jira/browse/SPARK-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376341#comment-14376341 ] Apache Spark commented on SPARK-4086: - User 'brennonyork' has created a pull request for this issue: https://github.com/apache/spark/pull/5142 Fold-style aggregation for VertexRDD Key: SPARK-4086 URL: https://issues.apache.org/jira/browse/SPARK-4086 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Ankur Dave VertexRDD currently supports creations and joins only through a reduce-style interface where the user specifies how to merge two conflicting values. We should also support a fold-style interface that takes a default value (possibly of a different type than the input collection) and a fold function specifying how to accumulate values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6469) The YARN driver in yarn-client mode will not use the local directories configured for YARN
[ https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376409#comment-14376409 ] Sean Owen commented on SPARK-6469: -- [~preaudc] would you like to open a small PR to add a note about this? The YARN driver in yarn-client mode will not use the local directories configured for YARN --- Key: SPARK-6469 URL: https://issues.apache.org/jira/browse/SPARK-6469 Project: Spark Issue Type: Documentation Components: YARN Reporter: Christophe PRÉAUD Priority: Minor Attachments: TestYarnVars.scala According to the [Spark YARN doc page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], Spark executors will use the local directories configured for YARN, not {{spark.local.dir}} which should be ignored. However it should be noted that in yarn-client mode, though the executors will indeed use the local directories configured for YARN, the driver will not, because it is not running on the YARN cluster; the driver in yarn-client will use the local directories defined in {{spark.local.dir}} Can this please be clarified in the Spark YARN documentation above? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
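A small sketch (the directory is an assumption) illustrating the note above: in yarn-client mode the driver runs outside the YARN cluster, so its scratch space comes from spark.local.dir, while the executors use YARN's configured local directories.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("yarn-client")
  // Used by the driver, which runs outside the YARN cluster in yarn-client
  // mode; executors ignore this and use the YARN-configured local dirs.
  .set("spark.local.dir", "/data/spark-driver-tmp")
{code}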
[jira] [Commented] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
[ https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376286#comment-14376286 ] Yash Datta commented on SPARK-6471: --- https://github.com/apache/spark/pull/5141 Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns --- Key: SPARK-6471 URL: https://issues.apache.org/jira/browse/SPARK-6471 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yash Datta Fix For: 1.4.0 Currently in the parquet relation 2 implementation, error is thrown in case merged schema is not exactly the same as metastore schema. But to support cases like deletion of column using replace column command, we can relax the restriction so that even if metastore schema is a subset of merged parquet schema, the query will work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6471) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
[ https://issues.apache.org/jira/browse/SPARK-6471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yash Datta updated SPARK-6471: -- Comment: was deleted (was: https://github.com/apache/spark/pull/5141) Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns --- Key: SPARK-6471 URL: https://issues.apache.org/jira/browse/SPARK-6471 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yash Datta Fix For: 1.4.0 Currently in the parquet relation 2 implementation, error is thrown in case merged schema is not exactly the same as metastore schema. But to support cases like deletion of column using replace column command, we can relax the restriction so that even if metastore schema is a subset of merged parquet schema, the query will work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376329#comment-14376329 ] Manoj Kumar commented on SPARK-6192: [~mengxr] Sorry for spamming, but do you have anything else to add? (The deadline is a few days away, hence) Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases
Sean Owen created SPARK-6480: Summary: histogram() bucket function is wrong in some simple edge cases Key: SPARK-6480 URL: https://issues.apache.org/jira/browse/SPARK-6480 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Sean Owen Assignee: Sean Owen (Credit to a customer report here) This test would fail now:
{code}
val rdd = sc.parallelize(Seq(1, 1, 1, 2, 3, 3))
assert(Array(3, 1, 2) === rdd.map(_.toDouble).histogram(3)._2)
{code}
Because it returns 3, 1, 0. The problem ultimately traces to the 'fast' bucket function that judges buckets based on a multiple of the gap between first and second elements. Errors multiply and the end of the final bucket fails to include the max. Fairly plausible use case actually. This can be tightened up easily with a slightly better expression. It will also fix this test, which is actually expecting the wrong answer:
{code}
val rdd = sc.parallelize(6 to 99)
val (histogramBuckets, histogramResults) = rdd.histogram(9)
val expectedHistogramResults = Array(11, 10, 11, 10, 10, 11, 10, 10, 11)
{code}
(Should be {{Array(11, 10, 10, 11, 10, 10, 11, 10, 11)}})
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
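One way to tighten the bucket computation, as a sketch (not necessarily the actual patch): derive each boundary directly from min, max, and the index instead of repeatedly adding the first gap, so rounding error cannot push the max outside the last bucket.
{code}
// Evenly spaced bucket boundaries whose last element is exactly the max.
def customRange(min: Double, max: Double, count: Int): IndexedSeq[Double] = {
  val span = max - min
  (0 to count).map(i => min + (span * i) / count)
}

// customRange(1.0, 3.0, 3) == Vector(1.0, 1.666..., 2.333..., 3.0),
// so 3.0 lands in the final bucket and histogram(3) can return (3, 1, 2).
{code}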
[jira] [Updated] (SPARK-5508) Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API
[ https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5508: Priority: Critical (was: Major) Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API --- Key: SPARK-5508 URL: https://issues.apache.org/jira/browse/SPARK-5508 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Environment: mesos, cdh Reporter: Ayoub Benali Assignee: Cheng Lian Priority: Critical Labels: hivecontext, parquet *The root cause of this bug is explained below ([see here|https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12771559commentId=14368505]).* *The workaround of this issue is to set the following confs*
{code}
sql("set spark.sql.parquet.useDataSourceApi=false")
sql("set spark.sql.hive.convertMetastoreParquet=false")
{code}
*Below is the original description.* When the table is saved as Parquet, we cannot query a field which is an array of structs after an INSERT statement, as shown below:
{noformat}
scala> val data1 = """{
     |   "timestamp": 1422435598,
     |   "data_array": [
     |     {
     |       "field1": 1,
     |       "field2": 2
     |     }
     |   ]
     | }"""

scala> val data2 = """{
     |   "timestamp": 1422435598,
     |   "data_array": [
     |     {
     |       "field1": 3,
     |       "field2": 4
     |     }
     |   ]
     | }"""

scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
scala> val rdd = hiveContext.jsonRDD(jsonRDD)
scala> rdd.printSchema
root
 |-- data_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- field1: integer (nullable = true)
 |    |    |-- field2: integer (nullable = true)
 |-- timestamp: integer (nullable = true)

scala> rdd.registerTempTable("tmp_table")
scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
res3: Array[org.apache.spark.sql.Row] = Array([1], [3])

scala> hiveContext.sql("SET hive.exec.dynamic.partition = true")
scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
scala> hiveContext.sql("set parquet.compression=GZIP")
scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")

scala> hiveContext.sql("create external table if not exists persisted_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>, timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'")
scala> hiveContext.sql("insert into table persisted_table select * from tmp_table").collect
scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect

parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://*/test_table/part-1
	at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
	at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
	at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
	at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
	at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at
{noformat}
[jira] [Updated] (SPARK-5508) Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API
[ https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5508: Assignee: Cheng Lian Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API --- Key: SPARK-5508 URL: https://issues.apache.org/jira/browse/SPARK-5508 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Environment: mesos, cdh Reporter: Ayoub Benali Assignee: Cheng Lian Labels: hivecontext, parquet *The root cause of this bug is explained below ([see here|https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12771559commentId=14368505]).* *The workaround of this issue is to set the following confs*
{code}
sql("set spark.sql.parquet.useDataSourceApi=false")
sql("set spark.sql.hive.convertMetastoreParquet=false")
{code}
*Below is the original description.* When the table is saved as Parquet, we cannot query a field which is an array of structs after an INSERT statement, as shown below:
{noformat}
scala> val data1 = """{
     |   "timestamp": 1422435598,
     |   "data_array": [
     |     {
     |       "field1": 1,
     |       "field2": 2
     |     }
     |   ]
     | }"""

scala> val data2 = """{
     |   "timestamp": 1422435598,
     |   "data_array": [
     |     {
     |       "field1": 3,
     |       "field2": 4
     |     }
     |   ]
     | }"""

scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
scala> val rdd = hiveContext.jsonRDD(jsonRDD)
scala> rdd.printSchema
root
 |-- data_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- field1: integer (nullable = true)
 |    |    |-- field2: integer (nullable = true)
 |-- timestamp: integer (nullable = true)

scala> rdd.registerTempTable("tmp_table")
scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
res3: Array[org.apache.spark.sql.Row] = Array([1], [3])

scala> hiveContext.sql("SET hive.exec.dynamic.partition = true")
scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
scala> hiveContext.sql("set parquet.compression=GZIP")
scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")

scala> hiveContext.sql("create external table if not exists persisted_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>, timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'")
scala> hiveContext.sql("insert into table persisted_table select * from tmp_table").collect
scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect

parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://*/test_table/part-1
	at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
	at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
	at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
	at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
	at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
	at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
{noformat}
[jira] [Updated] (SPARK-6437) SQL ExternalSort should use CompletionIterator to clean up temp files
[ https://issues.apache.org/jira/browse/SPARK-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6437: Assignee: Michael Armbrust SQL ExternalSort should use CompletionIterator to clean up temp files - Key: SPARK-6437 URL: https://issues.apache.org/jira/browse/SPARK-6437 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Michael Armbrust Priority: Critical Right now, temp files used by SQL ExternalSort are not cleaned up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
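For context, a minimal Scala sketch of the approach named in the title, not the actual ExternalSort code: CompletionIterator is Spark's internal (private[spark]) utility, and the parameter names below are assumptions.
{code}
import java.io.File
import org.apache.spark.util.CompletionIterator

// Wrap the sorted iterator so the spill files are deleted once it has been fully consumed.
def cleanupOnCompletion[T](sorted: Iterator[T], spillFiles: Seq[File]): Iterator[T] =
  CompletionIterator[T, Iterator[T]](sorted, spillFiles.foreach(_.delete()))
{code}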
[jira] [Resolved] (SPARK-6124) Support jdbc connection properties in OPTIONS part of the query
[ https://issues.apache.org/jira/browse/SPARK-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6124. - Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 4859 [https://github.com/apache/spark/pull/4859] Support jdbc connection properties in OPTIONS part of the query --- Key: SPARK-6124 URL: https://issues.apache.org/jira/browse/SPARK-6124 Project: Spark Issue Type: Improvement Components: SQL Reporter: Volodymyr Lyubinets Assignee: Volodymyr Lyubinets Priority: Minor Fix For: 1.3.1, 1.4.0 We would like to make it possible to specify connection properties in OPTIONS part of the sql query that uses jdbc driver. For example: CREATE TEMPORARY TABLE abc USING org.apache.spark.sql.jdbc OPTIONS (url '$url', dbtable '(SELECT _ROWID_ FROM test.people)', user 'testUser', password 'testPass') To do this, we will do minor changes in JDBCRelation and JDBCRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6112) Provide OffHeap support through HDFS RAM_DISK
[ https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-6112: -- Attachment: SparkOffheapsupportbyHDFS.pdf Design doc for HDFS off-heap support Provide OffHeap support through HDFS RAM_DISK - Key: SPARK-6112 URL: https://issues.apache.org/jira/browse/SPARK-6112 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Zhan Zhang Attachments: SparkOffheapsupportbyHDFS.pdf The HDFS Lazy_Persist policy makes it possible to cache RDDs off-heap in HDFS. We may want to provide a capability similar to Tachyon by leveraging the HDFS RAM_DISK feature when the user environment does not have Tachyon deployed. This would potentially make it possible to share RDDs in memory across different jobs, and even with jobs other than Spark, and to avoid RDD recomputation if executors crash. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
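For illustration only, this is the user-facing call whose OFF_HEAP storage the proposal would alternatively back with HDFS RAM_DISK; it assumes an existing SparkContext {{sc}} and, today, a configured Tachyon deployment.
{code}
import org.apache.spark.storage.StorageLevel

// Cache an RDD off-heap; the proposal is about what store serves this level,
// not about changing the user-facing API.
val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.OFF_HEAP)
{code}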
[jira] [Updated] (SPARK-6479) Create off-heap block storage API (internal)
[ https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhan Zhang updated SPARK-6479: -- Attachment: SparkOffheapsupportbyHDFS.pdf The design doc also includes stuff from SPARK-6112 Create off-heap block storage API (internal) Key: SPARK-6479 URL: https://issues.apache.org/jira/browse/SPARK-6479 Project: Spark Issue Type: Improvement Components: Block Manager, Spark Core Reporter: Reynold Xin Attachments: SparkOffheapsupportbyHDFS.pdf Would be great to create APIs for off-heap block stores, rather than doing a bunch of if statements everywhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6054) SQL UDF returning object of case class; regression from 1.2.0
[ https://issues.apache.org/jira/browse/SPARK-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-6054: --- Assignee: Michael Armbrust SQL UDF returning object of case class; regression from 1.2.0 - Key: SPARK-6054 URL: https://issues.apache.org/jira/browse/SPARK-6054 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Windows 8, Scala 2.11.2, Spark 1.3.0 RC1 Reporter: Spiro Michaylov Assignee: Michael Armbrust Priority: Blocker The following code fails with a stack trace beginning with: {code} 15/02/26 23:21:32 ERROR Executor: Exception in task 2.0 in stage 7.0 (TID 422) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: scalaUDF(sales#2,discounts#3) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:309) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:237) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:192) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:207) {code} Here is the 1.3.0 version of the code: {code} case class SalesDisc(sales: Double, discounts: Double) def makeStruct(sales: Double, disc: Double) = SalesDisc(sales, disc) sqlContext.udf.register("makeStruct", makeStruct _) val withStruct = sqlContext.sql("SELECT id, sd.sales FROM (SELECT id, makeStruct(sales, discounts) AS sd FROM customerTable) AS d") withStruct.foreach(println) {code} This used to work in 1.2.0. Interestingly, the following simplified version fails similarly, even though it seems to me to be VERY similar to the last test in the UDFSuite: {code} SELECT makeStruct(sales, discounts) AS sd FROM customerTable {code} The data table is defined thus: {code} val custs = Seq( Cust(1, "Widget Co", 12.00, 0.00, "AZ"), Cust(2, "Acme Widgets", 410500.00, 500.00, "CA"), Cust(3, "Widgetry", 410500.00, 200.00, "CA"), Cust(4, "Widgets R Us", 410500.00, 0.0, "CA"), Cust(5, "Ye Olde Widgete", 500.00, 0.0, "MA") ) val customerTable = sc.parallelize(custs, 4).toDF() customerTable.registerTempTable("customerTable") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases
[ https://issues.apache.org/jira/browse/SPARK-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376894#comment-14376894 ] Apache Spark commented on SPARK-6480: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/5148 histogram() bucket function is wrong in some simple edge cases -- Key: SPARK-6480 URL: https://issues.apache.org/jira/browse/SPARK-6480 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Sean Owen Assignee: Sean Owen (Credit to a customer report here) This test would fail now: {code} val rdd = sc.parallelize(Seq(1, 1, 1, 2, 3, 3)) assert(Array(3, 1, 2) === rdd.map(_.toDouble).histogram(3)._2) {code} Because it returns 3, 1, 0. The problem ultimately traces to the 'fast' bucket function that judges buckets based on a multiple of the gap between first and second elements. Errors multiply and the end of the final bucket fails to include the max. Fairly plausible use case actually. This can be tightened up easily with a slightly better expression. It will also fix this test, which is actually expecting the wrong answer: {code} val rdd = sc.parallelize(6 to 99) val (histogramBuckets, histogramResults) = rdd.histogram(9) val expectedHistogramResults = Array(11, 10, 11, 10, 10, 11, 10, 10, 11) {code} (Should be {{Array(11, 10, 10, 11, 10, 10, 11, 10, 11)}}) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
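A minimal sketch, not Spark's actual code, of the tightened-up bucket lookup the description alludes to: derive the bucket index directly from min/max and the bucket count rather than scaling the gap between the first two boundaries, and clamp the maximum value into the last bucket.
{code}
// value: the data point; min/max: the histogram range; bucketCount: number of buckets
def bucketIndex(value: Double, min: Double, max: Double, bucketCount: Int): Int = {
  val i = ((value - min) / (max - min) * bucketCount).toInt
  math.min(i, bucketCount - 1) // the max value belongs to the last (closed) bucket
}
{code}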
[jira] [Updated] (SPARK-4925) Publish Spark SQL hive-thriftserver maven artifact
[ https://issues.apache.org/jira/browse/SPARK-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4925: Assignee: Patrick Wendell Publish Spark SQL hive-thriftserver maven artifact --- Key: SPARK-4925 URL: https://issues.apache.org/jira/browse/SPARK-4925 Project: Spark Issue Type: Improvement Components: Build, SQL Affects Versions: 1.2.1, 1.3.0 Reporter: Alex Liu Assignee: Patrick Wendell Priority: Critical The hive-thriftserver maven artifact is needed for integrating Spark SQL with Cassandra. Can we publish it to maven? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
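Illustrative only: once the artifact is published, a downstream sbt build could depend on it with a single line; the artifact name and version below are assumptions.
{code}
// build.sbt
libraryDependencies += "org.apache.spark" %% "spark-hive-thriftserver" % "1.3.1"
{code}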
[jira] [Commented] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService
[ https://issues.apache.org/jira/browse/SPARK-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376907#comment-14376907 ] Jeffrey Turpin commented on SPARK-6373: --- Hey Aaron, Thanks for the feedback! I definitely agree we should find a common way to support both and your proposal sounds good to me. That being said, do you want me to take a cut at doing this, integrating the work I have already done (where applicable)? Definitely don't want to lose traction on this and would like to get it into a release sooner rather than later Let me know your thoughts... Cheers! Jeff Add SSL/TLS for the Netty based BlockTransferService - Key: SPARK-6373 URL: https://issues.apache.org/jira/browse/SPARK-6373 Project: Spark Issue Type: New Feature Components: Block Manager, Shuffle Affects Versions: 1.2.1 Reporter: Jeffrey Turpin Add the ability to allow for secure communications (SSL/TLS) for the Netty based BlockTransferService and the ExternalShuffleClient. This ticket will hopefully start the conversation around potential designs... Below is a reference to a WIP prototype which implements this functionality (prototype)... I have attempted to disrupt as little code as possible and tried to follow the current code structure (for the most part) in the areas I modified. I also studied how Hadoop achieves encrypted shuffle (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html) https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
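A rough sketch of the core idea, not the prototype's actual code: securing the Netty transport comes down to installing an SslHandler first in the channel pipeline on both the client and the server side.
{code}
import javax.net.ssl.SSLContext
import io.netty.channel.socket.SocketChannel
import io.netty.handler.ssl.SslHandler

// Install TLS on a freshly initialized channel; the SSLContext would be built
// from the configured keystore/truststore.
def addSslHandler(ch: SocketChannel, sslContext: SSLContext, isClient: Boolean): Unit = {
  val engine = sslContext.createSSLEngine()
  engine.setUseClientMode(isClient)
  ch.pipeline().addFirst("ssl", new SslHandler(engine))
}
{code}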
[jira] [Created] (SPARK-6481) Set In Progress when a PR is opened for an issue
Michael Armbrust created SPARK-6481: --- Summary: Set In Progress when a PR is opened for an issue Key: SPARK-6481 URL: https://issues.apache.org/jira/browse/SPARK-6481 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Michael Armbrust Assignee: Nicholas Chammas [~pwendell] and I are not sure if this is possible, but it would be really helpful if the JIRA status was updated to In Progress when we do the linking to an open pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6483) Spark SQL udf(ScalaUdf) is very slow
zzc created SPARK-6483: -- Summary: Spark SQL udf(ScalaUdf) is very slow Key: SPARK-6483 URL: https://issues.apache.org/jira/browse/SPARK-6483 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.4.0 Environment: 1. Spark version is 1.3.0 2. 3 nodes, each with 80 GB memory / 20 cores 3. reads 250 GB of Parquet files from HDFS Reporter: zzc Test case: 1. Register the floor function with sqlContext.udf.register("floor", (ts: Int) => ts - ts % 300), then run the SQL "select chan, floor(ts) as tt, sum(size) from qlogbase3 group by chan, floor(ts)"; *it takes 17 minutes.* {quote} == Physical Plan == Aggregate false, [chan#23015,PartialGroup#23500], [chan#23015,PartialGroup#23500 AS tt#23494,CombineSum(PartialSum#23499L) AS c2#23495L] Exchange (HashPartitioning [chan#23015,PartialGroup#23500], 54) Aggregate true, [chan#23015,scalaUDF(ts#23016)], [chan#23015,*scalaUDF*(ts#23016) AS PartialGroup#23500,SUM(size#23023L) AS PartialSum#23499L] PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[115] at map at newParquet.scala:562 {quote} 2. Run the SQL "select chan, (ts - ts % 300) as tt, sum(size) from qlogbase3 group by chan, (ts - ts % 300)"; *it takes only 5 minutes.* {quote} == Physical Plan == Aggregate false, [chan#23015,PartialGroup#23349], [chan#23015,PartialGroup#23349 AS tt#23343,CombineSum(PartialSum#23348L) AS c2#23344L] Exchange (HashPartitioning [chan#23015,PartialGroup#23349], 54) Aggregate true, [chan#23015,(ts#23016 - (ts#23016 % 300))], [chan#23015,*(ts#23016 - (ts#23016 % 300))* AS PartialGroup#23349,SUM(size#23023L) AS PartialSum#23348L] PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[83] at map at newParquet.scala:562 {quote} 3. Use *HiveContext* with the SQL "select chan, floor((ts - ts % 300)) as tt, sum(size) from qlogbase3 group by chan, floor((ts - ts % 300))"; *it takes only 5 minutes too.* {quote} == Physical Plan == Aggregate false, [chan#23015,PartialGroup#23108L], [chan#23015,PartialGroup#23108L AS tt#23102L,CombineSum(PartialSum#23107L) AS _c2#23103L] Exchange (HashPartitioning [chan#23015,PartialGroup#23108L], 54) Aggregate true, [chan#23015,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016 - (ts#23016 % 300)))], [chan#23015,*HiveGenericUdf*#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016 - (ts#23016 % 300))) AS PartialGroup#23108L,SUM(size#23023L) AS PartialSum#23107L] PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[28] at map at newParquet.scala:562 {quote} *Why is ScalaUdf so slow? How can it be improved?* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6112) Provide OffHeap support through HDFS RAM_DISK
[ https://issues.apache.org/jira/browse/SPARK-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376849#comment-14376849 ] Reynold Xin commented on SPARK-6112: [~zhanzhang] I created https://issues.apache.org/jira/browse/SPARK-6479 as the first step of this: to create an API. Do you mind uploading the API part of this design doc (or even just the entire design doc) to that ticket? Thanks. Provide OffHeap support through HDFS RAM_DISK - Key: SPARK-6112 URL: https://issues.apache.org/jira/browse/SPARK-6112 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Zhan Zhang Attachments: SparkOffheapsupportbyHDFS.pdf HDFS Lazy_Persist policy provide possibility to cache the RDD off_heap in hdfs. We may want to provide similar capacity to Tachyon by leveraging hdfs RAM_DISK feature, if the user environment does not have tachyon deployed. With this feature, it potentially provides possibility to share RDD in memory across different jobs and even share with jobs other than spark, and avoid the RDD recomputation if executors crash. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2973) Use LocalRelation for all ExecutedCommands, avoid job for take/collect()
[ https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2973: Priority: Blocker (was: Critical) Use LocalRelation for all ExecutedCommands, avoid job for take/collect() Key: SPARK-2973 URL: https://issues.apache.org/jira/browse/SPARK-2973 Project: Spark Issue Type: Improvement Components: SQL Reporter: Aaron Davidson Assignee: Cheng Lian Priority: Blocker Fix For: 1.2.0 Right now, sql(show tables).collect() will start a Spark job which shows up in the UI. There should be a way to get these without this step. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2973) Use LocalRelation for all ExecutedCommands, avoid job for take/collect()
[ https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2973: Summary: Use LocalRelation for all ExecutedCommands, avoid job for take/collect() (was: Add a way to show tables without executing a job) Use LocalRelation for all ExecutedCommands, avoid job for take/collect() Key: SPARK-2973 URL: https://issues.apache.org/jira/browse/SPARK-2973 Project: Spark Issue Type: Improvement Components: SQL Reporter: Aaron Davidson Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 Right now, sql(show tables).collect() will start a Spark job which shows up in the UI. There should be a way to get these without this step. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6482) Remove synchronization of Hive Native commands
David Ross created SPARK-6482: - Summary: Remove synchronization of Hive Native commands Key: SPARK-6482 URL: https://issues.apache.org/jira/browse/SPARK-6482 Project: Spark Issue Type: Improvement Reporter: David Ross As discussed in https://issues.apache.org/jira/browse/SPARK-4908, concurrent hive native commands run into thread-safety issues with {{org.apache.hadoop.hive.ql.Driver}}. The quick-fix was to synchronize calls to {{runHive}}: https://github.com/apache/spark/commit/480bd1d2edd1de06af607b0cf3ff3c0b16089add However, if the hive native command is long-running, this can block subsequent queries if they have native dependencies. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
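A conceptual sketch of the behaviour described above, not the actual HiveContext source: because every native command goes through one synchronized method, a single long-running command serializes all concurrent callers.
{code}
// Stand-in object and method; names are illustrative only.
object NativeCommandRunner {
  def runHive(cmd: String): Seq[String] = this.synchronized {
    // ... hand the command string to org.apache.hadoop.hive.ql.Driver ...
    Seq.empty[String]
  }
}
{code}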
[jira] [Created] (SPARK-6478) new RDD.pipeWithPartition method
Maxim Ivanov created SPARK-6478: --- Summary: new RDD.pipeWithPartition method Key: SPARK-6478 URL: https://issues.apache.org/jira/browse/SPARK-6478 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Maxim Ivanov Priority: Minor This method allows building the command line args and the process environment map using the partition as an argument. The use case for this feature is to provide additional information about the partition to the spawned application in cases where the partitioner provides it (as in the Cassandra connector, or when a custom partitioner/RDD is used). It also provides a simpler and more intuitive alternative to the printPipeContext function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
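An illustrative sketch of what pipe() allows today versus the per-partition variant proposed above; the pipeWithPartition shape shown in the comment is hypothetical, inferred from the description rather than taken from the pull request.
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("pipe-demo").setMaster("local[2]"))
val rdd = sc.parallelize(1 to 100, 4)

// Today: one command line and one environment map shared by every partition; any
// partition-specific context has to be written to stdin via printPipeContext.
val today = rdd.pipe(Seq("cat"), Map("ROLE" -> "worker"))
today.count()

// Proposed (hypothetical shape):
// val proposed = rdd.pipeWithPartition(
//   command = p => Seq("process-shard", p.index.toString),
//   env     = p => Map("SHARD_ID" -> p.index.toString))
{code}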
[jira] [Commented] (SPARK-6478) new RDD.pipeWithPartition method
[ https://issues.apache.org/jira/browse/SPARK-6478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376827#comment-14376827 ] Apache Spark commented on SPARK-6478: - User 'redbaron' has created a pull request for this issue: https://github.com/apache/spark/pull/5147 new RDD.pipeWithPartition method Key: SPARK-6478 URL: https://issues.apache.org/jira/browse/SPARK-6478 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Maxim Ivanov Priority: Minor Labels: pipe This method allows building the command line args and the process environment map using the partition as an argument. The use case for this feature is to provide additional information about the partition to the spawned application in cases where the partitioner provides it (as in the Cassandra connector, or when a custom partitioner/RDD is used). It also provides a simpler and more intuitive alternative to the printPipeContext function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376833#comment-14376833 ] Allan Douglas R. de Oliveira edited comment on SPARK-5928 at 3/23/15 10:54 PM: --- I will answer with the info I have right now, later I'll try to get more information: a) I don't have the exact number now, but it's something like 300G of shuffle read/write b) It was working with 1600 partitions before this started to happen, then after the error we increased to 1 but got pretty much the same exception. c) Yes, it happens some minutes later generally. We've put the job back in production changing to lz4. But I feel that the problem will come back. d) Yes, and I think that issue also happens with other errors (e.g. if the BlockManager times out because of too much GC) was (Author: douglaz): I will answer with the info I have right now, later I'll try to get more information: a) I don't have the exact number now, but it's something like 300G of shuffle read/write b) It was working with 1600 partitions before this started to happen, then after the error we increased to 1 but got pretty much the same exception. c) Yes, it happens some minutes later generally. We've got the job back in production changing to lz4. But I feel that the problem will come back. d) Yes, and I think that issue also happens with other errors (e.g. if the BlockManager times out because of too much GC) Remote Shuffle Blocks cannot be more than 2 GB -- Key: SPARK-5928 URL: https://issues.apache.org/jira/browse/SPARK-5928 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid If a shuffle block is over 2GB, the shuffle fails, with an uninformative exception. The tasks get retried a few times and then eventually the job fails. Here is an example program which can cause the exception: {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => val n = 3e3.toInt val arr = new Array[Byte](n) // need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x => (1, x)}.groupByKey().count() {code} Note that you can't trigger this exception in local mode, it only happens on remote fetches.
I triggered these exceptions running with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {noformat} 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) at
[jira] [Created] (SPARK-6479) Create off-heap block storage API
Reynold Xin created SPARK-6479: -- Summary: Create off-heap block storage API Key: SPARK-6479 URL: https://issues.apache.org/jira/browse/SPARK-6479 Project: Spark Issue Type: Improvement Components: Block Manager, Spark Core Reporter: Reynold Xin Would be great to create APIs for off-heap block stores, rather than doing a bunch of if statements everywhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6450) Native Parquet reader does not assign table name as qualifier
[ https://issues.apache.org/jira/browse/SPARK-6450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6450: Summary: Native Parquet reader does not assign table name as qualifier (was: Self joining query failure) Native Parquet reader does not assign table name as qualifier - Key: SPARK-6450 URL: https://issues.apache.org/jira/browse/SPARK-6450 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Anand Mohan Tumuluri Assignee: Cheng Lian Priority: Blocker The below query was working fine till 1.3 commit 9a151ce58b3e756f205c9f3ebbbf3ab0ba5b33fd.(Yes it definitely works at this commit although this commit is completely unrelated) It got broken in 1.3.0 release with an AnalysisException: resolved attributes ... missing from (although this list contains the fields which it reports missing) {code} at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:189) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy17.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} {code} select Orders.Country, Orders.ProductCategory,count(1) from Orders join (select Orders.Country, count(1) CountryOrderCount from Orders where to_date(Orders.PlacedDate) '2015-01-01' group by Orders.Country order by CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = Orders.Country where to_date(Orders.PlacedDate) '2015-01-01' group by Orders.Country,Orders.ProductCategory; {code} The temporary 
workaround is to add an explicit alias for the table Orders: {code} select o.Country, o.ProductCategory, count(1) from Orders o join (select r.Country, count(1) CountryOrderCount from Orders r where to_date(r.PlacedDate) > '2015-01-01' group by r.Country order by CountryOrderCount DESC LIMIT 5) Top5Countries on Top5Countries.Country = o.Country where to_date(o.PlacedDate) > '2015-01-01' group by o.Country, o.ProductCategory; {code} However, this change not only affects self joins; it also seems to affect union queries as well. For example the query below, which was again working before (commit 9a151ce), is now broken: {code} select Orders.Country, null, count(1) OrderCount from Orders group by Orders.Country, null union all select null, Orders.ProductCategory, count(1) OrderCount from Orders group by null, Orders.ProductCategory {code} It also fails with an AnalysisException. The workaround is to add different aliases for the tables. -- This message was sent
[jira] [Commented] (SPARK-6479) Create off-heap block storage API (internal)
[ https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376978#comment-14376978 ] Zhan Zhang commented on SPARK-6479: --- The current API may not be good enough, as it has some redundant interfaces and is mainly a POC to make HDFS work. After migrating Tachyon to these new APIs, we should have a better version. I will update the doc later. But I do agree that unifying the off-heap API should be a separate JIRA, and we can start from here. Create off-heap block storage API (internal) Key: SPARK-6479 URL: https://issues.apache.org/jira/browse/SPARK-6479 Project: Spark Issue Type: Improvement Components: Block Manager, Spark Core Reporter: Reynold Xin Attachments: SparkOffheapsupportbyHDFS.pdf Would be great to create APIs for off-heap block stores, rather than doing a bunch of if statements everywhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377034#comment-14377034 ] Nicholas Chammas commented on SPARK-6481: - I'm guessing this will be done via [github_jira_sync.py|https://github.com/apache/spark/blob/master/dev/github_jira_sync.py]. OK, will take a look this week. Set In Progress when a PR is opened for an issue -- Key: SPARK-6481 URL: https://issues.apache.org/jira/browse/SPARK-6481 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Michael Armbrust Assignee: Nicholas Chammas [~pwendell] and I are not sure if this is possible, but it would be really helpful if the JIRA status was updated to In Progress when we do the linking to an open pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6122) Upgrade Tachyon dependency to 0.6.0
[ https://issues.apache.org/jira/browse/SPARK-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376813#comment-14376813 ] Calvin Jia commented on SPARK-6122: --- [~pwendell] Are you referring to the issues here: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/1940/ ? Upgrade Tachyon dependency to 0.6.0 --- Key: SPARK-6122 URL: https://issues.apache.org/jira/browse/SPARK-6122 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Haoyuan Li Assignee: Calvin Jia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376833#comment-14376833 ] Allan Douglas R. de Oliveira commented on SPARK-5928: - I will answer with the info I have right know, later I'll try to get more information: a) I don't have the exact number now, but it's something like 300G of shuffle read/write b) It was working with 1600 partitions before this started to happen, then after the error we increased to 1 but got pretty much the same exception. c) Yes, it happens some minutes later generally. We've got the job back in production changing to lz4. But I feel that the problem will come back. d) Yes, and I think that issue also happens with other errors (e.g if the BlockManager timeouts because of too much GC) Remote Shuffle Blocks cannot be more than 2 GB -- Key: SPARK-5928 URL: https://issues.apache.org/jira/browse/SPARK-5928 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid If a shuffle block is over 2GB, the shuffle fails, with an uninformative exception. The tasks get retried a few times and then eventually the job fails. Here is an example program which can cause the exception: {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore = val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x = (1, x)}.groupByKey().count() {code} Note that you can't trigger this exception in local mode, it only happens on remote fetches. I triggered these exceptions running with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {noformat} 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted 
frame length exceeds 2147483647: 3021252889 - discarded at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at
[jira] [Updated] (SPARK-6479) Create off-heap block storage API (internal)
[ https://issues.apache.org/jira/browse/SPARK-6479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6479: --- Summary: Create off-heap block storage API (internal) (was: Create off-heap block storage API) Create off-heap block storage API (internal) Key: SPARK-6479 URL: https://issues.apache.org/jira/browse/SPARK-6479 Project: Spark Issue Type: Improvement Components: Block Manager, Spark Core Reporter: Reynold Xin Would be great to create APIs for off-heap block stores, rather than doing a bunch of if statements everywhere. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6369) InsertIntoHiveTable should use logic from SparkHadoopWriter
[ https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6369: Assignee: Cheng Lian InsertIntoHiveTable should use logic from SparkHadoopWriter --- Key: SPARK-6369 URL: https://issues.apache.org/jira/browse/SPARK-6369 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Right now it is possible that we will corrupt the output if there is a race between competing speculative tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1684) Merge script should standardize SPARK-XXX prefix
[ https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377108#comment-14377108 ] Apache Spark commented on SPARK-1684: - User 'texasmichelle' has created a pull request for this issue: https://github.com/apache/spark/pull/5149 Merge script should standardize SPARK-XXX prefix Key: SPARK-1684 URL: https://issues.apache.org/jira/browse/SPARK-1684 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Minor Labels: starter If users write "[SPARK-XXX] Issue" or "SPARK-XXX. Issue" or "SPARK XXX: Issue", we should convert it to "SPARK-XXX: Issue" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6100) Distributed linear algebra in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377208#comment-14377208 ] Xiangrui Meng commented on SPARK-6100: -- We don't have APIs for distributed matrices in Python. Now we have MatrixUDT merged, it would be nice to create RowMatrix/CoordinateMatrix/BlockMatrix and use DataFrames for serialization. Do you want to start with that task? Distributed linear algebra in PySpark/MLlib --- Key: SPARK-6100 URL: https://issues.apache.org/jira/browse/SPARK-6100 Project: Spark Issue Type: Umbrella Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng This is an umbrella JIRA for the Python API of distributed linear algebra in MLlib. The goal is to make Python API on par with the Scala/Java API. We should try wrapping Scala implementations as much as possible, instead of implementing them in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
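For reference, a minimal sketch of the existing Scala RowMatrix API that the Python wrappers would mirror; it assumes an existing SparkContext {{sc}}.
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// A RowMatrix is just an RDD of vectors plus (lazily computed) dimensions.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0)))
val mat = new RowMatrix(rows)
println(s"${mat.numRows()} x ${mat.numCols()}")
{code}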
[jira] [Created] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix
Xiangrui Meng created SPARK-6488: Summary: Support addition/multiplication in PySpark's BlockMatrix Key: SPARK-6488 URL: https://issues.apache.org/jira/browse/SPARK-6488 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We should reuse the Scala implementation instead of having a separate implementation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS
[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377249#comment-14377249 ] Ryan Williams commented on SPARK-6449: -- It doesn't look like it; [here is a gist|https://gist.github.com/ryan-williams/ff74066c127546910cac] with the entire file (8M), and the last 1000 lines, fwiw. Driver OOM results in reported application result SUCCESS - Key: SPARK-6449 URL: https://issues.apache.org/jira/browse/SPARK-6449 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Reporter: Ryan Williams I ran a job yesterday that according to the History Server and YARN RM finished with status {{SUCCESS}}. Clicking around on the history server UI, there were too few stages run, and I couldn't figure out why that would have been. Finally, inspecting the end of the driver's logs, I saw: {code} 15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext Exception in thread Driver scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit exceeded (of class java.lang.OutOfMemoryError) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485) 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0, (reason: Shutdown hook called before final status was reported.) 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before final status was reported.) 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down. 15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered. 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1426705269584_0055 {code} The driver OOM'd, [the {{catch}} block that presumably should have caught it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484] threw a {{MatchError}}, and then {{SUCCESS}} was returned to YARN and written to the event log. This should be logged as a failed job and reported as such to YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
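A minimal illustration of the failure mode in the log above, not the actual ApplicationMaster code: a match over the failure cause that only covers Exception subclasses turns a java.lang.OutOfMemoryError into a scala.MatchError, so the FAILED status is never reported; adding a Throwable case avoids that.
{code}
// Names and return codes are illustrative only.
def classify(cause: Throwable): Int = cause match {
  case _: InterruptedException => 0 // normal shutdown
  case _: Exception            => 1 // user code failed
  // no case for java.lang.Error => scala.MatchError at runtime
}

def classifySafely(cause: Throwable): Int = cause match {
  case _: InterruptedException => 0
  case _: Exception            => 1
  case _: Throwable            => 2 // OutOfMemoryError and friends end up here
}
{code}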
[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS
[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377251#comment-14377251 ] Ryan Williams commented on SPARK-6449: -- Seems like this was fixed as of [SPARK-6018|https://issues.apache.org/jira/browse/SPARK-6018], closing Driver OOM results in reported application result SUCCESS - Key: SPARK-6449 URL: https://issues.apache.org/jira/browse/SPARK-6449 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Reporter: Ryan Williams I ran a job yesterday that according to the History Server and YARN RM finished with status {{SUCCESS}}. Clicking around on the history server UI, there were too few stages run, and I couldn't figure out why that would have been. Finally, inspecting the end of the driver's logs, I saw: {code} 15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext Exception in thread Driver scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit exceeded (of class java.lang.OutOfMemoryError) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485) 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0, (reason: Shutdown hook called before final status was reported.) 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before final status was reported.) 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down. 15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered. 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1426705269584_0055 {code} The driver OOM'd, [the {{catch}} block that presumably should have caught it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484] threw a {{MatchError}}, and then {{SUCCESS}} was returned to YARN and written to the event log. This should be logged as a failed job and reported as such to YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6475) DataFrame should support array types when creating DFs from JavaBeans.
[ https://issues.apache.org/jira/browse/SPARK-6475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6475: Component/s: (was: DataFrame) DataFrame should support array types when creating DFs from JavaBeans. -- Key: SPARK-6475 URL: https://issues.apache.org/jira/browse/SPARK-6475 Project: Spark Issue Type: Improvement Components: SQL Reporter: Xiangrui Meng Assignee: Xiangrui Meng If we have a JavaBean class with array fields, SQL throws an exception in `createDataFrame` because arrays are not matched in `getSchema` from a JavaBean class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5941) `def table` is not using the unresolved logical plan `UnresolvedRelation`
[ https://issues.apache.org/jira/browse/SPARK-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5941: Component/s: (was: DataFrame) SQL `def table` is not using the unresolved logical plan `UnresolvedRelation` - Key: SPARK-5941 URL: https://issues.apache.org/jira/browse/SPARK-5941 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6484) Ganglia metrics xml reporter doesn't escape correctly
[ https://issues.apache.org/jira/browse/SPARK-6484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377080#comment-14377080 ] Josh Rosen commented on SPARK-6484: --- To provide some extra context for this JIRA, I think the problem here is that the Spark driver's executor ID is {{<driver>}}, which gets included in the metrics names, breaking things when it's not escaped properly. Given that this {{<driver>}} has caused problems in other places, too, I'd suggest that we change it to something else (maybe just {{driver}}). If we consider this identifier to be a stable public API, though, then I guess we'd have to resort to fixing the escaping. Ganglia metrics xml reporter doesn't escape correctly - Key: SPARK-6484 URL: https://issues.apache.org/jira/browse/SPARK-6484 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Michael Armbrust Assignee: Josh Rosen Priority: Critical The following should be escaped: {code} " &quot; ' &apos; < &lt; > &gt; & &amp; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
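If the escaping route is taken, a minimal sketch of the substitutions involved; the helper name is illustrative, not Spark's API, and the ampersand must be replaced first so already-escaped text is not double-escaped.
{code}
def escapeXml(s: String): String =
  s.replace("&", "&amp;")
   .replace("<", "&lt;")
   .replace(">", "&gt;")
   .replace("\"", "&quot;")
   .replace("'", "&apos;")
{code}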
[jira] [Comment Edited] (SPARK-5368) Spark should support NAT (via akka improvements)
[ https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375163#comment-14375163 ] jay vyas edited comment on SPARK-5368 at 3/24/15 1:59 AM: -- Okay, to close *optionally* with the close someone could PR an update to spark {{docs/configuration.md}} as part of closing, to clarify how the SPARK_LOCAL_HOSTNAME (edit) changes akka's behavior, so that the context is clarified... was (Author: jayunit100): Okay, to close *optionally* with the close someone could PR an update to spark {{docs/configuration.md}} as part of closing, to clarify how the {{LOCAL_IP}} changes akka's behavior, so that the context is clarified... Spark should support NAT (via akka improvements) - Key: SPARK-5368 URL: https://issues.apache.org/jira/browse/SPARK-5368 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: jay vyas Fix For: 1.2.2 Spark sets up actors for akka with a set of variables which are defined in the {{AkkaUtils.scala}} class. A snippet: {noformat} 98 |akka.loggers = [akka.event.slf4j.Slf4jLogger] 99 |akka.stdout-loglevel = ERROR 100 |akka.jvm-exit-on-fatal-error = off 101 |akka.remote.require-cookie = $requireCookie 102 |akka.remote.secure-cookie = $secureCookie {noformat} We should allow users to pass in custom settings, for example, so that arbitrary akka modifications can be used at runtime for security, performance, logging, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5368) Spark should support NAT (via akka improvements)
[ https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377096#comment-14377096 ] Matthew Farrellee commented on SPARK-5368: -- [~jayunit100] the relevant config is {{LOCAL_HOSTNAME}} Spark should support NAT (via akka improvements) - Key: SPARK-5368 URL: https://issues.apache.org/jira/browse/SPARK-5368 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: jay vyas Fix For: 1.2.2 Spark sets up actors for akka with a set of variables which are defined in the {{AkkaUtils.scala}} class. A snippet: {noformat} 98 |akka.loggers = [akka.event.slf4j.Slf4jLogger] 99 |akka.stdout-loglevel = ERROR 100 |akka.jvm-exit-on-fatal-error = off 101 |akka.remote.require-cookie = $requireCookie 102 |akka.remote.secure-cookie = $secureCookie {noformat} We should allow users to pass in custom settings, for example, so that arbitrary akka modifications can be used at runtime for security, performance, logging, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6361) Support adding a column with metadata in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377187#comment-14377187 ] Apache Spark commented on SPARK-6361: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/5151 Support adding a column with metadata in DataFrames --- Key: SPARK-6361 URL: https://issues.apache.org/jira/browse/SPARK-6361 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng There is no easy way to add a column with metadata in DataFrames. This is required by ML transformers to generate ML attributes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
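A hedged sketch of the kind of usage this issue asks for, assuming an existing DataFrame {{df}} with label/features columns and a Column.as(alias, metadata) overload; the metadata key and value below are purely illustrative.
{code}
import org.apache.spark.sql.types.MetadataBuilder

// Attach metadata while selecting/renaming a column.
val meta = new MetadataBuilder().putString("ml_attr", "binary_label").build()
val withMeta = df.select(df("label").as("label", meta), df("features"))
withMeta.schema("label").metadata // carries the attached metadata
{code}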
[jira] [Created] (SPARK-6485) Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark
Xiangrui Meng created SPARK-6485: Summary: Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark Key: SPARK-6485 URL: https://issues.apache.org/jira/browse/SPARK-6485 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng We should add APIs for CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark. Internally, we can use DataFrames for serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6486) Add BlockMatrix in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6486: - Description: We should add BlockMatrix to PySpark. Internally, we can use DataFrames and MatrixUDT for serialization. This JIRA should contain conversions between IndexedRowMatrix/CoordinateMatrix to block matrices. But this does NOT cover linear algebra operations of block matrices. (was: We should add BlockMatrix to PySpark. Internally, we can use DataFrames and MatrixUDT for serialization. This JIRA does NOT cover linear algebra operations of block matrices.) Add BlockMatrix in PySpark -- Key: SPARK-6486 URL: https://issues.apache.org/jira/browse/SPARK-6486 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng We should add BlockMatrix to PySpark. Internally, we can use DataFrames and MatrixUDT for serialization. This JIRA should contain conversions between IndexedRowMatrix/CoordinateMatrix to block matrices. But this does NOT cover linear algebra operations of block matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib
Zhang JiaJin created SPARK-6487: --- Summary: Add sequential pattern mining algorithm to Spark MLlib Key: SPARK-6487 URL: https://issues.apache.org/jira/browse/SPARK-6487 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Zhang JiaJin Sequential pattern mining is an important branch of pattern mining. In our past work we used sequence mining (mainly the PrefixSpan algorithm) to find telecommunication signaling sequence patterns and achieved good results. But once the data grows too large, the running time becomes too long and can no longer meet the service requirements. We plan to implement the PrefixSpan algorithm in Spark and apply it to our subsequent work. Related paper: Distributed PrefixSpan algorithm based on MapReduce. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns
Konstantin Shaposhnikov created SPARK-6489: -- Summary: Optimize lateral view with explode to not read unnecessary columns Key: SPARK-6489 URL: https://issues.apache.org/jira/browse/SPARK-6489 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Konstantin Shaposhnikov Currently a query with lateral view explode(...) results in an execution plan that reads all columns of the underlying RDD. E.g. given that the *ppl* table is a DF created from the Person case class: {code} case class Person(val name: String, val age: Int, val data: Array[Int]) {code} the following SQL: {code} select name, sum(d) from ppl lateral view explode(data) d as d group by name {code} executes as follows: {noformat} == Physical Plan == Aggregate false, [name#0], [name#0,SUM(PartialSum#8L) AS _c1#3L] Exchange (HashPartitioning [name#0], 200) Aggregate true, [name#0], [name#0,SUM(CAST(d#6, LongType)) AS PartialSum#8L] Project [name#0,d#6] Generate explode(data#2), true, false PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35 {noformat} Note that the *age* column is not needed to produce the output but it is still read from the underlying RDD. A sample program to demonstrate the issue: {code} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.hive.HiveContext case class Person(val name: String, val age: Int, val data: Array[Int]) object ExplodeDemo extends App { val ppl = Array( Person("A", 20, Array(10, 12, 19)), Person("B", 25, Array(7, 8, 4)), Person("C", 19, Array(12, 4, 232))) val conf = new SparkConf().setMaster("local[2]").setAppName("sql") val sc = new SparkContext(conf) val sqlCtx = new HiveContext(sc) import sqlCtx.implicits._ val df = sc.makeRDD(ppl).toDF df.registerTempTable("ppl") val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name") s.explain(true) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
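Until the planner prunes columns below Generate, one possible workaround is to project only the needed columns before the lateral view; whether this actually trims the scan depends on the underlying relation (it can help columnar sources, while an ExistingRDD still materializes whole rows). A sketch using the df and sqlCtx from the sample program above:
{code}
// Hypothetical workaround: drop unused columns before exploding.
val pruned = df.select("name", "data")
pruned.registerTempTable("ppl_pruned")
sqlCtx.sql("select name, sum(d) from ppl_pruned lateral view explode(data) d as d group by name")
{code}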
[jira] [Commented] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377234#comment-14377234 ] Xiangrui Meng commented on SPARK-6487: -- [~Zhang JiaJin] I'm not very familiar with pattern mining, but I don't see many citations of the paper you mentioned. So I need more information to understand the importance/popularity of sequential pattern mining and whether there exist really scalable algorithms. If there are not many requests for this feature or there are no scalable algorithms, you can certainly register your implementation as a third-party package on spark-packages.org and maintain it outside Spark for users. Add sequential pattern mining algorithm to Spark MLlib -- Key: SPARK-6487 URL: https://issues.apache.org/jira/browse/SPARK-6487 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Zhang JiaJin [~mengxr] [~zhangyouhua] Sequential pattern mining is an important branch of pattern mining. In our past work we used sequence mining (mainly the PrefixSpan algorithm) to find telecommunication signaling sequence patterns and achieved good results. But once the data grows too large, the running time becomes too long and can no longer meet the service requirements. We plan to implement the PrefixSpan algorithm in Spark and apply it to our subsequent work. Related paper: Distributed PrefixSpan algorithm based on MapReduce. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns
[ https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shaposhnikov updated SPARK-6489: --- Description: Currently a query with lateral view explode(...) results in an execution plan that reads all columns of the underlying RDD. E.g. given *ppl* table is DF created from Person case class: {code} case class Person(val name: String, val age: Int, val data: Array[Int]) {code} the following SQL: {code} select name, sum(d) from ppl lateral view explode(data) d as d group by name {code} executes as follows: {noformat} == Physical Plan == Aggregate false, [name#0], [name#0,SUM(PartialSum#38L) AS _c1#18L] Exchange (HashPartitioning [name#0], 200) Aggregate true, [name#0], [name#0,SUM(CAST(d#21, LongType)) AS PartialSum#38L] Project [name#0,d#21] Generate explode(data#2), true, false InMemoryColumnarTableScan [name#0,age#1,data#2], [], (InMemoryRelation [name#0,age#1,data#2], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35), Some(ppl)) {noformat} Note that *age* column is not needed to produce the output but it is still read from the underlying RDD. A sample program to demonstrate the issue: {code} case class Person(val name: String, val age: Int, val data: Array[Int]) object ExplodeDemo extends App { val ppl = Array( Person(A, 20, Array(10, 12, 19)), Person(B, 25, Array(7, 8, 4)), Person(C, 19, Array(12, 4, 232))) val conf = new SparkConf().setMaster(local[2]).setAppName(sql) val sc = new SparkContext(conf) val sqlCtx = new HiveContext(sc) import sqlCtx.implicits._ val df = sc.makeRDD(ppl).toDF df.registerTempTable(ppl) sqlCtx.cacheTable(ppl) // cache table otherwise ExistingRDD will be used that do not support column pruning val s = sqlCtx.sql(select name, sum(d) from ppl lateral view explode(data) d as d group by name) s.explain(true) } {code} was: Currently a query with lateral view explode(...) results in an execution plan that reads all columns of the underlying RDD. E.g. given *ppl* table is DF created from Person case class: {code} case class Person(val name: String, val age: Int, val data: Array[Int]) {code} the following SQL: {code} select name, sum(d) from ppl lateral view explode(data) d as d group by name {code} executes as follows: {noformat} == Physical Plan == Aggregate false, [name#0], [name#0,SUM(PartialSum#8L) AS _c1#3L] Exchange (HashPartitioning [name#0], 200) Aggregate true, [name#0], [name#0,SUM(CAST(d#6, LongType)) AS PartialSum#8L] Project [name#0,d#6] Generate explode(data#2), true, false PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35 {noformat} Note that *age* column is not needed to produce the output but it is still read from the underlying RDD. 
A sample program to demonstrate the issue: {code} case class Person(val name: String, val age: Int, val data: Array[Int]) object ExplodeDemo extends App { val ppl = Array( Person(A, 20, Array(10, 12, 19)), Person(B, 25, Array(7, 8, 4)), Person(C, 19, Array(12, 4, 232))) val conf = new SparkConf().setMaster(local[2]).setAppName(sql) val sc = new SparkContext(conf) val sqlCtx = new HiveContext(sc) import sqlCtx.implicits._ val df = sc.makeRDD(ppl).toDF df.registerTempTable(ppl) val s = sqlCtx.sql(select name, sum(d) from ppl lateral view explode(data) d as d group by name) s.explain(true) } {code} Optimize lateral view with explode to not read unnecessary columns -- Key: SPARK-6489 URL: https://issues.apache.org/jira/browse/SPARK-6489 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Konstantin Shaposhnikov Currently a query with lateral view explode(...) results in an execution plan that reads all columns of the underlying RDD. E.g. given *ppl* table is DF created from Person case class: {code} case class Person(val name: String, val age: Int, val data: Array[Int]) {code} the following SQL: {code} select name, sum(d) from ppl lateral view explode(data) d as d group by name {code} executes as follows: {noformat} == Physical Plan == Aggregate false, [name#0], [name#0,SUM(PartialSum#38L) AS _c1#18L] Exchange (HashPartitioning [name#0], 200) Aggregate true, [name#0], [name#0,SUM(CAST(d#21, LongType)) AS PartialSum#38L] Project [name#0,d#21] Generate explode(data#2), true, false InMemoryColumnarTableScan [name#0,age#1,data#2], [], (InMemoryRelation [name#0,age#1,data#2], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35),
[jira] [Commented] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile
[ https://issues.apache.org/jira/browse/SPARK-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377264#comment-14377264 ] Pei-Lun Lee commented on SPARK-6352: The above PR adds a new Hadoop config value, spark.sql.parquet.output.committer.class, to let users select the output committer class used when saving Parquet files, similar to SPARK-3595. The base class is ParquetOutputCommitter. A DirectParquetOutputCommitter is also added for file systems like S3. Supporting non-default OutputCommitter when using saveAsParquetFile --- Key: SPARK-6352 URL: https://issues.apache.org/jira/browse/SPARK-6352 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.2.1, 1.3.0 Reporter: Pei-Lun Lee SPARK-3595 only handles a custom OutputCommitter for saveAsHadoopFile; it would be nice to have similar behavior in saveAsParquetFile. It may be difficult to have a fully customizable OutputCommitter solution, but at least adding something like a DirectParquetOutputCommitter and letting users choose between it and the default should be enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
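To make the intended usage concrete, a sketch of how the new setting might be applied; the committer's fully qualified class name below is inferred from the comment and should be treated as an assumption, not a confirmed API:
{code}
// Route Parquet writes through a direct committer (useful for S3-like file systems).
sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

df.saveAsParquetFile("s3n://my-bucket/output.parquet")   // bucket/path are placeholders
{code}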
[jira] [Created] (SPARK-6484) Ganglia metrics xml reporter doesn't escape correctly
Michael Armbrust created SPARK-6484: --- Summary: Ganglia metrics xml reporter doesn't escape correctly Key: SPARK-6484 URL: https://issues.apache.org/jira/browse/SPARK-6484 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Michael Armbrust Assignee: Josh Rosen Priority: Critical The following should be escaped: {code} " -> &quot; ' -> &apos; < -> &lt; > -> &gt; & -> &amp; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
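A minimal sketch of the kind of escaping the reporter needs, not the actual reporter code; the ampersand must be handled first so the other replacements are not double-escaped:
{code}
def escapeXml(s: String): String =
  s.replace("&", "&amp;")
   .replace("<", "&lt;")
   .replace(">", "&gt;")
   .replace("\"", "&quot;")
   .replace("'", "&apos;")

escapeXml("""metric name="a<b" value='x&y'""")
{code}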
[jira] [Commented] (SPARK-5368) Spark should support NAT (via akka improvements)
[ https://issues.apache.org/jira/browse/SPARK-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377106#comment-14377106 ] jay vyas commented on SPARK-5368: - looks like this is subsumed maybe by the work going on in SPARK-5113 's doc improvements? Either way, I propose to close this JIRA, i think the problem is solved and the doc improvement was just optional. Spark should support NAT (via akka improvements) - Key: SPARK-5368 URL: https://issues.apache.org/jira/browse/SPARK-5368 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: jay vyas Fix For: 1.2.2 Spark sets up actors for akka with a set of variables which are defined in the {{AkkaUtils.scala}} class. A snippet: {noformat} 98 |akka.loggers = [akka.event.slf4j.Slf4jLogger] 99 |akka.stdout-loglevel = ERROR 100 |akka.jvm-exit-on-fatal-error = off 101 |akka.remote.require-cookie = $requireCookie 102 |akka.remote.secure-cookie = $secureCookie {noformat} We should allow users to pass in custom settings, for example, so that arbitrary akka modifications can be used at runtime for security, performance, logging, and so on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6430) Cannot resolve column correctly when using left semi join
[ https://issues.apache.org/jira/browse/SPARK-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377139#comment-14377139 ] zzc commented on SPARK-6430: what's wrong with this? Cannot resolve column correctlly when using left semi join -- Key: SPARK-6430 URL: https://issues.apache.org/jira/browse/SPARK-6430 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Spark 1.3.0 on yarn mode Reporter: zzc My code: {quote} case class TestData(key: Int, value: String) case class TestData2(a: Int, b: Int) import org.apache.spark.sql.execution.joins._ import sqlContext.implicits._ val testData = sc.parallelize( (1 to 100).map(i = TestData(i, i.toString))).toDF() testData.registerTempTable(testData) val testData2 = sc.parallelize( TestData2(1, 1) :: TestData2(1, 2) :: TestData2(2, 1) :: TestData2(2, 2) :: TestData2(3, 1) :: TestData2(3, 2) :: Nil, 2).toDF() testData2.registerTempTable(testData2) //val tmp = sqlContext.sql(SELECT * FROM testData *LEFT SEMI JOIN* testData2 ON key = a ) val tmp = sqlContext.sql(SELECT testData2.b, count(testData2.b) FROM testData *LEFT SEMI JOIN* testData2 ON key = testData2.a group by testData2.b) tmp.explain() {quote} Error log: {quote} org.apache.spark.sql.AnalysisException: cannot resolve 'testData2.b' given input columns key, value; line 1 pos 108 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) {quote} {quote}SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a{quote} is correct, {quote} SELECT a FROM testData LEFT SEMI JOIN testData2 ON key = a SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = a group by b SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = testData2.a group by testData2.b SELECT testData2.b, count(testData2.b) FROM testData LEFT SEMI JOIN testData2 ON key = testData2.a group by testData2.b {quote} are incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
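The resolution failures above follow from the join semantics: a LEFT SEMI JOIN exposes only the left-hand table's columns, so testData2.b is simply not in scope after the join. If columns from testData2 are needed in the output, a regular join expresses that query (sketch; note the result can differ from a semi join when a key repeats on the right side):
{code}
val fixed = sqlContext.sql(
  """SELECT testData2.b, count(testData2.b)
    |FROM testData JOIN testData2 ON key = testData2.a
    |GROUP BY testData2.b""".stripMargin)
fixed.explain()
{code}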
[jira] [Commented] (SPARK-3720) support ORC in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377155#comment-14377155 ] iward commented on SPARK-3720: -- [~zhanzhang], I see. Since the patch is delayed, we can't use the orcFile API in Spark currently. But the problem of reading whole files is urgent; do we have another way to solve this in Spark? support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: Fei Wang Attachments: orc.diff The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as: 1. a single file as the output of each task, which reduces the NameNode's load; 2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union); 3. light-weight indexes stored within the file (skip row groups that don't pass predicate filtering, seek to a given row); 4. block-mode compression based on data type (run-length encoding for integer columns, dictionary encoding for string columns); 5. concurrent reads of the same file using separate RecordReaders; 6. ability to split files without scanning for markers; 7. a bound on the amount of memory needed for reading or writing; 8. metadata stored using Protocol Buffers, which allows addition and removal of fields. Spark SQL already supports Parquet; supporting ORC would give people more options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3735) Sending the factor directly or AtA based on the cost in ALS
[ https://issues.apache.org/jira/browse/SPARK-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377200#comment-14377200 ] Xiangrui Meng commented on SPARK-3735: -- The proposal is actually something different. For example, in a single block if we have 1000 items that are rated by all users, we need to send those 1000 item factors to all user in-blocks. The shuffle size is 1000*k*numUserBlocks. However, if we only send the partial AtA and Atb generated by those items, the shuffle size is k*(k+1)*numUserBlocks, which is much smaller if k < 1000. This is orthogonal to explicit/implicit feedback models. Sending the factor directly or AtA based on the cost in ALS --- Key: SPARK-3735 URL: https://issues.apache.org/jira/browse/SPARK-3735 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is common to have some super popular products in the dataset. In this case, sending many user factors to the target product block could be more expensive than sending the normal equation `\sum_i u_i u_i^T` and `\sum_i u_i r_ij` to the product block. The cost of sending a single factor is `k`, while the cost of sending a normal equation is much more expensive, `k * (k + 3) / 2`. However, if we use normal equations for all products associated with a user, we don't need to send this user factor. Determining the optimal assignment is hard. But we could use a simple heuristic. Inside any rating block, 1) order the product ids by the number of user ids associated with them in desc order 2) starting from the most popular product, mark popular products as "use normal eq" and calculate the cost. Remember the best assignment that comes with the lowest cost and use it for computation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
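Plugging illustrative numbers into the two expressions in the comment makes the gap concrete (k = 10, 1000 popular items, 100 user blocks; the values are made up purely for the arithmetic):
{code}
val k = 10L
val popularItems = 1000L
val numUserBlocks = 100L

val sendFactors = popularItems * k * numUserBlocks   // 1,000,000 values shuffled
val sendNormalEq = k * (k + 1) * numUserBlocks       // 11,000 values shuffled

sendFactors.toDouble / sendNormalEq                  // roughly 90x smaller
{code}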
[jira] [Commented] (SPARK-3278) Isotonic regression
[ https://issues.apache.org/jira/browse/SPARK-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377205#comment-14377205 ] Xiangrui Meng commented on SPARK-3278: -- Did you try truncating the digits of x to reduce the number of possible buckets? If the loss of precision is not super important, this could help scalability. Isotonic regression --- Key: SPARK-3278 URL: https://issues.apache.org/jira/browse/SPARK-3278 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Martin Zapletal Fix For: 1.3.0 Add isotonic regression for score calibration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd
[ https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377206#comment-14377206 ] Apache Spark commented on SPARK-6464: - User 'SaintBacchus' has created a pull request for this issue: https://github.com/apache/spark/pull/5152 Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd --- Key: SPARK-6464 URL: https://issues.apache.org/jira/browse/SPARK-6464 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: SaintBacchus Attachments: screenshot-1.png Nowadays, the *coalesce* transformation is commonly used to expand or reduce the number of partitions in order to gain good performance. But *coalesce* cannot guarantee that a child partition will be executed on the same executor as its parent partitions, which can lead to large network transfers. In the scenario mentioned in the title, +small and cached rdd+, we want to coalesce all partitions on the same executor into one partition and make sure the child partition is executed on that executor. This avoids network transfer, reduces task scheduling overhead, and frees CPU cores for other work. In this scenario our performance improved by 20%. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhang JiaJin updated SPARK-6487: Description: [~mengxr] [~zhangyouhua] Sequential pattern mining is an important branch of pattern mining. In our past work we used sequence mining (mainly the PrefixSpan algorithm) to find telecommunication signaling sequence patterns and achieved good results. But once the data grows too large, the running time becomes too long and can no longer meet the service requirements. We plan to implement the PrefixSpan algorithm in Spark and apply it to our subsequent work. Related paper: Distributed PrefixSpan algorithm based on MapReduce. was: Sequential pattern mining is an important branch of pattern mining. In our past work we used sequence mining (mainly the PrefixSpan algorithm) to find telecommunication signaling sequence patterns and achieved good results. But once the data grows too large, the running time becomes too long and can no longer meet the service requirements. We plan to implement the PrefixSpan algorithm in Spark and apply it to our subsequent work. Related paper: Distributed PrefixSpan algorithm based on MapReduce. Add sequential pattern mining algorithm to Spark MLlib -- Key: SPARK-6487 URL: https://issues.apache.org/jira/browse/SPARK-6487 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Zhang JiaJin [~mengxr] [~zhangyouhua] Sequential pattern mining is an important branch of pattern mining. In our past work we used sequence mining (mainly the PrefixSpan algorithm) to find telecommunication signaling sequence patterns and achieved good results. But once the data grows too large, the running time becomes too long and can no longer meet the service requirements. We plan to implement the PrefixSpan algorithm in Spark and apply it to our subsequent work. Related paper: Distributed PrefixSpan algorithm based on MapReduce. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6449) Driver OOM results in reported application result SUCCESS
[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams resolved SPARK-6449. -- Resolution: Implemented Fix Version/s: 1.3.0 Driver OOM results in reported application result SUCCESS - Key: SPARK-6449 URL: https://issues.apache.org/jira/browse/SPARK-6449 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Reporter: Ryan Williams Fix For: 1.3.0 I ran a job yesterday that according to the History Server and YARN RM finished with status {{SUCCESS}}. Clicking around on the history server UI, there were too few stages run, and I couldn't figure out why that would have been. Finally, inspecting the end of the driver's logs, I saw: {code} 15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext Exception in thread Driver scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit exceeded (of class java.lang.OutOfMemoryError) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485) 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0, (reason: Shutdown hook called before final status was reported.) 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before final status was reported.) 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down. 15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered. 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1426705269584_0055 {code} The driver OOM'd, [the {{catch}} block that presumably should have caught it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484] threw a {{MatchError}}, and then {{SUCCESS}} was returned to YARN and written to the event log. This should be logged as a failed job and reported as such to YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
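The failure mode is a non-exhaustive pattern match on the caught Throwable, which itself throws scala.MatchError and lets the shutdown hook report SUCCEEDED. A generic sketch of the fix shape, not the actual ApplicationMaster code:
{code}
// Sketch: a catch-all arm ensures fatal errors such as OutOfMemoryError map to FAILED.
def runDriver(userMain: () => Unit): String =
  try {
    userMain()
    "SUCCEEDED"
  } catch {
    case _: InterruptedException => "KILLED"
    case e: Throwable =>
      System.err.println(s"User class threw: $e")
      "FAILED"
  }

runDriver(() => throw new OutOfMemoryError("GC overhead limit exceeded"))   // => "FAILED"
{code}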
[jira] [Updated] (SPARK-5692) Model import/export for Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5692: - Assignee: Manoj Kumar (was: ANUPAM MEDIRATTA) Model import/export for Word2Vec Key: SPARK-5692 URL: https://issues.apache.org/jira/browse/SPARK-5692 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Manoj Kumar Support save and load for Word2VecModel. We may want to discuss whether we want to be compatible with the original Word2Vec model storage format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377266#comment-14377266 ] Xiangrui Meng commented on SPARK-5692: -- [~anupamme] You should get familiar with Scala and Spark development first before working on specific JIRAs. I've assigned this ticket to [~MechCoder]. There are many open JIRAs for MLlib. Once you are familiar with Scala/Spark, feel free to ping me on a JIRA that you are interested in. Model import/export for Word2Vec Key: SPARK-5692 URL: https://issues.apache.org/jira/browse/SPARK-5692 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: ANUPAM MEDIRATTA Support save and load for Word2VecModel. We may want to discuss whether we want to be compatible with the original Word2Vec model storage format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
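For orientation, the interface this subtask is after would presumably follow the save/load convention used by other MLlib models; the Word2VecModel signatures below are an assumption of this sketch, not an existing API (corpus is an RDD[Seq[String]] assumed to exist):
{code}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val model = new Word2Vec().fit(corpus)
model.save(sc, "hdfs:///models/word2vec")          // intended interface, per this JIRA
val restored = Word2VecModel.load(sc, "hdfs:///models/word2vec")
restored.findSynonyms("spark", 5)
{code}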
[jira] [Updated] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods
[ https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6189: Component/s: (was: DataFrame) Pandas to DataFrame conversion should check field names for periods --- Key: SPARK-6189 URL: https://issues.apache.org/jira/browse/SPARK-6189 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Issue I ran into: I imported an R dataset in CSV format into a Pandas DataFrame and then use toDF() to convert that into a Spark DataFrame. The R dataset had a column with a period in it (column GNP.deflator in the longley dataset). When I tried to select it using the Spark DataFrame DSL, I could not because the DSL thought the period was selecting a field within GNP. Also, since GNP is another field's name, it gives an error which could be obscure to users, complaining: {code} org.apache.spark.sql.AnalysisException: GetField is not valid on fields of type DoubleType; {code} We should either handle periods in column names or check during loading and warn/fail gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
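A possible workaround until field names are checked, sketched under the assumption that df is the DataFrame converted from pandas: re-alias every column positionally with toDF so the dotted name never has to pass through the column parser that mistakes GNP.deflator for a field access on GNP.
{code}
val sanitized = df.toDF(df.columns.map(_.replace(".", "_")): _*)
sanitized.select("GNP_deflator").show()
{code}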
[jira] [Updated] (SPARK-5919) Enable broadcast joins for Parquet files
[ https://issues.apache.org/jira/browse/SPARK-5919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5919: Component/s: (was: DataFrame) SQL Enable broadcast joins for Parquet files Key: SPARK-5919 URL: https://issues.apache.org/jira/browse/SPARK-5919 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1 Reporter: Dima Zhiyanov Unable to perform broadcast join of Schema RDDs created from Parquet files. Computing statistics is only available for real Hive tables, and it is not always convenient to create a Hive table for every Parquet file The issue is discussed here http://apache-spark-user-list.1001560.n3.nabble.com/How-to-do-broadcast-join-in-SparkSQL-td15298.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6229) Support SASL encryption in network/common module
[ https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377102#comment-14377102 ] Marcelo Vanzin commented on SPARK-6229: --- Hi, me again. So I finally got back to actually playing with the code today, and I was trying out what it would take to implement my suggestion. While I think it's worthwhile to abstract as much set up as possible behind the library code, that change in itself is becoming too large and kinda taking away from the focus of the change (adding encryption). So at this point I think it would be easier to go with something smaller that, while not optimal in my view, at least is more targeted, and we can do cleanup, if desired, separately. So my current plan is to go with something not entirely unlike what Aaron is saying. * On the client side, TransportClientBootstrap would expose the channel to the bootstrap implementation, so that the SASL bootstrap can, after negotiation, insert itself into the pipeline to perform encryption. * On the server side, I'll probably need to build something similar to TransportClientBootstrap. I haven't really looked at what the code would look like, but this will most probably require changes to all call sites that use SaslRpcHandler at the moment. So hopefully this will be a much smaller change that is also easier to review. Support SASL encryption in network/common module Key: SPARK-6229 URL: https://issues.apache.org/jira/browse/SPARK-6229 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Marcelo Vanzin After SASL support has been added to network/common, supporting encryption should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. Since the latter requires a valid kerberos login to work (and so doesn't really work with executors), encryption would require the use of DIGEST-MD5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
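To illustrate the "insert itself into the pipeline" idea, a rough Netty sketch; the handler and method names here are hypothetical, and only the javax.security.sasl and Netty calls shown are real APIs. A matching decoder would be needed on the inbound side to unwrap frames.
{code}
import java.util.{List => JList}
import io.netty.buffer.{ByteBuf, Unpooled}
import io.netty.channel.{Channel, ChannelHandlerContext}
import io.netty.handler.codec.MessageToMessageEncoder
import javax.security.sasl.SaslClient

class SaslEncryptHandler(sasl: SaslClient) extends MessageToMessageEncoder[ByteBuf] {
  override protected def encode(ctx: ChannelHandlerContext, msg: ByteBuf, out: JList[AnyRef]): Unit = {
    val plain = new Array[Byte](msg.readableBytes())
    msg.readBytes(plain)
    // wrap() applies the negotiated integrity/confidentiality layer to the outgoing frame
    out.add(Unpooled.wrappedBuffer(sasl.wrap(plain, 0, plain.length)))
  }
}

// After the SASL handshake succeeds, the client bootstrap could do something like:
def installEncryption(channel: Channel, sasl: SaslClient): Unit =
  channel.pipeline().addFirst("saslEncryption", new SaslEncryptHandler(sasl))
{code}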
[jira] [Updated] (SPARK-1684) Merge script should standardize SPARK-XXX prefix
[ https://issues.apache.org/jira/browse/SPARK-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michelle Casbon updated SPARK-1684: --- Attachment: spark_pulls_before_after.txt Test data (spark_pulls_before_after.txt): titles from existing pull requests in their original form and after being modified with the new function. It's not perfect, but it cleans up most of the common problems. Merge script should standardize SPARK-XXX prefix Key: SPARK-1684 URL: https://issues.apache.org/jira/browse/SPARK-1684 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Minor Labels: starter Attachments: spark_pulls_before_after.txt If users write "[SPARK-XXX] Issue" or "SPARK-XXX. Issue" or "SPARK XXX: Issue" we should convert it to "SPARK-XXX: Issue" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377197#comment-14377197 ] Xiangrui Meng commented on SPARK-6192: -- Thanks for the update! The current version looks good to me. Please keep me updated on key events. Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6334) spark-local dir not getting cleared during ALS
[ https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6334. -- Resolution: Duplicate SPARK-5955 was merged. So if you can use the latest master, you can set checkpoint interval to control the shuffle files. I'm closing this issue since there exists a workaround and it is fixed in master. spark-local dir not getting cleared during ALS -- Key: SPARK-6334 URL: https://issues.apache.org/jira/browse/SPARK-6334 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Antony Mayi Attachments: als-diskusage.png when running bigger ALS training spark spills loads of temp data into the local-dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running on YARN from cdh 5.3.2) eventually causing all the disks of all nodes running out of space (in my case I have 12TB of available disk capacity before kicking off the ALS but it all gets used (and yarn kills the containers when reaching 90%). even with all recommended options (configuring checkpointing and forcing GC when possible) it still doesn't get cleared. here is my (pseudo)code (pyspark): {code} sc.setCheckpointDir('/tmp') training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK) model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40) sc._jvm.System.gc() {code} the training RDD has about 3.5 billions of items (~60GB on disk). after about 6 hours the ALS will consume all 12TB of disk space in local-dir data and gets killed. my cluster has 192 cores, 1.5TB RAM and for this task I am using 37 executors of 4 cores/28+4GB RAM each. this is the graph of disk consumption pattern showing the space being all eaten from 7% to 90% during the ALS (90% is when YARN kills the container): !als-diskusage.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
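For readers hitting the same disk blow-up, a sketch of the workaround the resolution points to, assuming a build that includes SPARK-5955 (the checkpointInterval parameter on the ml ALS); the parameter values mirror the reporter's job, and ratingsDF is a DataFrame of user/item/rating columns assumed to exist:
{code}
import org.apache.spark.ml.recommendation.ALS

sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")   // checkpoint dir is a placeholder

val als = new ALS()
  .setRank(50)
  .setMaxIter(15)
  .setRegParam(0.1)
  .setImplicitPrefs(true)
  .setAlpha(40.0)
  .setCheckpointInterval(2)   // checkpoint every 2 iterations so old shuffle files can be cleaned

val model = als.fit(ratingsDF)
{code}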