[jira] [Commented] (SPARK-11125) Unreadable exception when running spark-sql without building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set
[ https://issues.apache.org/jira/browse/SPARK-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958441#comment-14958441 ] Apache Spark commented on SPARK-11125: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/9134 > Unreadable exception when running spark-sql without building with > -Phive-thriftserver and SPARK_PREPEND_CLASSES is set > -- > > Key: SPARK-11125 > URL: https://issues.apache.org/jira/browse/SPARK-11125 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Priority: Minor > > In development environment, building spark without -Phive-thriftserver and > SPARK_PREPEND_CLASSES is set. The following exception is thrown. > SparkSQLCliDriver can be loaded but hive related code could not be loaded. > {code} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/hadoop/hive/cli/CliDriver > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:800) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) > at java.net.URLClassLoader.access$100(URLClassLoader.java:71) > at java.net.URLClassLoader$1.run(URLClassLoader.java:361) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) > at java.lang.ClassLoader.loadClass(ClassLoader.java:412) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:270) > at org.apache.spark.util.Utils$.classForName(Utils.scala:173) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.hadoop.hive.cli.CliDriver > at java.net.URLClassLoader$1.run(URLClassLoader.java:366) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > ... 21 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
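Editor's note: the pull request itself is not reproduced in this thread, so the snippet below is only a sketch of the general shape such a fix could take, probing for the Hive CLI class named in the stack trace and printing a readable hint instead of letting the raw NoClassDefFoundError escape. The object and method names are hypothetical, not the code in the PR.
{code}
object HiveCliClasspathCheck {
  // Hypothetical helper, not the code in the PR: probe for the Hive CLI class
  // named in the stack trace before handing control to SparkSQLCLIDriver, so
  // the user sees a readable hint instead of a bare NoClassDefFoundError.
  def requireHiveCli(): Unit = {
    try {
      Class.forName("org.apache.hadoop.hive.cli.CliDriver")
    } catch {
      case _: ClassNotFoundException | _: NoClassDefFoundError =>
        System.err.println(
          "Hive CLI classes not found on the classpath. Rebuild Spark with " +
          "-Phive -Phive-thriftserver (and refresh the SPARK_PREPEND_CLASSES " +
          "build output) before running spark-sql.")
        System.exit(1)
    }
  }

  def main(args: Array[String]): Unit = requireHiveCli()
}
{code}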
[jira] [Assigned] (SPARK-11125) Unreadable exception when running spark-sql without building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set
[ https://issues.apache.org/jira/browse/SPARK-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11125: Assignee: Apache Spark > Unreadable exception when running spark-sql without building with > -Phive-thriftserver and SPARK_PREPEND_CLASSES is set > -- > > Key: SPARK-11125 > URL: https://issues.apache.org/jira/browse/SPARK-11125 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Assignee: Apache Spark >Priority: Minor > > In development environment, building spark without -Phive-thriftserver and > SPARK_PREPEND_CLASSES is set. The following exception is thrown. > SparkSQLCliDriver can be loaded but hive related code could not be loaded. > {code} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/hadoop/hive/cli/CliDriver > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:800) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) > at java.net.URLClassLoader.access$100(URLClassLoader.java:71) > at java.net.URLClassLoader$1.run(URLClassLoader.java:361) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) > at java.lang.ClassLoader.loadClass(ClassLoader.java:412) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:270) > at org.apache.spark.util.Utils$.classForName(Utils.scala:173) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.hadoop.hive.cli.CliDriver > at java.net.URLClassLoader$1.run(URLClassLoader.java:366) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > ... 21 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11125) Unreadable exception when running spark-sql without building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set
[ https://issues.apache.org/jira/browse/SPARK-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11125: Assignee: (was: Apache Spark) > Unreadable exception when running spark-sql without building with > -Phive-thriftserver and SPARK_PREPEND_CLASSES is set > -- > > Key: SPARK-11125 > URL: https://issues.apache.org/jira/browse/SPARK-11125 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Priority: Minor > > In development environment, building spark without -Phive-thriftserver and > SPARK_PREPEND_CLASSES is set. The following exception is thrown. > SparkSQLCliDriver can be loaded but hive related code could not be loaded. > {code} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/hadoop/hive/cli/CliDriver > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:800) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) > at java.net.URLClassLoader.access$100(URLClassLoader.java:71) > at java.net.URLClassLoader$1.run(URLClassLoader.java:361) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) > at java.lang.ClassLoader.loadClass(ClassLoader.java:412) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:270) > at org.apache.spark.util.Utils$.classForName(Utils.scala:173) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.hadoop.hive.cli.CliDriver > at java.net.URLClassLoader$1.run(URLClassLoader.java:366) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > ... 21 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11097) Add connection established callback to lower level RPC layer so we don't need to check for new connections in NettyRpcHandler.receive
[ https://issues.apache.org/jira/browse/SPARK-11097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958447#comment-14958447 ] Reynold Xin commented on SPARK-11097: - Simplifies code and improves performance (no need to check connection existence for every message). > Add connection established callback to lower level RPC layer so we don't need > to check for new connections in NettyRpcHandler.receive > - > > Key: SPARK-11097 > URL: https://issues.apache.org/jira/browse/SPARK-11097 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin > > I think we can remove the check for new connections in > NettyRpcHandler.receive if we just add a channel registered callback to the > lower level network module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
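Editor's note: as a rough illustration of the proposal, the traits below are simplified stand-ins rather than the actual org.apache.spark.network classes. The point is that the transport layer notifies the handler once per new channel, so the per-message receive path no longer needs a connection-existence check.
{code}
// Simplified stand-ins, not the actual org.apache.spark.network classes.
// The transport layer calls channelRegistered() once per new connection, so the
// per-message receive() path no longer needs an "is this a new client?" check.
trait RpcHandlerSketch {
  def channelRegistered(clientId: String): Unit
  def receive(clientId: String, message: Array[Byte]): Unit
}

class NettyRpcHandlerSketch extends RpcHandlerSketch {
  private val knownClients = scala.collection.concurrent.TrieMap.empty[String, Boolean]

  override def channelRegistered(clientId: String): Unit = {
    // Connection bookkeeping happens exactly once, outside the message hot path.
    knownClients.putIfAbsent(clientId, true)
  }

  override def receive(clientId: String, message: Array[Byte]): Unit = {
    // Dispatch only; the old per-message connection check is gone.
  }
}
{code}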
[jira] [Created] (SPARK-11128) strange NPE when writing in non-existing S3 bucket
mathieu despriee created SPARK-11128: Summary: strange NPE when writing in non-existing S3 bucket Key: SPARK-11128 URL: https://issues.apache.org/jira/browse/SPARK-11128 Project: Spark Issue Type: Bug Affects Versions: 1.5.1 Reporter: mathieu despriee Priority: Minor For the record, as it's relatively minor, and related to s3n (not tested with s3a). By mistake, we tried writing a parquet dataframe to a non-existing s3 bucket, with a simple df.write.parquet(s3path). We got a NPE (see stack trace below), which is very misleading. java.lang.NullPointerException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:73) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
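Editor's note: a possible caller-side workaround sketch, assuming a DataFrame {{df}} and a path {{s3path}} as in the description above. It does not change the underlying s3n behaviour; it only rewraps the opaque NullPointerException with a message that names the likely cause.
{code}
// Caller-side workaround sketch; assumes a DataFrame `df` and a path `s3path`
// as in the description. It only rewraps the NPE with a clearer message.
try {
  df.write.parquet(s3path)
} catch {
  case npe: NullPointerException =>
    throw new java.io.IOException(
      s"Failed to write to $s3path; if the target S3 bucket does not exist, " +
        "the s3n filesystem can fail with a bare NullPointerException. " +
        "Verify the bucket before writing.", npe)
}
{code}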
[jira] [Commented] (SPARK-10893) Lag Analytic function broken
[ https://issues.apache.org/jira/browse/SPARK-10893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958520#comment-14958520 ] Herman van Hovell commented on SPARK-10893: --- A bug was found in the Window implementation. It has been fixed in the current master: https://github.com/apache/spark/commit/6987c067937a50867b4d5788f5bf496ecdfdb62c Could you try out the latest master and see if this is resolved? > Lag Analytic function broken > > > Key: SPARK-10893 > URL: https://issues.apache.org/jira/browse/SPARK-10893 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.5.0 > Environment: Spark Standalone Cluster on Linux >Reporter: Jo Desmet > > Trying to aggregate with the LAG Analytic function gives the wrong result. In > my testcase it was always giving the fixed value '103079215105' when I tried > to run on an integer. > Note that this only happens on Spark 1.5.0, and only when running in cluster > mode. > It works fine when running on Spark 1.4.1, or when running in local mode. > I did not test on a yarn cluster. > I did not test other analytic aggregates. > Input Jason: > {code:borderStyle=solid|title=/home/app/input.json} > {"VAA":"A", "VBB":1} > {"VAA":"B", "VBB":-1} > {"VAA":"C", "VBB":2} > {"VAA":"d", "VBB":3} > {"VAA":null, "VBB":null} > {code} > Java: > {code:borderStyle=solid} > SparkContext sc = new SparkContext(conf); > HiveContext sqlContext = new HiveContext(sc); > DataFrame df = sqlContext.read().json("file:///home/app/input.json"); > > df = df.withColumn( > "previous", > lag(dataFrame.col("VBB"), 1) > .over(Window.orderBy(dataFrame.col("VAA"))) > ); > {code} > Important to understand the conditions under which the job ran, I submitted > to a standalone spark cluster in client mode as follows: > {code:borderStyle=solid} > spark-submit \ > --master spark:\\xx:7077 \ > --deploy-mode client \ > --class package.to.DriverClass \ > --driver-java-options -Dhdp.version=2.2.0.0–2041 \ > --num-executors 2 \ > --driver-memory 2g \ > --executor-memory 2g \ > --executor-cores 2 \ > /path/to/sample-program.jar > {code} > Expected Result: > {code:borderStyle=solid} > {"VAA":null, "VBB":null, "previous":null} > {"VAA":"A", "VBB":1, "previous":null} > {"VAA":"B", "VBB":-1, "previous":1} > {"VAA":"C", "VBB":2, "previous":-1} > {"VAA":"d", "VBB":3, "previous":2} > {code} > Actual Result: > {code:borderStyle=solid} > {"VAA":null, "VBB":null, "previous":103079215105} > {"VAA":"A", "VBB":1, "previous":103079215105} > {"VAA":"B", "VBB":-1, "previous":103079215105} > {"VAA":"C", "VBB":2, "previous":103079215105} > {"VAA":"d", "VBB":3, "previous":103079215105} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
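Editor's note: the Java snippet in the report mixes {{df}} and {{dataFrame}}. For anyone re-running the scenario against the fixed master, here is a self-contained Scala equivalent of the same LAG-over-window query; the input path is taken from the report and is assumed to exist.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

object LagRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-10893 repro"))
    val sqlContext = new HiveContext(sc)

    // Same input as the report; adjust the path to wherever input.json lives.
    val df = sqlContext.read.json("file:///home/app/input.json")

    // LAG(VBB, 1) over rows ordered by VAA -- the column the report shows
    // coming back as 103079215105 on a 1.5.0 standalone cluster.
    val withPrevious = df.withColumn(
      "previous",
      lag(df("VBB"), 1).over(Window.orderBy(df("VAA"))))

    withPrevious.show()
    sc.stop()
  }
}
{code}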
[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates
[ https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958524#comment-14958524 ] Cheng Hao commented on SPARK-4226: -- [~nadenf] Actually I am working on it right now, and the first PR is ready, it will be great appreciated if you can try https://github.com/apache/spark/pull/9055 in your local testing, let me know if there any problem or bug you found. > SparkSQL - Add support for subqueries in predicates > --- > > Key: SPARK-4226 > URL: https://issues.apache.org/jira/browse/SPARK-4226 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 > Environment: Spark 1.2 snapshot >Reporter: Terry Siu > > I have a test table defined in Hive as follows: > {code:sql} > CREATE TABLE sparkbug ( > id INT, > event STRING > ) STORED AS PARQUET; > {code} > and insert some sample data with ids 1, 2, 3. > In a Spark shell, I then create a HiveContext and then execute the following > HQL to test out subquery predicates: > {code} > val hc = HiveContext(hc) > hc.hql("select customerid from sparkbug where customerid in (select > customerid from sparkbug where customerid in (2,3))") > {code} > I get the following error: > {noformat} > java.lang.RuntimeException: Unsupported language features in query: select > customerid from sparkbug where customerid in (select customerid from sparkbug > where customerid in (2,3)) > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > sparkbug > TOK_INSERT > TOK_DESTINATION > TOK_DIR > TOK_TMP_FILE > TOK_SELECT > TOK_SELEXPR > TOK_TABLE_OR_COL > customerid > TOK_WHERE > TOK_SUBQUERY_EXPR > TOK_SUBQUERY_OP > in > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > sparkbug > TOK_INSERT > TOK_DESTINATION > TOK_DIR > TOK_TMP_FILE > TOK_SELECT > TOK_SELEXPR > TOK_TABLE_OR_COL > customerid > TOK_WHERE > TOK_FUNCTION > in > TOK_TABLE_OR_COL > customerid > 2 > 3 > TOK_TABLE_OR_COL > customerid > scala.NotImplementedError: No parse rules for ASTNode type: 817, text: > TOK_SUBQUERY_EXPR : > TOK_SUBQUERY_EXPR > TOK_SUBQUERY_OP > in > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > sparkbug > TOK_INSERT > TOK_DESTINATION > TOK_DIR > TOK_TMP_FILE > TOK_SELECT > TOK_SELEXPR > TOK_TABLE_OR_COL > customerid > TOK_WHERE > TOK_FUNCTION > in > TOK_TABLE_OR_COL > customerid > 2 > 3 > TOK_TABLE_OR_COL > customerid > " + > > org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098) > > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49) > at > scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > {noformat} > [This > thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html] > also brings up lack of subquery support in SparkSQL. It would be nice to > have subquery predicate support in a near, future release (1.3, maybe?). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
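Editor's note: until subquery predicates land, the usual workaround is to rewrite the IN-subquery as a LEFT SEMI JOIN, which the HiveQL parser does accept. A sketch against the query from the report, assuming the same {{HiveContext}} {{hc}} and the column name used in that query:
{code}
// Workaround sketch: express the IN-subquery as a LEFT SEMI JOIN.
// Assumes the HiveContext `hc` and the column name from the report's query.
val rewritten = hc.sql("""
  SELECT s.customerid
  FROM sparkbug s
  LEFT SEMI JOIN (
    SELECT customerid FROM sparkbug WHERE customerid IN (2, 3)
  ) t ON s.customerid = t.customerid
""")
rewritten.show()
{code}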
[jira] [Commented] (SPARK-6065) Optimize word2vec.findSynonyms speed
[ https://issues.apache.org/jira/browse/SPARK-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958585#comment-14958585 ] Franck Zhang commented on SPARK-6065: -- When I used the same dataset (text8, around 100 MB) and the same training parameters, Python ran about 10x faster than Spark on my notebook (2015 MacBook Pro 15"). I think the word2vec model in Spark still has a long way to go ... > Optimize word2vec.findSynonyms speed > > > Key: SPARK-6065 > URL: https://issues.apache.org/jira/browse/SPARK-6065 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar > Fix For: 1.4.0 > > > word2vec.findSynonyms iterates through the entire vocabulary to find similar > words. This is really slow relative to the [gcode-hosted word2vec > implementation | https://code.google.com/p/word2vec/]. It should be > optimized by storing words in a datastructure designed for finding nearest > neighbors. > This would require storing a copy of the model (basically an inverted > dictionary), which could be a problem if users have a big model (e.g., 100 > features x 10M words or phrases = big dictionary). It might be best to > provide a function for converting the model into a model optimized for > findSynonyms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
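Editor's note: the issue description explains that findSynonyms scans the whole vocabulary. The plain-Scala sketch below (not MLlib code) shows that brute-force baseline, the O(vocabulary) cosine-similarity scan that a nearest-neighbour index would avoid.
{code}
// Plain-Scala illustration (not MLlib code) of the brute-force baseline the
// description talks about: every word vector is scored against the query, so
// the cost grows linearly with the vocabulary.
def findSynonymsBruteForce(
    query: Array[Double],
    vectors: Map[String, Array[Double]],
    k: Int): Seq[(String, Double)] = {

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  val qNorm = norm(query)
  vectors.toSeq
    .map { case (word, vec) => word -> dot(query, vec) / (qNorm * norm(vec) + 1e-12) }
    .sortBy(-_._2)  // the full scan plus sort is the expensive part
    .take(k)
}
{code}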
[jira] [Commented] (SPARK-11128) strange NPE when writing in non-existing S3 bucket
[ https://issues.apache.org/jira/browse/SPARK-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958756#comment-14958756 ] Sean Owen commented on SPARK-11128: --- Is this a Spark problem? sounds like an issue between Hadoop and S3, and is ultimately due to bad input. > strange NPE when writing in non-existing S3 bucket > -- > > Key: SPARK-11128 > URL: https://issues.apache.org/jira/browse/SPARK-11128 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: mathieu despriee >Priority: Minor > > For the record, as it's relatively minor, and related to s3n (not tested with > s3a). > By mistake, we tried writing a parquet dataframe to a non-existing s3 bucket, > with a simple df.write.parquet(s3path). > We got a NPE (see stack trace below), which is very misleading. > java.lang.NullPointerException > at > org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:73) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns
[ https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958758#comment-14958758 ] Herman van Hovell commented on SPARK-9241: -- We could implement this using GROUPING SETS. That is how they did it in Calcite: https://issues.apache.org/jira/browse/CALCITE-732 For example using the following data: {noformat} // Create random data similar to the Calcite query. val df = sqlContext .range(1 << 20) .select( $"id".as("employee_id"), (rand(6321782L) * 4 + 1).cast("int").as("department_id"), when(rand(981293L) >= 0.5, "M").otherwise("F").as("gender"), (rand(7123L) * 3 + 1).cast("int").as("education_level") ) df.registerTempTable("employee") {noformat} We can query multiple distinct counts the regular way: {noformat} sql(""" select department_id as d, count(distinct gender, education_level) as c0, count(distinct gender) as c1, count(distinct education_level) as c2 from employee group by department_id """).show() {noformat} This uses the old code path: {noformat} == Physical Plan == Limit 21 Aggregate false, [department_id#64556], [department_id#64556 AS d#64595,CombineAndCount(partialSets#64599) AS c0#64596L,CombineAndCount(partialSets#64600) AS c1#64597L,CombineAndCount(partialSets#64601) AS c2#64598L] Exchange hashpartitioning(department_id#64556,200) Aggregate true, [department_id#64556], [department_id#64556,AddToHashSet(gender#64557,education_level#64558) AS partialSets#64599,AddToHashSet(gender#64557) AS partialSets#64600,AddToHashSet(education_level#64558) AS partialSets#64601] ConvertToSafe TungstenProject [department_id#64556,gender#64557,education_level#64558] TungstenProject [id#64554L AS employee_id#64555L,cast(((rand(6321782) * 4.0) + 1.0) as int) AS department_id#64556,CASE WHEN (rand(981293) >= 0.5) THEN M ELSE F AS gender#64557,cast(((rand(7123) * 3.0) + 1.0) as int) AS education_level#64558] Scan PhysicalRDD[id#64554L] {noformat} Or we can do this using grouping sets: {noformat} sql(""" select A.d, count(case A.i when 3 then 1 else null end) as c0, count(case A.i when 5 then 1 else null end) as c1, count(case A.i when 7 then 1 else null end) as c2 from (select department_id as d, grouping__id as i from employee group by department_id, gender, education_level grouping sets ( (department_id, gender), (department_id, education_level), (department_id, gender, education_level))) A group by A.d """).show {noformat} And use the new tungsten-based code path (except for the Expand operator): {noformat} == Physical Plan == TungstenAggregate(key=[d#64577], functions=[(count(CASE i#64578 WHEN 3 THEN 1 ELSE null),mode=Final,isDistinct=false),(count(CASE i#64578 WHEN 5 THEN 1 ELSE null),mode=Final,isDistinct=false),(count(CASE i#64578 WHEN 7 THEN 1 ELSE null),mode=Final,isDistinct=false)], output=[d#64577,c0#64579L,c1#64580L,c2#64581L]) TungstenExchange hashpartitioning(d#64577,200) TungstenAggregate(key=[d#64577], functions=[(count(CASE i#64578 WHEN 3 THEN 1 ELSE null),mode=Partial,isDistinct=false),(count(CASE i#64578 WHEN 5 THEN 1 ELSE null),mode=Partial,isDistinct=false),(count(CASE i#64578 WHEN 7 THEN 1 ELSE null),mode=Partial,isDistinct=false)], output=[d#64577,currentCount#64587L,currentCount#64589L,currentCount#64591L]) TungstenAggregate(key=[department_id#64556,gender#64557,education_level#64558,grouping__id#64582], functions=[], output=[d#64577,i#64578]) TungstenExchange hashpartitioning(department_id#64556,gender#64557,education_level#64558,grouping__id#64582,200) 
TungstenAggregate(key=[department_id#64556,gender#64557,education_level#64558,grouping__id#64582], functions=[], output=[department_id#64556,gender#64557,education_level#64558,grouping__id#64582]) Expand [ArrayBuffer(department_id#64556, gender#64557, null, 3),ArrayBuffer(department_id#64556, null, education_level#64558, 5),ArrayBuffer(department_id#64556, gender#64557, education_level#64558, 7)], [department_id#64556,gender#64557,education_level#64558,grouping__id#64582] ConvertToSafe TungstenProject [department_id#64556,gender#64557,education_level#64558] TungstenProject [id#64554L AS employee_id#64555L,cast(((rand(6321782) * 4.0) + 1.0) as int) AS department_id#64556,CASE WHEN (rand(981293) >= 0.5) THEN M ELSE F AS gender#64557,cast(((rand(7123) * 3.0) + 1.0) as int) AS education_level#64558] Scan PhysicalRDD[id#64554L] {noformat} We could implement this using an analysis rule. [~yhuai] / [~rxin] thoughts? > Supporting multiple DISTINCT columns > > > Key: SPARK-9241 > URL: https://issues.apache.org/jira/browse/SPARK-9241 >
[jira] [Resolved] (SPARK-10528) spark-shell throws java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
[ https://issues.apache.org/jira/browse/SPARK-10528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10528. --- Resolution: Not A Problem This looks like an environment problem. > spark-shell throws java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. > -- > > Key: SPARK-10528 > URL: https://issues.apache.org/jira/browse/SPARK-10528 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.5.0 > Environment: Windows 7 x64 >Reporter: Aliaksei Belablotski >Priority: Minor > > Starting spark-shell throws > java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: > /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw- -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11120) maxNumExecutorFailures defaults to 3 under dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958762#comment-14958762 ] Sean Owen commented on SPARK-11120: --- Is this specific to dynamic allocation though? you could have the same problem without it. > maxNumExecutorFailures defaults to 3 under dynamic allocation > - > > Key: SPARK-11120 > URL: https://issues.apache.org/jira/browse/SPARK-11120 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > With dynamic allocation, the {{spark.executor.instances}} config is 0, > meaning [this > line|https://github.com/apache/spark/blob/4ace4f8a9c91beb21a0077e12b75637a4560a542/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L66-L68] > ends up with {{maxNumExecutorFailures}} equal to {{3}}, which for me has > resulted in large dynamicAllocation jobs with hundreds of executors dying due > to one bad node serially failing executors that are allocated on it. > I think that using {{spark.dynamicAllocation.maxExecutors}} would make most > sense in this case; I frequently run shells that vary between 1 and 1000 > executors, so using {{s.dA.minExecutors}} or {{s.dA.initialExecutors}} would > still leave me with a value that is lower than makes sense. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
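Editor's note: a sketch of the sizing rule proposed in the description, using the configuration keys named there. The defaults and the factor of two are illustrative only, not what an eventual patch must choose.
{code}
import org.apache.spark.SparkConf

// Sketch of the proposal: when dynamic allocation is enabled, derive the
// failure threshold from the allocation ceiling instead of
// spark.executor.instances (which is 0 in that mode). Defaults are illustrative.
def maxNumExecutorFailures(conf: SparkConf): Int = {
  val dynamicAllocation = conf.getBoolean("spark.dynamicAllocation.enabled", false)
  val effectiveExecutors =
    if (dynamicAllocation) conf.getInt("spark.dynamicAllocation.maxExecutors", 0)
    else conf.getInt("spark.executor.instances", 0)
  // Same shape as the ApplicationMaster logic linked in the description:
  // never fewer than 3, otherwise twice the configured executor ceiling.
  math.max(3, math.min(effectiveExecutors, Int.MaxValue / 2) * 2)
}
{code}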
[jira] [Commented] (SPARK-10935) Avito Context Ad Clicks
[ https://issues.apache.org/jira/browse/SPARK-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958823#comment-14958823 ] Kristina Plazonic commented on SPARK-10935: --- @Xusen, I'm almost done - should be done this weekend - but would love to connect with you and get your comments, suggestions and improvements. :) Thanks! > Avito Context Ad Clicks > --- > > Key: SPARK-10935 > URL: https://issues.apache.org/jira/browse/SPARK-10935 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng > > From [~kpl...@gmail.com]: > I would love to do Avito Context Ad Clicks - > https://www.kaggle.com/c/avito-context-ad-clicks - but it involves a lot of > feature engineering and preprocessing. I would love to split this with > somebody else if anybody is interested on working with this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11129) Link Spark WebUI in Mesos WebUI
Philipp Hoffmann created SPARK-11129: Summary: Link Spark WebUI in Mesos WebUI Key: SPARK-11129 URL: https://issues.apache.org/jira/browse/SPARK-11129 Project: Spark Issue Type: New Feature Components: Mesos, Web UI Affects Versions: 1.5.1 Reporter: Philipp Hoffmann Mesos can directly link into WebUIs provided by frameworks running on top of Mesos. Spark currently doesn't make use of this feature. This ticket aims to provide the necessary information to Mesos in order to link back to the Spark WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958841#comment-14958841 ] Dominic Ricard commented on SPARK-11103: Setting the property {{spark.sql.parquet.filterPushdown}} to {{false}} fixed the issue. Knowing all this, does this indicate a bug in the filter2 implementation of the Parquet library? Maybe this issue should be moved to the Parquet project for someone to look at... > Filter applied on Merged Parquet shema with new column fail with > (java.lang.IllegalArgumentException: Column [column_name] was not found in > schema!) > > > Key: SPARK-11103 > URL: https://issues.apache.org/jira/browse/SPARK-11103 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Dominic Ricard > > When evolving a schema in parquet files, spark properly expose all columns > found in the different parquet files but when trying to query the data, it is > not possible to apply a filter on a column that is not present in all files. > To reproduce: > *SQL:* > {noformat} > create table `table1` STORED AS PARQUET LOCATION > 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`; > create table `table2` STORED AS PARQUET LOCATION > 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as > `col2`; > create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path > "hdfs://:/path/to/table"); > select col1 from `table3` where col2 = 2; > {noformat} > The last select will output the following Stack Trace: > {noformat} > An error occurred when executing the SQL command: > select col1 from `table3` where col2 = 2 > [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: > 0, SQL state: TStatus(statusCode:ERROR_STATUS, > infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException: > Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, > most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, > 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not > found in schema! 
> at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59) > at > org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40) > at > org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126) > at > org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46) > at > org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160) > at > org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) > at > org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155) > at > org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.comp
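Editor's note: the workaround named in the comment above can be applied per session or per application. A sketch, assuming an existing {{SQLContext}}/{{HiveContext}} named {{sqlContext}}:
{code}
// Apply the workaround from the comment above for the current session:
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

// ...or for a whole application at submit time:
//   spark-submit --conf spark.sql.parquet.filterPushdown=false ...

// With pushdown disabled, the Parquet row-group filter is no longer handed a
// predicate over a column that is missing from some of the merged files, so
// the "Column [col2] was not found in schema!" failure is avoided, at the cost
// of losing filter pushdown.
{code}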
[jira] [Commented] (SPARK-11129) Link Spark WebUI in Mesos WebUI
[ https://issues.apache.org/jira/browse/SPARK-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958856#comment-14958856 ] Philipp Hoffmann commented on SPARK-11129: -- submitted a pull request > Link Spark WebUI in Mesos WebUI > --- > > Key: SPARK-11129 > URL: https://issues.apache.org/jira/browse/SPARK-11129 > Project: Spark > Issue Type: New Feature > Components: Mesos, Web UI >Affects Versions: 1.5.1 >Reporter: Philipp Hoffmann > > Mesos can directly link into WebUIs provided by frameworks running on top of > Mesos. Spark currently doesn't make use of this feature. > This ticket aims to provide the necessary information to Mesos in order to > link back to the Spark WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11129) Link Spark WebUI in Mesos WebUI
[ https://issues.apache.org/jira/browse/SPARK-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11129: Assignee: (was: Apache Spark) > Link Spark WebUI in Mesos WebUI > --- > > Key: SPARK-11129 > URL: https://issues.apache.org/jira/browse/SPARK-11129 > Project: Spark > Issue Type: New Feature > Components: Mesos, Web UI >Affects Versions: 1.5.1 >Reporter: Philipp Hoffmann > > Mesos can directly link into WebUIs provided by frameworks running on top of > Mesos. Spark currently doesn't make use of this feature. > This ticket aims to provide the necessary information to Mesos in order to > link back to the Spark WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11129) Link Spark WebUI in Mesos WebUI
[ https://issues.apache.org/jira/browse/SPARK-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958858#comment-14958858 ] Apache Spark commented on SPARK-11129: -- User 'philipphoffmann' has created a pull request for this issue: https://github.com/apache/spark/pull/9135 > Link Spark WebUI in Mesos WebUI > --- > > Key: SPARK-11129 > URL: https://issues.apache.org/jira/browse/SPARK-11129 > Project: Spark > Issue Type: New Feature > Components: Mesos, Web UI >Affects Versions: 1.5.1 >Reporter: Philipp Hoffmann > > Mesos can directly link into WebUIs provided by frameworks running on top of > Mesos. Spark currently doesn't make use of this feature. > This ticket aims to provide the necessary information to Mesos in order to > link back to the Spark WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11129) Link Spark WebUI in Mesos WebUI
[ https://issues.apache.org/jira/browse/SPARK-11129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11129: Assignee: Apache Spark > Link Spark WebUI in Mesos WebUI > --- > > Key: SPARK-11129 > URL: https://issues.apache.org/jira/browse/SPARK-11129 > Project: Spark > Issue Type: New Feature > Components: Mesos, Web UI >Affects Versions: 1.5.1 >Reporter: Philipp Hoffmann >Assignee: Apache Spark > > Mesos can directly link into WebUIs provided by frameworks running on top of > Mesos. Spark currently doesn't make use of this feature. > This ticket aims to provide the necessary information to Mesos in order to > link back to the Spark WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11103) Filter applied on Merged Parquet schema with new column fails with (java.lang.IllegalArgumentException: Column [column_name] was not found in schema!)
[ https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958879#comment-14958879 ] Hyukjin Kwon commented on SPARK-11103: -- For me, I think Spark should appropriately set filters for each file, which I think is pretty tricky, or simply prevent filtering for this case. Would anybody give us some feedback please? > Filter applied on Merged Parquet shema with new column fail with > (java.lang.IllegalArgumentException: Column [column_name] was not found in > schema!) > > > Key: SPARK-11103 > URL: https://issues.apache.org/jira/browse/SPARK-11103 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Dominic Ricard > > When evolving a schema in parquet files, spark properly expose all columns > found in the different parquet files but when trying to query the data, it is > not possible to apply a filter on a column that is not present in all files. > To reproduce: > *SQL:* > {noformat} > create table `table1` STORED AS PARQUET LOCATION > 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`; > create table `table2` STORED AS PARQUET LOCATION > 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as > `col2`; > create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path > "hdfs://:/path/to/table"); > select col1 from `table3` where col2 = 2; > {noformat} > The last select will output the following Stack Trace: > {noformat} > An error occurred when executing the SQL command: > select col1 from `table3` where col2 = 2 > [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: > 0, SQL state: TStatus(statusCode:ERROR_STATUS, > infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException: > Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, > most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, > 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not > found in schema! 
> at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59) > at > org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180) > at > org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59) > at > org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40) > at > org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126) > at > org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46) > at > org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160) > at > org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) > at > org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155) > at > org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheck
[jira] [Commented] (SPARK-10729) word2vec model save for python
[ https://issues.apache.org/jira/browse/SPARK-10729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959097#comment-14959097 ] Jian Feng Zhang commented on SPARK-10729: - I can take this if no objections. > word2vec model save for python > -- > > Key: SPARK-10729 > URL: https://issues.apache.org/jira/browse/SPARK-10729 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.4.1, 1.5.0 >Reporter: Joseph A Gartner III > > The ability to save a word2vec model has not been ported to python, and would > be extremely useful to have given the long training period. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10935) Avito Context Ad Clicks
[ https://issues.apache.org/jira/browse/SPARK-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959140#comment-14959140 ] Xusen Yin commented on SPARK-10935: --- OK, ping me if you need help. > Avito Context Ad Clicks > --- > > Key: SPARK-10935 > URL: https://issues.apache.org/jira/browse/SPARK-10935 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng > > From [~kpl...@gmail.com]: > I would love to do Avito Context Ad Clicks - > https://www.kaggle.com/c/avito-context-ad-clicks - but it involves a lot of > feature engineering and preprocessing. I would love to split this with > somebody else if anybody is interested on working with this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959156#comment-14959156 ] Xusen Yin commented on SPARK-5874: -- How about adding a warm-start strategy to ML Estimator? That is, extend its fit method to accept an intermediate model, e.g. fit(data, param, model). > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > I created this JIRA to collect feedbacks about the ML pipeline API we > introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 > with confidence, which requires valuable input from the community. I'll > create sub-tasks for each major issue. > Design doc (WIP): > https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit# -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
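Editor's note: a sketch of what the warm-start proposal above could look like. The traits and signatures below are hypothetical illustrations, not the actual spark.ml API.
{code}
import org.apache.spark.sql.DataFrame

// Hypothetical shape of the warm-start idea; illustrative only, not spark.ml.
trait ModelSketch[M <: ModelSketch[M]]

trait WarmStartEstimator[M <: ModelSketch[M]] {
  /** Ordinary fit, starting from a fresh (e.g. random) initialization. */
  def fit(dataset: DataFrame): M

  /** Warm-start fit: continue training from a previously fitted model. */
  def fit(dataset: DataFrame, initialModel: M): M
}
{code}
A concrete estimator could then feed the intermediate model's parameters to its optimizer as the starting point.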
[jira] [Commented] (SPARK-9941) Try ML pipeline API on Kaggle competitions
[ https://issues.apache.org/jira/browse/SPARK-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959174#comment-14959174 ] Xusen Yin commented on SPARK-9941: -- I'd love to try the cooking dataset: https://www.kaggle.com/c/whats-cooking > Try ML pipeline API on Kaggle competitions > -- > > Key: SPARK-9941 > URL: https://issues.apache.org/jira/browse/SPARK-9941 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > This is an umbrella JIRA to track some fun tasks :) > We have built many features under the ML pipeline API, and we want to see how > it works on real-world datasets, e.g., Kaggle competition datasets > (https://www.kaggle.com/competitions). We want to invite community members to > help test. The goal is NOT to win the competitions but to provide code > examples and to find out missing features and other issues to help shape the > roadmap. > For people who are interested, please do the following: > 1. Create a subtask (or leave a comment if you cannot create a subtask) to > claim a Kaggle dataset. > 2. Use the ML pipeline API to build and tune an ML pipeline that works for > the Kaggle dataset. > 3. Paste the code to gist (https://gist.github.com/) and provide the link > here. > 4. Report missing features, issues, running times, and accuracy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10217) Spark SQL cannot handle ordering directive in ORDER BY clauses with expressions
[ https://issues.apache.org/jira/browse/SPARK-10217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959197#comment-14959197 ] Simeon Simeonov commented on SPARK-10217: - Well, that would suggest the issue is fixed. :) > Spark SQL cannot handle ordering directive in ORDER BY clauses with > expressions > --- > > Key: SPARK-10217 > URL: https://issues.apache.org/jira/browse/SPARK-10217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > Labels: SQL, analyzers > > Spark SQL supports expressions in ORDER BY clauses, e.g., > {code} > scala> sqlContext.sql("select cnt from test order by (cnt + cnt)") > res2: org.apache.spark.sql.DataFrame = [cnt: bigint] > {code} > However, the analyzer gets confused when there is an explicit ordering > directive (ASC/DESC): > {code} > scala> sqlContext.sql("select cnt from test order by (cnt + cnt) asc") > 15/08/25 04:08:02 INFO ParseDriver: Parsing command: select cnt from test > order by (cnt + cnt) asc > org.apache.spark.sql.AnalysisException: extraneous input 'asc' expecting EOF > near ''; line 1 pos 40 > at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:289) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11093) ChildFirstURLClassLoader#getResources should return all found resources, not just those in the child classloader
[ https://issues.apache.org/jira/browse/SPARK-11093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-11093. Resolution: Fixed Assignee: Adam Lewandowski Fix Version/s: 1.6.0 > ChildFirstURLClassLoader#getResources should return all found resources, not > just those in the child classloader > > > Key: SPARK-11093 > URL: https://issues.apache.org/jira/browse/SPARK-11093 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Adam Lewandowski >Assignee: Adam Lewandowski > Fix For: 1.6.0 > > > Currently when using a child-first classloader > (spark.driver|executor.userClassPathFirst = true), the getResources method > does not return any matching resources from the parent classloader if the > child classloader contains any. This is not child-first, it's child-only and > is inconsistent with how the default parent-first classloaders work in the > JDK (all found resources are returned from both classloaders). It is also > inconsistent with how child-first classloaders work in other environments > (Servlet containers, for example). > ChildFirstURLClassLoader#getResources() should return resources found from > both the child and the parent classloaders, placing any found from the child > classloader first. > For reference, the specific use case where I encountered this problem was > running Spark on AWS EMR in a child-first arrangement (due to guava version > conflicts), where Akka's configuration file (reference.conf) was made > available in the parent classloader, but was not visible to the Typesafe > config library which uses Classloader.getResources() on the Thread's context > classloader to find them. This resulted in a fatal error from the Config > library: "com.typesafe.config.ConfigException$Missing: No configuration > setting found for key 'akka.version'" . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
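Editor's note: for illustration, a child-first {{getResources}} honouring the contract described above could look like the sketch below. It is a simplified stand-in, not Spark's actual ChildFirstURLClassLoader: matches from the child loader come first, followed by the parent's, rather than child-only.
{code}
import java.net.{URL, URLClassLoader}
import java.util.Collections
import scala.collection.JavaConverters._

// Simplified child-first loader sketch: getResources returns matches from BOTH
// loaders, with the child's matches first, instead of child-only.
class ChildFirstLoaderSketch(urls: Array[URL], parent: ClassLoader)
  extends URLClassLoader(urls, null) {

  override def getResources(name: String): java.util.Enumeration[URL] = {
    val childUrls  = super.findResources(name).asScala.toSeq  // child matches only
    val parentUrls = parent.getResources(name).asScala.toSeq  // parent matches
    Collections.enumeration((childUrls ++ parentUrls).asJava)
  }
}
{code}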
[jira] [Resolved] (SPARK-11099) Default conf property file is not loaded
[ https://issues.apache.org/jira/browse/SPARK-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-11099. Resolution: Fixed Assignee: Jeff Zhang Fix Version/s: 1.6.0 > Default conf property file is not loaded > - > > Key: SPARK-11099 > URL: https://issues.apache.org/jira/browse/SPARK-11099 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Critical > Fix For: 1.6.0 > > > spark.driver.extraClassPath doesn't take effect in the latest code, and find > the root cause is due to the default conf property file is not loaded > The bug is caused by this code snippet in AbstractCommandBuilder > {code} > Map getEffectiveConfig() throws IOException { > if (effectiveConfig == null) { > if (propertiesFile == null) { > effectiveConfig = conf; // return from here if no propertyFile > is provided > } else { > effectiveConfig = new HashMap<>(conf); > Properties p = loadPropertiesFile();// default propertyFile > will load here > for (String key : p.stringPropertyNames()) { > if (!effectiveConfig.containsKey(key)) { > effectiveConfig.put(key, p.getProperty(key)); > } > } > } > } > return effectiveConfig; > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
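Editor's note: the snippet in the description returns {{conf}} directly whenever no properties file was passed explicitly, so spark-defaults.conf is never merged. Below is a sketch of the corrected flow, written in Scala with simplified helper names and a simplified default-location lookup rather than the launcher's actual Java code: always resolve and load the effective properties file, then let explicitly-set conf values win.
{code}
import java.io.{File, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.Properties
import scala.collection.JavaConverters._

// Sketch only; the real fix lives in the Java launcher's AbstractCommandBuilder.
// Key change: the default spark-defaults.conf is loaded even when no
// --properties-file was given, and explicit conf entries still take precedence.
def effectiveConfig(conf: Map[String, String], propertiesFile: Option[File]): Map[String, String] = {
  val file = propertiesFile.orElse {
    sys.env.get("SPARK_CONF_DIR").map(d => new File(d, "spark-defaults.conf")).filter(_.isFile)
  }
  val fromFile: Map[String, String] = file match {
    case Some(f) =>
      val props = new Properties()
      val in = new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8)
      try props.load(in) finally in.close()
      props.stringPropertyNames().asScala.map(k => k -> props.getProperty(k)).toMap
    case None => Map.empty
  }
  fromFile ++ conf  // explicitly-set conf values win over file entries
}
{code}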
[jira] [Created] (SPARK-11130) TestHive fails on machines with few cores
Marcelo Vanzin created SPARK-11130: -- Summary: TestHive fails on machines with few cores Key: SPARK-11130 URL: https://issues.apache.org/jira/browse/SPARK-11130 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0, 1.6.0 Reporter: Marcelo Vanzin Priority: Minor Filing so it doesn't get lost (again). TestHive.scala has this code: {core} new SparkContext( System.getProperty("spark.sql.test.master", "local[32]"), {core} On machines with less cores, that causes many tests to fail with "unable to allocate memory" errors, because the default page size calculation seems to be based on the machine's core count, and not on the core count specified for the SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11130) TestHive fails on machines with few cores
[ https://issues.apache.org/jira/browse/SPARK-11130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-11130: --- Description: Filing so it doesn't get lost (again). TestHive.scala has this code: {code} new SparkContext( System.getProperty("spark.sql.test.master", "local[32]"), {code} On machines with less cores, that causes many tests to fail with "unable to allocate memory" errors, because the default page size calculation seems to be based on the machine's core count, and not on the core count specified for the SparkContext. was: Filing so it doesn't get lost (again). TestHive.scala has this code: {core} new SparkContext( System.getProperty("spark.sql.test.master", "local[32]"), {core} On machines with less cores, that causes many tests to fail with "unable to allocate memory" errors, because the default page size calculation seems to be based on the machine's core count, and not on the core count specified for the SparkContext. > TestHive fails on machines with few cores > - > > Key: SPARK-11130 > URL: https://issues.apache.org/jira/browse/SPARK-11130 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Filing so it doesn't get lost (again). > TestHive.scala has this code: > {code} > new SparkContext( > System.getProperty("spark.sql.test.master", "local[32]"), > {code} > On machines with less cores, that causes many tests to fail with "unable to > allocate memory" errors, because the default page size calculation seems to > be based on the machine's core count, and not on the core count specified for > the SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
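Editor's note: until the page-size calculation honours the requested core count, the failure can usually be avoided by overriding the relevant system properties before TestHive is initialized. A sketch follows; the first property name comes from the snippet above, the second is the 1.5-era Tungsten page-size setting and is worth double-checking against your branch.
{code}
// Workaround sketch for running the SQL tests on a small machine: set these
// system properties before TestHive is initialized.
System.setProperty("spark.sql.test.master", "local[4]")   // match the real core count
System.setProperty("spark.buffer.pageSize", "4m")         // stop deriving pages from machine cores
{code}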
[jira] [Resolved] (SPARK-10217) Spark SQL cannot handle ordering directive in ORDER BY clauses with expressions
[ https://issues.apache.org/jira/browse/SPARK-10217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10217. --- Resolution: Cannot Reproduce > Spark SQL cannot handle ordering directive in ORDER BY clauses with > expressions > --- > > Key: SPARK-10217 > URL: https://issues.apache.org/jira/browse/SPARK-10217 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > Labels: SQL, analyzers > > Spark SQL supports expressions in ORDER BY clauses, e.g., > {code} > scala> sqlContext.sql("select cnt from test order by (cnt + cnt)") > res2: org.apache.spark.sql.DataFrame = [cnt: bigint] > {code} > However, the analyzer gets confused when there is an explicit ordering > directive (ASC/DESC): > {code} > scala> sqlContext.sql("select cnt from test order by (cnt + cnt) asc") > 15/08/25 04:08:02 INFO ParseDriver: Parsing command: select cnt from test > order by (cnt + cnt) asc > org.apache.spark.sql.AnalysisException: extraneous input 'asc' expecting EOF > near ''; line 1 pos 40 > at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:289) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
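For anyone who still sees the parser error on an affected build, a hedged sketch of the reported behaviour and one possible workaround, assuming a registered table {{test}} with a {{bigint}} column {{cnt}} (names are illustrative):
{code}
// The bare expression parses and sorts fine:
sqlContext.sql("select cnt from test order by (cnt + cnt)")

// Adding ASC/DESC directly after the parenthesized expression is what tripped
// the HiveQl parser in the report. Aliasing the expression in the select list
// and ordering by the alias avoids that shape entirely:
sqlContext.sql("select cnt, (cnt + cnt) as cnt2 from test order by cnt2 asc")
{code}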
[jira] [Created] (SPARK-11131) Worker registration protocol is racy
Marcelo Vanzin created SPARK-11131: -- Summary: Worker registration protocol is racy Key: SPARK-11131 URL: https://issues.apache.org/jira/browse/SPARK-11131 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Marcelo Vanzin Priority: Minor I ran into this while making changes to the new RPC framework. Because the Worker registration protocol is based on sending unrelated messages between Master and Worker, it's possible that another message (e.g. caused by an a app trying to allocate workers) to arrive at the Worker before it knows the Master has registered it. This triggers the following code: {code} case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) => if (masterUrl != activeMasterUrl) { logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.") {code} This may or may not be made worse by SPARK-11098. A simple workaround is to use an {{ask}} instead of a {{send}} for these messages. That should at least narrow the race. Note this is more of a problem in {{local-cluster}} mode, used a lot by unit tests, where Master and Worker instances are coming up as part of the app itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11066) Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler occasionally fails due to j.l.UnsupportedOperationException concerning a finished JobWaiter
[ https://issues.apache.org/jira/browse/SPARK-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11066. --- Resolution: Fixed Fix Version/s: 1.5.2 1.6.0 Issue resolved by pull request 9076 [https://github.com/apache/spark/pull/9076] > Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler > occasionally fails due to j.l.UnsupportedOperationException concerning a > finished JobWaiter > -- > > Key: SPARK-11066 > URL: https://issues.apache.org/jira/browse/SPARK-11066 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core, Tests >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1 > Environment: Multiple OS and platform types. > (Also observed by others, e.g. see External URL) >Reporter: Dr Stephen A Hellberg >Priority: Minor > Fix For: 1.6.0, 1.5.2 > > > The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an inherent > problem: it creates a job for the DAGScheduler comprising multiple (2) tasks, > but whilst the job will fail and a SparkDriverExecutionException will be > returned, a race condition exists as to whether the first task's > (deliberately) thrown exception causes the job to fail - and having its > causing exception set to the DAGSchedulerSuiteDummyException that was thrown > as the setup of the misbehaving test - or second (and subsequent) tasks who > equally end, but have instead the DAGScheduler's legitimate > UnsupportedOperationException (a subclass of RuntimeException) returned > instead as their causing exception. This race condition is likely associated > with the vagaries of processing quanta, and expense of throwing two > exceptions (under interpreter execution) per thread of control; this race is > usually 'won' by the first task throwing the DAGSchedulerDummyException, as > desired (and expected)... but not always. > The problem for the testcase is that the first assertion is largely > concerning the test setup, and doesn't (can't? Sorry, still not a ScalaTest > expert) capture all the causes of SparkDriverExecutionException that can > legitimately arise from a correctly working (not crashed) DAGScheduler. > Arguably, this assertion might test something of the DAGScheduler... but not > all the possible outcomes for a working DAGScheduler. Nevertheless, this > test - when comprising a multiple task job - will report as a failure when in > fact the DAGScheduler is working-as-designed (and not crashed ;-). > Furthermore, the test is already failed before it actually tries to use the > SparkContext a second time (for an arbitrary processing task), which I think > is the real subject of the test? > The solution, I submit, is to ensure that the job is composed of just one > task, and that single task will result in the call to the compromised > ResultHandler causing the test's deliberate exception to be thrown and > exercising the relevant (DAGScheduler) code paths. Given tasks are scoped by > the number of partitions of an RDD, this could be achieved with a single > partitioned RDD (indeed, doing so seems to exercise/would test some default > parallelism support of the TaskScheduler?); the pull request offered, > however, is based on the minimal change of just using a single partition of > the 2 (or more) partition parallelized RDD. 
This will result in scheduling a > job of just one task, one successful task calling the user-supplied > compromised ResultHandler function, which results in failing the job and > unambiguously wrapping our DAGSchedulerSuiteException inside a > SparkDriverExecutionException; there are no other tasks that on running > successfully will find the job failed causing the 'undesired' > UnsupportedOperationException to be thrown instead. This, then, satisfies > the test's setup assertion. > I have tested this hypothesis having parametised the number of partitions, N, > used by the "misbehaved ResultHandler" job and have observed the 1 x > DAGSchedulerSuiteException first, followed by the legitimate N-1 x > UnsupportedOperationExceptions ... what propagates back from the job seems to > simply become the result of the race between task threads and the > intermittent failures observed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11066) Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler occasionally fails due to j.l.UnsupportedOperationException concerning a finished JobWaiter
[ https://issues.apache.org/jira/browse/SPARK-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11066: -- Assignee: Dr Stephen A Hellberg > Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler > occasionally fails due to j.l.UnsupportedOperationException concerning a > finished JobWaiter > -- > > Key: SPARK-11066 > URL: https://issues.apache.org/jira/browse/SPARK-11066 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core, Tests >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1 > Environment: Multiple OS and platform types. > (Also observed by others, e.g. see External URL) >Reporter: Dr Stephen A Hellberg >Assignee: Dr Stephen A Hellberg >Priority: Minor > Fix For: 1.5.2, 1.6.0 > > > The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an inherent > problem: it creates a job for the DAGScheduler comprising multiple (2) tasks, > but whilst the job will fail and a SparkDriverExecutionException will be > returned, a race condition exists as to whether the first task's > (deliberately) thrown exception causes the job to fail - and having its > causing exception set to the DAGSchedulerSuiteDummyException that was thrown > as the setup of the misbehaving test - or second (and subsequent) tasks who > equally end, but have instead the DAGScheduler's legitimate > UnsupportedOperationException (a subclass of RuntimeException) returned > instead as their causing exception. This race condition is likely associated > with the vagaries of processing quanta, and expense of throwing two > exceptions (under interpreter execution) per thread of control; this race is > usually 'won' by the first task throwing the DAGSchedulerDummyException, as > desired (and expected)... but not always. > The problem for the testcase is that the first assertion is largely > concerning the test setup, and doesn't (can't? Sorry, still not a ScalaTest > expert) capture all the causes of SparkDriverExecutionException that can > legitimately arise from a correctly working (not crashed) DAGScheduler. > Arguably, this assertion might test something of the DAGScheduler... but not > all the possible outcomes for a working DAGScheduler. Nevertheless, this > test - when comprising a multiple task job - will report as a failure when in > fact the DAGScheduler is working-as-designed (and not crashed ;-). > Furthermore, the test is already failed before it actually tries to use the > SparkContext a second time (for an arbitrary processing task), which I think > is the real subject of the test? > The solution, I submit, is to ensure that the job is composed of just one > task, and that single task will result in the call to the compromised > ResultHandler causing the test's deliberate exception to be thrown and > exercising the relevant (DAGScheduler) code paths. Given tasks are scoped by > the number of partitions of an RDD, this could be achieved with a single > partitioned RDD (indeed, doing so seems to exercise/would test some default > parallelism support of the TaskScheduler?); the pull request offered, > however, is based on the minimal change of just using a single partition of > the 2 (or more) partition parallelized RDD. 
This will result in scheduling a > job of just one task, one successful task calling the user-supplied > compromised ResultHandler function, which results in failing the job and > unambiguously wrapping our DAGSchedulerSuiteException inside a > SparkDriverExecutionException; there are no other tasks that on running > successfully will find the job failed causing the 'undesired' > UnsupportedOperationException to be thrown instead. This, then, satisfies > the test's setup assertion. > I have tested this hypothesis having parametised the number of partitions, N, > used by the "misbehaved ResultHandler" job and have observed the 1 x > DAGSchedulerSuiteException first, followed by the legitimate N-1 x > UnsupportedOperationExceptions ... what propagates back from the job seems to > simply become the result of the race between task threads and the > intermittent failures observed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
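A self-contained sketch of the idea behind the fix described above (not the actual test code): schedule the job over a single partition of a multi-partition RDD, so exactly one task ever reaches the deliberately misbehaving result handler. The object and application names are made up.
{code}
import org.apache.spark.{SparkConf, SparkContext, SparkDriverExecutionException, TaskContext}

object SinglePartitionJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("sketch"))
    val rdd = sc.parallelize(1 to 10, 2)    // two partitions exist...
    try {
      sc.runJob(
        rdd,
        (ctx: TaskContext, it: Iterator[Int]) => it.sum,
        Seq(0),                             // ...but only partition 0 is scheduled
        (index: Int, result: Int) => throw new RuntimeException("misbehaving handler"))
    } catch {
      // With a single task there is only one failure path, so the cause is always
      // the handler's own exception, never an UnsupportedOperationException from a
      // later task racing against the already-failed JobWaiter.
      case e: SparkDriverExecutionException => println("cause: " + e.getCause)
    } finally {
      sc.stop()
    }
  }
}
{code}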
[jira] [Created] (SPARK-11132) Mean Shift algorithm integration
Beck Gaël created SPARK-11132: - Summary: Mean Shift algorithm integration Key: SPARK-11132 URL: https://issues.apache.org/jira/browse/SPARK-11132 Project: Spark Issue Type: Brainstorming Components: MLlib Reporter: Beck Gaël Priority: Minor I made a version of the clustering algorithm Mean Shift in scala/Spark and would like to contribute if you think that it is a good idea. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11132) Mean Shift algorithm integration
[ https://issues.apache.org/jira/browse/SPARK-11132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959245#comment-14959245 ] Sean Owen commented on SPARK-11132: --- [~Kybe] please have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark regarding if and when algos are integrated into MLlib. What's the case for mean shift? > Mean Shift algorithm integration > > > Key: SPARK-11132 > URL: https://issues.apache.org/jira/browse/SPARK-11132 > Project: Spark > Issue Type: Brainstorming > Components: MLlib >Reporter: Beck Gaël >Priority: Minor > > I made a version of the clustering algorithm Mean Shift in scala/Spark and > would like to contribute if you think that it is a good idea. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11047) Internal accumulators miss the internal flag when replaying events in the history server
[ https://issues.apache.org/jira/browse/SPARK-11047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-11047. - Resolution: Fixed Assignee: Carson Wang Fix Version/s: 1.6.0 1.5.2 > Internal accumulators miss the internal flag when replaying events in the > history server > > > Key: SPARK-11047 > URL: https://issues.apache.org/jira/browse/SPARK-11047 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Carson Wang >Assignee: Carson Wang >Priority: Critical > Fix For: 1.5.2, 1.6.0 > > > Internal accumulators don't write the internal flag to event log. So on the > history server Web UI, all accumulators are not internal. This causes > incorrect peak execution memory and unwanted accumulator table displayed on > the stage page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959273#comment-14959273 ] Ragu Ramaswamy commented on SPARK-4105: --- I get this error consistently when using spark-shell on 1.5.1 (win 7) {code} scala> sc.textFile("README.md", 1).flatMap(x => x.split(" ")).countByValue() {code} Happens for any {code}countByValue/groupByKey/reduceByKey{code} operations. The affected versions tag on this issue mentions 1.2.0, 1.2.1, 1.3.0, 1.4.1 but not 1.5.1 Can someone help me if I am doing something wrong or is this a problem in 1.5.1 also > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Attachments: JavaObjectToSerialize.java, > SparkFailedToUncompressGenerator.scala > > > We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at 
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.sch
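Not an answer to whether 1.5.1 is affected, but since the failure above surfaces in the snappy decompression path, one cheap diagnostic is to rerun the same job with a different block compression codec and see whether the error follows the codec. A sketch in standalone-app style; in spark-shell the same setting can be passed at launch with {{--conf spark.io.compression.codec=lz4}}:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// "lz4" and "lzf" are both built-in alternatives to the default snappy codec.
// If the FAILED_TO_UNCOMPRESS error disappears, the problem is likely in the
// snappy read path or a corrupted fetch being reported through it.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("codec-check")
  .set("spark.io.compression.codec", "lz4")
val sc = new SparkContext(conf)
sc.textFile("README.md", 1).flatMap(_.split(" ")).countByValue()
sc.stop()
{code}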
[jira] [Assigned] (SPARK-10186) Add support for more postgres column types
[ https://issues.apache.org/jira/browse/SPARK-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10186: Assignee: Apache Spark > Add support for more postgres column types > -- > > Key: SPARK-10186 > URL: https://issues.apache.org/jira/browse/SPARK-10186 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov >Assignee: Apache Spark > > The specific observations below are based on Postgres 9.4 tables accessed via > the postgresql-9.4-1201.jdbc41.jar driver. However, based on the behavior, I > would expect the problem to exists for all external SQL databases. > - *json and jsonb columns generate {{java.sql.SQLException: Unsupported type > }}*. While it is reasonable to not support dynamic schema discovery of > JSON columns automatically (it requires two passes over the data), a better > behavior would be to create a String column and return the JSON. > - *Array columns generate {{java.sql.SQLException: Unsupported type 2003}}*. > This is true even for simple types, e.g., {{text[]}}. A better behavior would > be be create an Array column. > - *Custom type columns are mapped to a String column.* This behavior is > harder to understand as the schema of a custom type is fixed and therefore > mappable to a Struct column. The automatic conversion to a string is also > inconsistent when compared to json and array column handling. > The exceptions are thrown by > {{org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100)}} > so this definitely looks like a Spark SQL and not a JDBC problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10186) Add support for more postgres column types
[ https://issues.apache.org/jira/browse/SPARK-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10186: Assignee: (was: Apache Spark) > Add support for more postgres column types > -- > > Key: SPARK-10186 > URL: https://issues.apache.org/jira/browse/SPARK-10186 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > > The specific observations below are based on Postgres 9.4 tables accessed via > the postgresql-9.4-1201.jdbc41.jar driver. However, based on the behavior, I > would expect the problem to exists for all external SQL databases. > - *json and jsonb columns generate {{java.sql.SQLException: Unsupported type > }}*. While it is reasonable to not support dynamic schema discovery of > JSON columns automatically (it requires two passes over the data), a better > behavior would be to create a String column and return the JSON. > - *Array columns generate {{java.sql.SQLException: Unsupported type 2003}}*. > This is true even for simple types, e.g., {{text[]}}. A better behavior would > be be create an Array column. > - *Custom type columns are mapped to a String column.* This behavior is > harder to understand as the schema of a custom type is fixed and therefore > mappable to a Struct column. The automatic conversion to a string is also > inconsistent when compared to json and array column handling. > The exceptions are thrown by > {{org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100)}} > so this definitely looks like a Spark SQL and not a JDBC problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10186) Add support for more postgres column types
[ https://issues.apache.org/jira/browse/SPARK-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959289#comment-14959289 ] Apache Spark commented on SPARK-10186: -- User 'mariusvniekerk' has created a pull request for this issue: https://github.com/apache/spark/pull/9137 > Add support for more postgres column types > -- > > Key: SPARK-10186 > URL: https://issues.apache.org/jira/browse/SPARK-10186 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > > The specific observations below are based on Postgres 9.4 tables accessed via > the postgresql-9.4-1201.jdbc41.jar driver. However, based on the behavior, I > would expect the problem to exists for all external SQL databases. > - *json and jsonb columns generate {{java.sql.SQLException: Unsupported type > }}*. While it is reasonable to not support dynamic schema discovery of > JSON columns automatically (it requires two passes over the data), a better > behavior would be to create a String column and return the JSON. > - *Array columns generate {{java.sql.SQLException: Unsupported type 2003}}*. > This is true even for simple types, e.g., {{text[]}}. A better behavior would > be be create an Array column. > - *Custom type columns are mapped to a String column.* This behavior is > harder to understand as the schema of a custom type is fixed and therefore > mappable to a Struct column. The automatic conversion to a string is also > inconsistent when compared to json and array column handling. > The exceptions are thrown by > {{org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100)}} > so this definitely looks like a Spark SQL and not a JDBC problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
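Until these types are handled natively, a commonly used workaround is to push a cast into the query Spark sends over JDBC so Postgres returns plain text columns that the existing type mapping accepts. A hedged sketch; the URL, table, and column names are invented for illustration:
{code}
// Wrap the real table in a subquery and cast the unsupported json/jsonb and
// array columns to text; they arrive in Spark as StringType and can be parsed
// on the Spark side afterwards.
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret",
  "driver"  -> "org.postgresql.Driver",
  "dbtable" -> "(select id, payload::text as payload, tags::text as tags from events) as t"
)).load()
{code}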
[jira] [Commented] (SPARK-5739) Size exceeds Integer.MAX_VALUE in File Map
[ https://issues.apache.org/jira/browse/SPARK-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959298#comment-14959298 ] Karl D. Gierach commented on SPARK-5739: Is there anyway to increase this block limit? I'm hitting the same issue during a UnionRDD operation. Also, above this issue's state is "resolved" but I'm not sure what the resolution is? > Size exceeds Integer.MAX_VALUE in File Map > -- > > Key: SPARK-5739 > URL: https://issues.apache.org/jira/browse/SPARK-5739 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.1 > Environment: Spark1.1.1 on a cluster with 12 node. Every node with > 128GB RAM, 24 Core. the data is just 40GB, and there is 48 parallel task on a > node. >Reporter: DjvuLee >Priority: Minor > > I just run the kmeans algorithm using a random generate data,but occurred > this problem after some iteration. I try several time, and this problem is > reproduced. > Because the data is random generate, so I guess is there a bug ? Or if random > data can lead to such a scenario that the size is bigger than > Integer.MAX_VALUE, can we check the size before using the file map? > 015-02-11 00:39:36,057 [sparkDriver-akka.actor.default-dispatcher-15] WARN > org.apache.spark.util.SizeEstimator - Failed to check whether > UseCompressedOops is set; assuming yes > [error] (run-main-0) java.lang.IllegalArgumentException: Size exceeds > Integer.MAX_VALUE > java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE > at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:850) > at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105) > at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:86) > at > org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:140) > at > org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:105) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:747) > at > org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:598) > at > org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:869) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79) > at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:68) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) > at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809) > at > org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:270) > at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143) > at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126) > at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338) > at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348) > at KMeansDataGenerator$.main(kmeans.scala:105) > at KMeansDataGenerator.main(kmeans.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55) > at java.lang.reflect.Method.invoke(Method.java:619) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
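On the question above: as far as I know there is no configuration knob for this limit; it comes from a single block being memory-mapped into one ByteBuffer, which cannot exceed Integer.MAX_VALUE bytes (about 2 GB). The usual mitigation is to keep individual partitions well below that size by raising the partition count. A sketch with made-up paths and numbers:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("block-size-sketch"))
val rdd = sc.textFile("/path/to/big/input")            // hypothetical input
// Spread the data over more partitions so no single cached or shuffled block
// approaches the 2 GB ByteBuffer ceiling; the factor of 4 is illustrative.
val smallerBlocks = rdd.repartition(rdd.partitions.length * 4)
{code}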
[jira] [Updated] (SPARK-10943) NullType Column cannot be written to Parquet
[ https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10943: - Description: {code} var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null as comments") {code} //FAIL - Try writing a NullType column (where all the values are NULL) {code} data02.write.parquet("/tmp/test/dataset2") at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in stage 179.0 (TID 39924, 10.0.196.208): org.apache.spark.sql.AnalysisException: Unsupported data type StructField(comments,NullType,true).dataType; at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524) at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312) at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) at org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at org.apache.spark.sql.types.StructType.map(StructType.scala:92) at 
org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305) at org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58) at org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94) at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRe
[jira] [Commented] (SPARK-10943) NullType Column cannot be written to Parquet
[ https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959304#comment-14959304 ] Michael Armbrust commented on SPARK-10943: -- Yeah, parquet doesn't have a concept of null type. I'd probably suggest they case null to a type {{CAST(NULL AS INT)}} if they really want to do this, but really you should just omit the column probably. > NullType Column cannot be written to Parquet > > > Key: SPARK-10943 > URL: https://issues.apache.org/jira/browse/SPARK-10943 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jason Pohl > > {code} > var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null > as comments") > {code} > //FAIL - Try writing a NullType column (where all the values are NULL) > {code} > data02.write.parquet("/tmp/test/dataset2") > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 179.0 (TID 39924, 10.0.196.208): > org.apache.spark.sql.AnalysisException: Unsupported data type > StructField(comments,NullType,true).dataType; > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at org.apache.spark.sql.types.StructType.map(StructType.scala:92) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58) > at > org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288) > at > org.apache.parquet.hadoop.ParquetOutpu
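A sketch of the suggested workaround: give the all-null column a concrete type via a cast so the Parquet schema converter has something to map (or simply leave the column out). The output path is only an example:
{code}
// CAST(NULL AS STRING) turns the NullType column into a nullable string column,
// which Parquet can represent; every value in it is still NULL.
val data02 = sqlContext.sql(
  "select 1 as id, \"cat in the hat\" as text, cast(null as string) as comments")
data02.write.parquet("/tmp/test/dataset2_typed")
{code}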
[jira] [Commented] (SPARK-11058) failed spark job reports on YARN as successful
[ https://issues.apache.org/jira/browse/SPARK-11058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959320#comment-14959320 ] Lan Jiang commented on SPARK-11058: --- Sean, I reproduced this problem after triggering some exceptions in my tasks on purpose. The resource manager UI reports the final status as "succeed", but the job shows up in the "incomplete list" on the Spark history server. I do see the exception thrown by the driver. Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191) As to the second possibility, it might be true. I cannot test that scenario on the existing cluster; I need to launch a new cluster to test it. Will report back. Lan > failed spark job reports on YARN as successful > -- > > Key: SPARK-11058 > URL: https://issues.apache.org/jira/browse/SPARK-11058 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.3.0 > Environment: CDH 5.4 >Reporter: Lan Jiang >Priority: Minor > > I have a spark batch job running on CDH5.4 + Spark 1.3.0. The job is submitted in > “yarn-client” mode. The job itself failed because YARN killed several executor > containers that exceeded the memory limit imposed by YARN. > However, when I went to the YARN resource manager site, it displayed the job > as successful. I found there was an issue reported in JIRA > https://issues.apache.org/jira/browse/SPARK-3627, but it says it was fixed in > Spark 1.2. On the Spark history server, it shows the job as “Incomplete”. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959342#comment-14959342 ] Zhan Zhang commented on SPARK-11087: [~patcharee] I tried a simple case with partition and predicate pushdown, and didn't hit the problem. The predicate is pushed down correctly. I will try to use your same table to see whether it works.
case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])
val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
sc.parallelize(records).toDF().write.format("orc").partitionBy("age").save("peoplePartitioned")
val peoplePartitioned = sqlContext.read.format("orc").load("peoplePartitioned")
peoplePartitioned.registerTempTable("peoplePartitioned")
sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20 and name = 'name_20'").count
sqlContext.sql("SELECT * FROM peoplePartitioned WHERE name = 'name_20' and age = 20").count
2015-10-15 10:40:45 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (LESS_THAN age 15) expr = leaf-0
2015-10-15 10:48:20 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (EQUALS name name_20) expr = leaf-0
sqlContext.sql("SELECT name FROM people WHERE age == 15 and age < 16").count()
2015-10-15 10:58:35 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (EQUALS age 15) leaf-1 = (LESS_THAN age 16)
sqlContext.sql("SELECT name FROM people WHERE age < 15").count()
> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate > - > > Key: SPARK-11087 > URL: https://issues.apache.org/jira/browse/SPARK-11087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: orc file version 0.12 with HIVE_8732 > hive version 1.2.1.2.3.0.0-2557 >Reporter: patcharee >Priority: Minor > > I have an external hive table stored as partitioned orc file (see the table > schema below). I tried to query from the table with where clause> > hiveContext.setConf("spark.sql.orc.filterPushdown", "true") > hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = > 117")). > But from the log file with debug logging level on, the ORC pushdown predicate > was not generated. > Unfortunately my table was not sorted when I inserted the data, but I > expected the ORC pushdown predicate should be generated (because of the where > clause) though > Table schema > > hive> describe formatted 4D; > OK > # col_name data_type comment > > date int > hh int > x int > y int > height float > u float > v float > w float > ph float > phb float > t float > p float > pb float > qvapor float > qgraup float > qnice float > qnrain float > tke_pbl float > el_pbl float > qcloud float > > # Partition Information > # col_name data_type comment > > zone int > z int > year int > month int > > # Detailed Table Information > Database: default > Owner: patcharee
[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959347#comment-14959347 ] Michael Armbrust commented on SPARK-: - Yeah, that Scala code should work. Regarding the Java version, the only difference is the API I have in mind would be {{Encoder.for(MyClass2.class)}}. Passing in an encoder instead of a raw {{Class[_]}} gives us some extra indirection in case we want to support custom encoders some day. I'll add that we can also play reflection tricks in cases where things are not erased for Java, and this is the part of the proposal that is the least thought out at the moment. Any help making this part as powerful/robust as possible would be greatly appreciated. I think that is possible that in the long term we will do as you propose and remake the RDD API as a compatibility layer with the option to infer the encoder based on the class tag. The problem with this being the primary implementation is erasure. {code} scala> import scala.reflect._ scala> classTag[(Int, Int)].erasure.getTypeParameters res0: Array[java.lang.reflect.TypeVariable[Class[_$1]]] forSome { type _$1 } = Array(T1, T2) {code} We've lost the type of {{_1}} and {{_2}} and so we are going to have to fall back on runtime reflection again, per tuple. Where as the encoders that are checked into master could extract primitive int without any additional boxing and encode them directly into tungsten buffers. > RDD-like API on top of Catalyst/DataFrame > - > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > > The RDD API is very flexible, and as a result harder to optimize its > execution in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags should not be required in > the user-facing API. 
> - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line-up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
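To make the erasure point concrete, a small REPL-style illustration that is independent of any Spark API: a ClassTag for a tuple retains only the raw Tuple2 class, so the element types have to be recovered by reflection at runtime.
{code}
import scala.reflect._

val intPair    = classTag[(Int, Int)]
val stringPair = classTag[(String, String)]
// Both tags erase to scala.Tuple2, so they cannot tell the element types apart.
println(intPair.runtimeClass == stringPair.runtimeClass)   // prints: true
{code}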
[jira] [Resolved] (SPARK-10943) NullType Column cannot be written to Parquet
[ https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10943. -- Resolution: Won't Fix > NullType Column cannot be written to Parquet > > > Key: SPARK-10943 > URL: https://issues.apache.org/jira/browse/SPARK-10943 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jason Pohl > > {code} > var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null > as comments") > {code} > //FAIL - Try writing a NullType column (where all the values are NULL) > {code} > data02.write.parquet("/tmp/test/dataset2") > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 179.0 (TID 39924, 10.0.196.208): > org.apache.spark.sql.AnalysisException: Unsupported data type > StructField(comments,NullType,true).dataType; > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > 
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at org.apache.spark.sql.types.StructType.map(StructType.scala:92) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58) > at > org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94) > at > org.apache.spark.sql.execution.datasources.parquet.Parque
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959357#comment-14959357 ] Joseph K. Bradley commented on SPARK-5874: -- That sounds useful, but we should add that support to individual models first before we make it a part of the Estimator abstraction. Only a few models have it currently, so if there are ones you'd prioritize, it'd be great to get your help in adding support. > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > I created this JIRA to collect feedbacks about the ML pipeline API we > introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 > with confidence, which requires valuable input from the community. I'll > create sub-tasks for each major issue. > Design doc (WIP): > https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit# -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9919) Matrices should respect Java's equals and hashCode contract
[ https://issues.apache.org/jira/browse/SPARK-9919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9919: - Assignee: (was: Manoj Kumar) > Matrices should respect Java's equals and hashCode contract > --- > > Key: SPARK-9919 > URL: https://issues.apache.org/jira/browse/SPARK-9919 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Feynman Liang >Priority: Critical > > The contract for Java's Object is that a.equals(b) implies a.hashCode == > b.hashCode. So usually we need to implement both. The problem with hashCode > is that we shouldn't compute it based on all values, which could be very > expensive. You can use the implementation of Vector.hashCode as a template, > but that requires some changes to avoid hash code collisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
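To make the contract concrete, here is an illustrative sketch (not Spark's actual Matrix code): equals compares the full contents, while hashCode touches only a bounded prefix of the values so its cost does not grow with the matrix; equal matrices still get equal hash codes because they share that prefix.
{code}
// Toy dense matrix used only to illustrate the equals/hashCode contract.
class ToyDenseMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) {
  override def equals(other: Any): Boolean = other match {
    case m: ToyDenseMatrix =>
      numRows == m.numRows && numCols == m.numCols &&
        java.util.Arrays.equals(values, m.values)
    case _ => false
  }
  // Hash only the dimensions and the first few entries: work is bounded, and
  // a.equals(b) still implies a.hashCode == b.hashCode because equal matrices
  // share the same prefix.
  override def hashCode(): Int =
    31 * (31 * numRows + numCols) + java.util.Arrays.hashCode(values.take(16))
}
{code}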
[jira] [Commented] (SPARK-11131) Worker registration protocol is racy
[ https://issues.apache.org/jira/browse/SPARK-11131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959378#comment-14959378 ] Apache Spark commented on SPARK-11131: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/9138 > Worker registration protocol is racy > > > Key: SPARK-11131 > URL: https://issues.apache.org/jira/browse/SPARK-11131 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Minor > > I ran into this while making changes to the new RPC framework. Because the > Worker registration protocol is based on sending unrelated messages between > Master and Worker, it's possible that another message (e.g. caused by an a > app trying to allocate workers) to arrive at the Worker before it knows the > Master has registered it. This triggers the following code: > {code} > case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) => > if (masterUrl != activeMasterUrl) { > logWarning("Invalid Master (" + masterUrl + ") attempted to launch > executor.") > {code} > This may or may not be made worse by SPARK-11098. > A simple workaround is to use an {{ask}} instead of a {{send}} for these > messages. That should at least narrow the race. > Note this is more of a problem in {{local-cluster}} mode, used a lot by unit > tests, where Master and Worker instances are coming up as part of the app > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11131) Worker registration protocol is racy
[ https://issues.apache.org/jira/browse/SPARK-11131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11131: Assignee: Apache Spark > Worker registration protocol is racy > > > Key: SPARK-11131 > URL: https://issues.apache.org/jira/browse/SPARK-11131 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > I ran into this while making changes to the new RPC framework. Because the > Worker registration protocol is based on sending unrelated messages between > Master and Worker, it's possible that another message (e.g. caused by an a > app trying to allocate workers) to arrive at the Worker before it knows the > Master has registered it. This triggers the following code: > {code} > case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) => > if (masterUrl != activeMasterUrl) { > logWarning("Invalid Master (" + masterUrl + ") attempted to launch > executor.") > {code} > This may or may not be made worse by SPARK-11098. > A simple workaround is to use an {{ask}} instead of a {{send}} for these > messages. That should at least narrow the race. > Note this is more of a problem in {{local-cluster}} mode, used a lot by unit > tests, where Master and Worker instances are coming up as part of the app > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11131) Worker registration protocol is racy
[ https://issues.apache.org/jira/browse/SPARK-11131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11131: Assignee: (was: Apache Spark) > Worker registration protocol is racy > > > Key: SPARK-11131 > URL: https://issues.apache.org/jira/browse/SPARK-11131 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Minor > > I ran into this while making changes to the new RPC framework. Because the > Worker registration protocol is based on sending unrelated messages between > Master and Worker, it's possible that another message (e.g. caused by an a > app trying to allocate workers) to arrive at the Worker before it knows the > Master has registered it. This triggers the following code: > {code} > case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) => > if (masterUrl != activeMasterUrl) { > logWarning("Invalid Master (" + masterUrl + ") attempted to launch > executor.") > {code} > This may or may not be made worse by SPARK-11098. > A simple workaround is to use an {{ask}} instead of a {{send}} for these > messages. That should at least narrow the race. > Note this is more of a problem in {{local-cluster}} mode, used a lot by unit > tests, where Master and Worker instances are coming up as part of the app > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5739) Size exceeds Integer.MAX_VALUE in File Map
[ https://issues.apache.org/jira/browse/SPARK-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959298#comment-14959298 ] Karl D. Gierach edited comment on SPARK-5739 at 10/15/15 7:06 PM: -- Is there any way to increase this block limit? I'm hitting the same issue during a UnionRDD operation. Also, above, this issue's state is "resolved", but I'm not sure what the resolution is. Maybe a state of "closed" with a reference to the duplicate ticket would make it clearer. was (Author: kgierach): Is there anyway to increase this block limit? I'm hitting the same issue during a UnionRDD operation. Also, above this issue's state is "resolved" but I'm not sure what the resolution is? > Size exceeds Integer.MAX_VALUE in File Map > -- > > Key: SPARK-5739 > URL: https://issues.apache.org/jira/browse/SPARK-5739 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.1.1 > Environment: Spark 1.1.1 on a cluster with 12 nodes. Every node has > 128GB RAM and 24 cores. The data is just 40GB, and there are 48 parallel tasks on a > node. >Reporter: DjvuLee >Priority: Minor > > I just ran the kmeans algorithm on randomly generated data, but this problem occurred > after some iterations. I tried several times, and the problem is reproducible. > Because the data is randomly generated, I wonder whether there is a bug. Or, if random > data can lead to a scenario where the size is bigger than > Integer.MAX_VALUE, can we check the size before using the file map? > 2015-02-11 00:39:36,057 [sparkDriver-akka.actor.default-dispatcher-15] WARN > org.apache.spark.util.SizeEstimator - Failed to check whether > UseCompressedOops is set; assuming yes > [error] (run-main-0) java.lang.IllegalArgumentException: Size exceeds > Integer.MAX_VALUE > java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE > at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:850) > at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105) > at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:86) > at > org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:140) > at > org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:105) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:747) > at > org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:598) > at > org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:869) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79) > at > org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:68) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) > at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809) > at > org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:270) > at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143) > at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126) > at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338) > at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348) > at KMeansDataGenerator$.main(kmeans.scala:105) > at KMeansDataGenerator.main(kmeans.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55) > at java.lang.reflect.Method.invoke(Method.java:619) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
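To the question above about raising the block limit: the 2GB ceiling comes from Java's {{FileChannel.map}} and {{ByteBuffer}}, which index with Int, so it cannot simply be increased; the usual mitigation is to keep any single block (cached partition, shuffle block, broadcast piece) under Integer.MAX_VALUE bytes. A hedged Scala sketch, with a made-up input path and multiplier:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("avoid-2gb-blocks")
val sc = new SparkContext(conf)

// Path and multiplier are illustrative. Raising the partition count keeps the
// serialized size of each cached partition well below Integer.MAX_VALUE, so
// DiskStore/MemoryStore never have to map a region larger than 2GB.
val big = sc.textFile("hdfs:///some/large/input")
val repartitioned = big.repartition(big.partitions.length * 4)
repartitioned.cache()
{code}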
[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns
[ https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959440#comment-14959440 ] Reynold Xin commented on SPARK-9241: Do we have any idea on performance characteristics of this rewrite? IIUC, grouping set's complexity grows exponentially with the number of items in the set? > Supporting multiple DISTINCT columns > > > Key: SPARK-9241 > URL: https://issues.apache.org/jira/browse/SPARK-9241 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Right now the new aggregation code path only support a single distinct column > (you can use it in multiple aggregate functions in the query). We need to > support multiple distinct columns by generating a different plan for handling > multiple distinct columns (without change aggregate functions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
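For reference, the shape of query under discussion, written against a hypothetical {{payroll}} table in a spark-shell ({{sqlContext}} in scope): several aggregates over different DISTINCT columns in one GROUP BY, which the single-distinct code path cannot plan today.
{code}
// Hypothetical table and column names; only the query shape matters here.
sqlContext.sql("""
  SELECT dept,
         COUNT(DISTINCT employee_id) AS distinct_employees,
         COUNT(DISTINCT project_id)  AS distinct_projects,
         SUM(salary)                 AS total_salary
  FROM payroll
  GROUP BY dept
""").show()
{code}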
[jira] [Commented] (SPARK-8658) AttributeReference equals method only compare name, exprId and dataType
[ https://issues.apache.org/jira/browse/SPARK-8658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959451#comment-14959451 ] Michael Armbrust commented on SPARK-8658: - There is no query that exposes the problem, as it's an internal quirk. The {{equals}} method should check all of the specified fields for equality. Today it is missing some. > AttributeReference equals method only compare name, exprId and dataType > --- > > Key: SPARK-8658 > URL: https://issues.apache.org/jira/browse/SPARK-8658 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Antonio Jesus Navarro > > The AttributeReference "equals" method only treats objects as different when they have a > different name, expression id or dataType. With this behavior, when I try to > do a "transformExpressionsDown" and transform qualifiers inside > "AttributeReferences", these objects are not replaced, because the > transformer considers them equal. > I propose adding these variables to the "equals" method: > name, dataType, nullable, metadata, exprId, qualifiers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
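A simplified sketch of what "check all of the specified fields" means in practice. This is not the actual Catalyst class (AttributeReference is a case class with more machinery, and the field types below are stand-ins); it only illustrates an equals/hashCode pair covering the six fields the reporter lists, so that two references differing only in qualifiers no longer compare as equal.
{code}
// Simplified stand-in types; Catalyst's real DataType, Metadata and ExprId are not used here.
case class ExprId(id: Long)

class AttributeRef(
    val name: String,
    val dataType: String,
    val nullable: Boolean,
    val metadata: Map[String, String],
    val exprId: ExprId,
    val qualifiers: Seq[String]) {

  override def equals(other: Any): Boolean = other match {
    case a: AttributeRef =>
      name == a.name && dataType == a.dataType && nullable == a.nullable &&
        metadata == a.metadata && exprId == a.exprId && qualifiers == a.qualifiers
    case _ => false
  }

  // Keep hashCode consistent with equals (the contract discussed in SPARK-9919 above).
  override def hashCode(): Int =
    Seq(name, dataType, nullable, metadata, exprId, qualifiers)
      .foldLeft(17)((h, field) => 31 * h + field.hashCode())
}
{code}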
[jira] [Resolved] (SPARK-11039) Document all UI "retained*" configurations
[ https://issues.apache.org/jira/browse/SPARK-11039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-11039. Resolution: Fixed Fix Version/s: 1.5.2 1.6.0 Issue resolved by pull request 9052 [https://github.com/apache/spark/pull/9052] > Document all UI "retained*" configurations > -- > > Key: SPARK-11039 > URL: https://issues.apache.org/jira/browse/SPARK-11039 > Project: Spark > Issue Type: Documentation > Components: Documentation, Web UI >Affects Versions: 1.5.1 >Reporter: Nick Pritchard >Priority: Trivial > Fix For: 1.6.0, 1.5.2 > > > Most are documented except these: > - spark.sql.ui.retainedExecutions > - spark.streaming.ui.retainedBatches > They are really helpful for managing the memory usage of the driver > application. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
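An example of the settings being documented, with arbitrary values; the two spark.ui.* keys were already documented, and the other two are the ones this ticket adds. Lowering them bounds how much completed job, execution, and batch state the driver's UI keeps in memory.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "200")               // previously documented
  .set("spark.ui.retainedStages", "200")             // previously documented
  .set("spark.sql.ui.retainedExecutions", "50")      // documented by this ticket
  .set("spark.streaming.ui.retainedBatches", "100")  // documented by this ticket
{code}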
[jira] [Updated] (SPARK-11039) Document all UI "retained*" configurations
[ https://issues.apache.org/jira/browse/SPARK-11039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11039: --- Assignee: Nick Pritchard > Document all UI "retained*" configurations > -- > > Key: SPARK-11039 > URL: https://issues.apache.org/jira/browse/SPARK-11039 > Project: Spark > Issue Type: Documentation > Components: Documentation, Web UI >Affects Versions: 1.5.1 >Reporter: Nick Pritchard >Assignee: Nick Pritchard >Priority: Trivial > Fix For: 1.5.2, 1.6.0 > > > Most are documented except these: > - spark.sql.ui.retainedExecutions > - spark.streaming.ui.retainedBatches > They are really helpful for managing the memory usage of the driver > application. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5657) Add PySpark Avro Output Format example
[ https://issues.apache.org/jira/browse/SPARK-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5657. --- Resolution: Won't Fix > Add PySpark Avro Output Format example > -- > > Key: SPARK-5657 > URL: https://issues.apache.org/jira/browse/SPARK-5657 > Project: Spark > Issue Type: Improvement > Components: Examples, PySpark >Affects Versions: 1.2.0 >Reporter: Stanislav Los > > There is an Avro Input Format example that shows how to read Avro data in > PySpark, but nothing shows how to write from PySpark to Avro. The main > challenge is that a Converter needs an Avro schema to build a record, but the current > Spark API doesn't provide a way to supply extra parameters to custom > converters. A workaround is possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6488: --- Assignee: Apache Spark (was: Mike Dusenberry) > Support addition/multiplication in PySpark's BlockMatrix > > > Key: SPARK-6488 > URL: https://issues.apache.org/jira/browse/SPARK-6488 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Apache Spark > > This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We > should reuse the Scala implementation instead of having a separate > implementation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959534#comment-14959534 ] Apache Spark commented on SPARK-6488: - User 'dusenberrymw' has created a pull request for this issue: https://github.com/apache/spark/pull/9139 > Support addition/multiplication in PySpark's BlockMatrix > > > Key: SPARK-6488 > URL: https://issues.apache.org/jira/browse/SPARK-6488 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Mike Dusenberry > > This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We > should reuse the Scala implementation instead of having a separate > implementation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6488: --- Assignee: Mike Dusenberry (was: Apache Spark) > Support addition/multiplication in PySpark's BlockMatrix > > > Key: SPARK-6488 > URL: https://issues.apache.org/jira/browse/SPARK-6488 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Mike Dusenberry > > This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We > should reuse the Scala implementation instead of having a separate > implementation in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
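The Scala implementation that the PySpark wrapper is meant to delegate to already exposes both operations; a small spark-shell sketch ({{sc}} in scope, block contents arbitrary):
{code}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// Two 4x4 matrices stored as 2x2 blocks on the block diagonal; values are arbitrary.
val blocksA = sc.parallelize(Seq(
  ((0, 0), Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))),
  ((1, 1), Matrices.dense(2, 2, Array(5.0, 6.0, 7.0, 8.0)))))
val blocksB = sc.parallelize(Seq(
  ((0, 0), Matrices.dense(2, 2, Array(1.0, 1.0, 1.0, 1.0))),
  ((1, 1), Matrices.dense(2, 2, Array(2.0, 2.0, 2.0, 2.0)))))

val a = new BlockMatrix(blocksA, 2, 2)
val b = new BlockMatrix(blocksB, 2, 2)

val sum     = a.add(b)       // element-wise addition, matching block structure required
val product = a.multiply(b)  // distributed block matrix multiplication
{code}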
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959564#comment-14959564 ] Xusen Yin commented on SPARK-5874: -- I'd love to add support to individual models first. But since there are many estimators in the ML package now, I think we'd better add an umbrella JIRA to control the process. Can I create a new JIRA subtask under this JIRA? > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > I created this JIRA to collect feedbacks about the ML pipeline API we > introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 > with confidence, which requires valuable input from the community. I'll > create sub-tasks for each major issue. > Design doc (WIP): > https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit# -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959571#comment-14959571 ] Pratik Khadloya commented on SPARK-2984: Am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as table ( saveAsTable ) using SaveMode.Overwrite. {code} 15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for [flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: [BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp} 15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet (inode 2376521862): File does not exist. Holder DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any open files. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) {code} > FileNotFoundException on _temporary directory > - > > Key: SPARK-2984 > URL: https://issues.apache.org/jira/browse/SPARK-2984 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.3.0 > > > We've seen several stacktraces and threads on the user mailing list where > people are having issues with a {{FileNotFoundException}} stemming from an > HDFS path containing {{_temporary}}. > I ([~aash]) think this may be related to {{spark.speculation}}. I think the > error condition might manifest in this circumstance: > 1) task T starts on a executor E1 > 2) it takes a long time, so task T' is started on another executor E2 > 3) T finishes in E1 so moves its data from {{_temporary}} to the final > destination and deletes the {{_temporary}} directory during cleanup > 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but > those files no longer exist! exception > Some samples: > {noformat} > 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job > 140774430 ms.0 > java.io.FileNotFoundException: File > hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 > does not exist. 
> at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) > at > org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136) > at > org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126) > at > org.apache.spark.rdd.
[jira] [Comment Edited] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959571#comment-14959571 ] Pratik Khadloya edited comment on SPARK-2984 at 10/15/15 8:40 PM: -- Am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as table ( saveAsTable ) using SaveMode.Overwrite. {code} 15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for [flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: [BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp} 15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet (inode 2376521862): File does not exist. Holder DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any open files. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) {code} Also, i am not running in speculative mode. {code} .set("spark.speculation", "false") {code} was (Author: tispratik): Am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as table ( saveAsTable ) using SaveMode.Overwrite. {code} 15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for [flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: [BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp} 15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet (inode 2376521862): File does not exist. Holder DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any open files. 
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) {code} Also, i am not running in speculative mode. .set("spark.speculation", "false") > FileNotFoundException on _temporary directory > ---
[jira] [Comment Edited] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959571#comment-14959571 ] Pratik Khadloya edited comment on SPARK-2984 at 10/15/15 8:39 PM: -- Am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as table ( saveAsTable ) using SaveMode.Overwrite. {code} 15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for [flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: [BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp} 15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet (inode 2376521862): File does not exist. Holder DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any open files. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) {code} Also, i am not running in speculative mode. .set("spark.speculation", "false") was (Author: tispratik): Am seeing the same issue on Spark 1.4.1. I am trying to save a dataframe as table ( saveAsTable ) using SaveMode.Overwrite. {code} 15/10/15 16:19:57 INFO hadoop.ColumnChunkPageWriteStore: written 1,508B for [flight_id] INT64: 1,142 values, 1,441B raw, 1,464B comp, 1 pages, encodings: [BIT_PACKED, RLE, PLAIN_DICTIONARY], dic { 563 entries, 4,504B raw, 563B comp} 15/10/15 16:19:57 WARN hdfs.DFSClient: DataStreamer Exception org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /warehouse/hive-user-tables/agg_imps_pratik/_temporary/0/_temporary/attempt_201510151447_0002_m_23_0/part-r-00023-6778754e-ac4d-44ef-8ee8-fc87e89639bc.gz.parquet (inode 2376521862): File does not exist. Holder DFSClient_attempt_201510151447_0002_m_23_0_529613711_162 does not have any open files. 
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3083) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2885) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2767) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:606) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:455) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) {code} > FileNotFoundException on _temporary directory > - > > Key: SPARK-2984 > URL: https://issue
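For context, a hedged Scala sketch of the write path described in the comments above, with speculation explicitly disabled (the app name, table name, and placeholder DataFrame are made up). With speculation on, two attempts of the same task can race on the shared _temporary directory, which is the failure mode this ticket describes; the comment above notes that a similar error can also appear with speculation off.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("overwrite-table-example")
  .set("spark.speculation", "false")  // no duplicate task attempts writing into _temporary

val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// Tiny placeholder DataFrame; the real job produced a large aggregate instead.
val df = sc.parallelize(Seq((1L, 10.0), (2L, 20.0))).toDF("flight_id", "value")
df.write.mode(SaveMode.Overwrite).saveAsTable("agg_imps_example")
{code}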
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959592#comment-14959592 ] Xusen Yin commented on SPARK-5874: -- Sure I'll do it. > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > I created this JIRA to collect feedbacks about the ML pipeline API we > introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 > with confidence, which requires valuable input from the community. I'll > create sub-tasks for each major issue. > Design doc (WIP): > https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit# -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959599#comment-14959599 ] Zhan Zhang commented on SPARK-11087: [~patcharee] I try to duplicate your table as much as possible, but still didn't hit the problem. Please refer to the below for the details. case class record(date: Int, hh: Int, x: Int, y: Int, height: Float, u: Float, w: Float, ph: Float, phb: Float, t: Float, p: Float, pb: Float, tke_pbl: Float, el_pbl: Float, qcloud: Float, zone: Int, z: Int, year: Int, month: Int) val records = (1 to 100).map { i => record(i.toInt, i.toInt, i.toInt, i.toInt, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toInt, i.toInt, i.toInt, i.toInt) } sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("5D") sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").partitionBy("zone","z","year","month").save("4D") val test = sqlContext.read.format("orc").load("4D") 2503 test.registerTempTable("4D") 2504 sqlContext.setConf("spark.sql.orc.filterPushdown", "true") 2505 sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 4D where x = 320 and y = 117 and zone == 2 and year=2 and z >= 2 and z <= 8").show 2015-10-15 13:37:45 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (EQUALS x 320) leaf-1 = (EQUALS y 117) expr = (and leaf-0 leaf-1) 2507 sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 5D where x = 321 and y = 118 and zone == 2 and year=2 and z >= 2 and z <= 8").show 2015-10-15 13:40:06 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (EQUALS x 321) leaf-1 = (EQUALS y 118) expr = (and leaf-0 leaf-1) > spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate > - > > Key: SPARK-11087 > URL: https://issues.apache.org/jira/browse/SPARK-11087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: orc file version 0.12 with HIVE_8732 > hive version 1.2.1.2.3.0.0-2557 >Reporter: patcharee >Priority: Minor > > I have an external hive table stored as partitioned orc file (see the table > schema below). I tried to query from the table with where clause> > hiveContext.setConf("spark.sql.orc.filterPushdown", "true") > hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = > 117")). > But from the log file with debug logging level on, the ORC pushdown predicate > was not generated. > Unfortunately my table was not sorted when I inserted the data, but I > expected the ORC pushdown predicate should be generated (because of the where > clause) though > Table schema > > hive> describe formatted 4D; > OK > # col_namedata_type comment > > date int > hhint > x int > y int > heightfloat > u float > v float > w float > phfloat > phb float > t float > p float > pbfloat > qvaporfloat > qgraupfloat > qnice float > qnrainfloat > tke_pbl float > el_pblfloat > qcloudfloat > > # Partition Information > # col_namedata_type comment > > zone int > z int > year int
[jira] [Comment Edited] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959599#comment-14959599 ] Zhan Zhang edited comment on SPARK-11087 at 10/15/15 8:58 PM: -- [~patcharee] I try to duplicate your table as much as possible, but still didn't hit the problem. Note that the query has to include some valid record in the partition. Otherwise, the partition pruning will trim all predicate before hitting the orc scan. Please refer to the below for the details. case class record(date: Int, hh: Int, x: Int, y: Int, height: Float, u: Float, w: Float, ph: Float, phb: Float, t: Float, p: Float, pb: Float, tke_pbl: Float, el_pbl: Float, qcloud: Float, zone: Int, z: Int, year: Int, month: Int) val records = (1 to 100).map { i => record(i.toInt, i.toInt, i.toInt, i.toInt, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toInt, i.toInt, i.toInt, i.toInt) } sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("5D") sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").partitionBy("zone","z","year","month").save("4D") val test = sqlContext.read.format("orc").load("4D") test.registerTempTable("4D") sqlContext.setConf("spark.sql.orc.filterPushdown", "true") sqlContext.setConf("spark.sql.orc.filterPushdown", "true") sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 4D where x = and y = 117 and zone == 2 and year=2 and z >= 2 and z <= 8").show 2015-10-15 13:37:45 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (EQUALS x 320) leaf-1 = (EQUALS y 117) expr = (and leaf-0 leaf-1) 2507 sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 5D where x = 321 and y = 118 and zone == 2 and year=2 and z >= 2 and z <= 8").show 2015-10-15 13:40:06 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (EQUALS x 321) leaf-1 = (EQUALS y 118) expr = (and leaf-0 leaf-1) was (Author: zzhan): [~patcharee] I try to duplicate your table as much as possible, but still didn't hit the problem. Please refer to the below for the details. 
case class record(date: Int, hh: Int, x: Int, y: Int, height: Float, u: Float, w: Float, ph: Float, phb: Float, t: Float, p: Float, pb: Float, tke_pbl: Float, el_pbl: Float, qcloud: Float, zone: Int, z: Int, year: Int, month: Int) val records = (1 to 100).map { i => record(i.toInt, i.toInt, i.toInt, i.toInt, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toFloat, i.toInt, i.toInt, i.toInt, i.toInt) } sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("5D") sc.parallelize(records).toDF().write.format("org.apache.spark.sql.hive.orc.DefaultSource").partitionBy("zone","z","year","month").save("4D") val test = sqlContext.read.format("orc").load("4D") 2503 test.registerTempTable("4D") 2504 sqlContext.setConf("spark.sql.orc.filterPushdown", "true") 2505 sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 4D where x = 320 and y = 117 and zone == 2 and year=2 and z >= 2 and z <= 8").show 2015-10-15 13:37:45 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (EQUALS x 320) leaf-1 = (EQUALS y 117) expr = (and leaf-0 leaf-1) 2507 sqlContext.sql("select date, month, year, hh, u*0.9122461, u*-0.40964267, z from 5D where x = 321 and y = 118 and zone == 2 and year=2 and z >= 2 and z <= 8").show 2015-10-15 13:40:06 OrcInputFormat [INFO] ORC pushdown predicate: leaf-0 = (EQUALS x 321) leaf-1 = (EQUALS y 118) expr = (and leaf-0 leaf-1) > spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate > - > > Key: SPARK-11087 > URL: https://issues.apache.org/jira/browse/SPARK-11087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: orc file version 0.12 with HIVE_8732 > hive version 1.2.1.2.3.0.0-2557 >Reporter: patcharee >Priority: Minor > > I have an external hive table stored as partitioned orc file (see the table > schema below). I tried to query from the table with where clause> > hiveContext.setConf("spark.sql.orc.filterPushdown", "true") > hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = > 117")). > But from the log file with debug logging level on, the ORC pushdown predicate > was not generated. > Unfortunately my table was not sorted when I inserted the data, but I > expected
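A cleaned-up version of the reproduction above (REPL line numbers stripped), assuming a spark-shell with {{sqlContext}} in scope and the "4D" directory written as in the comment. With pushdown enabled, the OrcInputFormat debug output should contain the generated "ORC pushdown predicate" for the non-partition columns x and y.
{code}
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

val df = sqlContext.read.format("orc").load("4D")  // path as written in the comment above
df.registerTempTable("4D")

// Partition columns (zone, year, z) are pruned before the scan; x and y are the
// columns expected to appear in the ORC pushdown predicate.
sqlContext.sql(
  """SELECT date, month, year, hh, u * 0.9122461, u * -0.40964267, z
    |FROM 4D
    |WHERE x = 320 AND y = 117 AND zone = 2 AND year = 2 AND z >= 2 AND z <= 8
  """.stripMargin).show()
{code}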
[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns
[ https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959604#comment-14959604 ] Herman van Hovell commented on SPARK-9241: -- It should grow linearly (or am I missing something). For example, if we have 3 grouping sets (like in the example), we would duplicate and project the data 3x. It is still bad, but similar to the approach in [~yhuai]'s example (saving a join). We could have a problem with the {{GROUPING__ID}} bitmask field: only 32/64 fields can be in a grouping set. > Supporting multiple DISTINCT columns > > > Key: SPARK-9241 > URL: https://issues.apache.org/jira/browse/SPARK-9241 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Right now the new aggregation code path only support a single distinct column > (you can use it in multiple aggregate functions in the query). We need to > support multiple distinct columns by generating a different plan for handling > multiple distinct columns (without change aggregate functions). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
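A hand-written SQL equivalent of the rewrite described in the comment, reusing the hypothetical {{payroll}} table from the earlier example and omitting the non-distinct aggregate for brevity. Each input row is projected once per DISTINCT group with a group id, so the intermediate data grows linearly with the number of distinct groups (2x here) before being de-duplicated and counted:
{code}
// Illustrative only: the planned rewrite does this with an Expand operator
// rather than a literal UNION ALL, but the data flow is the same.
sqlContext.sql("""
  SELECT dept,
         COUNT(CASE WHEN gid = 1 THEN value END) AS distinct_employees,
         COUNT(CASE WHEN gid = 2 THEN value END) AS distinct_projects
  FROM (
    SELECT DISTINCT dept, gid, value
    FROM (
      SELECT dept, 1 AS gid, employee_id AS value FROM payroll
      UNION ALL
      SELECT dept, 2 AS gid, project_id  AS value FROM payroll
    ) expanded
  ) deduped
  GROUP BY dept
""").show()
{code}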
[jira] [Updated] (SPARK-11133) Flaky test: o.a.s.launcher.LauncherServerSuite
[ https://issues.apache.org/jira/browse/SPARK-11133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11133: -- Labels: flaky-test (was: ) > Flaky test: o.a.s.launcher.LauncherServerSuite > -- > > Key: SPARK-11133 > URL: https://issues.apache.org/jira/browse/SPARK-11133 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Andrew Or >Priority: Critical > Labels: flaky-test > > {code} > sbt.ForkMain$ForkError: Expected exception caused by connection timeout. > at org.junit.Assert.fail(Assert.java:88) > at > org.apache.spark.launcher.LauncherServerSuite.testTimeout(LauncherServerSuite.java:140) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > {code} > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3769/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherServerSuite/testTimeout/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11133) Flaky test: o.a.s.launcher.LauncherServerSuite
Andrew Or created SPARK-11133: - Summary: Flaky test: o.a.s.launcher.LauncherServerSuite Key: SPARK-11133 URL: https://issues.apache.org/jira/browse/SPARK-11133 Project: Spark Issue Type: Bug Components: Tests Reporter: Andrew Or Priority: Critical {code} sbt.ForkMain$ForkError: Expected exception caused by connection timeout. at org.junit.Assert.fail(Assert.java:88) at org.apache.spark.launcher.LauncherServerSuite.testTimeout(LauncherServerSuite.java:140) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) {code} https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3769/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherServerSuite/testTimeout/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11134) Flaky test: o.a.s.launcher.LauncherBackendSuite
Andrew Or created SPARK-11134: - Summary: Flaky test: o.a.s.launcher.LauncherBackendSuite Key: SPARK-11134 URL: https://issues.apache.org/jira/browse/SPARK-11134 Project: Spark Issue Type: Bug Components: Tests Reporter: Andrew Or Priority: Critical {code} sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 110 times over 10.042591494 seconds. Last failure message: The reference was null. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.launcher.LauncherBackendSuite.org$apache$spark$launcher$LauncherBackendSuite$$testWithMaster(LauncherBackendSuite.scala:57) at org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply$mcV$sp(LauncherBackendSuite.scala:39) at org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39) at org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39) {code} https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3768/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherBackendSuite/local__launcher_handle/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11133) Flaky test: o.a.s.launcher.LauncherServerSuite
[ https://issues.apache.org/jira/browse/SPARK-11133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-11133. Resolution: Duplicate > Flaky test: o.a.s.launcher.LauncherServerSuite > -- > > Key: SPARK-11133 > URL: https://issues.apache.org/jira/browse/SPARK-11133 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Andrew Or >Priority: Critical > Labels: flaky-test > > {code} > sbt.ForkMain$ForkError: Expected exception caused by connection timeout. > at org.junit.Assert.fail(Assert.java:88) > at > org.apache.spark.launcher.LauncherServerSuite.testTimeout(LauncherServerSuite.java:140) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > {code} > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3769/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherServerSuite/testTimeout/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959589#comment-14959589 ] Joseph K. Bradley commented on SPARK-5874: -- Sure, that sounds good. Can you also please search for existing tickets and link them to the umbrella? > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > I created this JIRA to collect feedbacks about the ML pipeline API we > introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 > with confidence, which requires valuable input from the community. I'll > create sub-tasks for each major issue. > Design doc (WIP): > https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit# -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11134) Flaky test: o.a.s.launcher.LauncherBackendSuite
[ https://issues.apache.org/jira/browse/SPARK-11134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11134: -- Labels: flaky-test (was: ) > Flaky test: o.a.s.launcher.LauncherBackendSuite > --- > > Key: SPARK-11134 > URL: https://issues.apache.org/jira/browse/SPARK-11134 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Andrew Or >Priority: Critical > Labels: flaky-test > > {code} > sbt.ForkMain$ForkError: The code passed to eventually never returned > normally. Attempted 110 times over 10.042591494 seconds. Last failure > message: The reference was null. > at > org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) > at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) > at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) > at > org.apache.spark.launcher.LauncherBackendSuite.org$apache$spark$launcher$LauncherBackendSuite$$testWithMaster(LauncherBackendSuite.scala:57) > at > org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply$mcV$sp(LauncherBackendSuite.scala:39) > at > org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39) > at > org.apache.spark.launcher.LauncherBackendSuite$$anonfun$1$$anonfun$apply$1.apply(LauncherBackendSuite.scala:39) > {code} > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/3768/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.launcher/LauncherBackendSuite/local__launcher_handle/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite
[ https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11071: -- Component/s: (was: Spark Core) Tests > Flaky test: o.a.s.launcher.LauncherServerSuite > -- > > Key: SPARK-11071 > URL: https://issues.apache.org/jira/browse/SPARK-11071 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Labels: flaky-test > > This test has failed a few times on jenkins, e.g.: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite
[ https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11071: -- Labels: flaky-test (was: ) > Flaky test: o.a.s.launcher.LauncherServerSuite > -- > > Key: SPARK-11071 > URL: https://issues.apache.org/jira/browse/SPARK-11071 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Labels: flaky-test > > This test has failed a few times on jenkins, e.g.: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite
[ https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11071: -- Summary: Flaky test: o.a.s.launcher.LauncherServerSuite (was: LauncherServerSuite::testTimeout is flaky) > Flaky test: o.a.s.launcher.LauncherServerSuite > -- > > Key: SPARK-11071 > URL: https://issues.apache.org/jira/browse/SPARK-11071 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Labels: flaky-test > > This test has failed a few times on jenkins, e.g.: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11135) Exchange sort-planning logic may incorrect avoid sorts
[ https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11135: --- Description: In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases where the data has already been sorted by a superset of the requested sorting columns. For instance, let's say that a query calls for an operator's input to be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` alone will not satisfy the ordering requirements, requiring an additional sort to be planned by Exchange. However, the current Exchange code gets this wrong and incorrectly skips sorting when the existing output ordering is a subset of the required ordering. This is simple to fix, however. This bug was introduced in https://github.com/apache/spark/pull/7458, so it affects 1.5.0+. was: In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases where the data has already been sorted by a superset of the requested sorting columns. For instance, let's say that a query calls for an operator's input to be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` alone will not satisfy the ordering requirements, requiring an additional sort to be planned by Exchange. However, the current Exchange code gets this wrong and incorrectly skips sorting when the existing output ordering is a subset of the required ordering. This is simple to fix, however. > Exchange sort-planning logic may incorrect avoid sorts > -- > > Key: SPARK-11135 > URL: https://issues.apache.org/jira/browse/SPARK-11135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases > where the data has already been sorted by a superset of the requested sorting > columns. For instance, let's say that a query calls for an operator's input > to be sorted by `a.asc` and the input happens to already be sorted by > `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The > converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then > `a.asc` alone will not satisfy the ordering requirements, requiring an > additional sort to be planned by Exchange. > However, the current Exchange code gets this wrong and incorrectly skips > sorting when the existing output ordering is a subset of the required > ordering. This is simple to fix, however. > This bug was introduced in https://github.com/apache/spark/pull/7458, so it > affects 1.5.0+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
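A standalone illustration of the planning rule (not the actual Exchange code): an existing output ordering satisfies a required ordering only when the required ordering is a prefix of it; checking the relation in the other direction, as the bug does, skips sorts that are still needed.
{code}
// Minimal stand-in for Catalyst's SortOrder; column name plus direction is enough here.
case class SortOrder(column: String, ascending: Boolean = true)

def orderingSatisfied(existing: Seq[SortOrder], required: Seq[SortOrder]): Boolean =
  required.isEmpty || existing.take(required.length) == required

val ab = Seq(SortOrder("a"), SortOrder("b"))
val a  = Seq(SortOrder("a"))

assert(orderingSatisfied(existing = ab, required = a))   // sorted by [a, b] satisfies [a]: no extra sort
assert(!orderingSatisfied(existing = a, required = ab))  // sorted by [a] does NOT satisfy [a, b]: must sort
{code}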
[jira] [Created] (SPARK-11135) Exchange sort-planning logic may incorrect avoid sorts
Josh Rosen created SPARK-11135: -- Summary: Exchange sort-planning logic may incorrect avoid sorts Key: SPARK-11135 URL: https://issues.apache.org/jira/browse/SPARK-11135 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases where the data has already been sorted by a superset of the requested sorting columns. For instance, let's say that a query calls for an operator's input to be sorted by `a.asc` and the input happens to already be sorted by `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then `a.asc` alone will not satisfy the ordering requirements, requiring an additional sort to be planned by Exchange. However, the current Exchange code gets this wrong and incorrectly skips sorting when the existing output ordering is a subset of the required ordering. This is simple to fix, however. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
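To make the condition described above concrete, here is a minimal sketch (not the actual Exchange code; SortColumn and orderingSatisfied are hypothetical names): a required ordering is satisfied only when it is a prefix of the existing output ordering, never when the existing ordering is merely a subset of what is required.

{code}
// Hypothetical sketch of the prefix check described above.
case class SortColumn(name: String, ascending: Boolean)

def orderingSatisfied(existing: Seq[SortColumn], required: Seq[SortColumn]): Boolean =
  required.isEmpty ||
    (required.length <= existing.length &&
      required.zip(existing).forall { case (req, ex) => req == ex })

// Existing ordering [a.asc, b.asc] satisfies a request for a.asc -- no extra sort needed.
orderingSatisfied(
  existing = Seq(SortColumn("a", ascending = true), SortColumn("b", ascending = true)),
  required = Seq(SortColumn("a", ascending = true)))   // true

// Existing ordering a.asc does NOT satisfy a request for [a.asc, b.asc] -- Exchange must plan a sort.
orderingSatisfied(
  existing = Seq(SortColumn("a", ascending = true)),
  required = Seq(SortColumn("a", ascending = true), SortColumn("b", ascending = true)))   // false
{code}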
[jira] [Updated] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is subset of required ordering
[ https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11135: --- Summary: Exchange sort-planning logic incorrectly avoid sorts when existing ordering is subset of required ordering (was: Exchange sort-planning logic may incorrect avoid sorts) > Exchange sort-planning logic incorrectly avoid sorts when existing ordering > is subset of required ordering > -- > > Key: SPARK-11135 > URL: https://issues.apache.org/jira/browse/SPARK-11135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases > where the data has already been sorted by a superset of the requested sorting > columns. For instance, let's say that a query calls for an operator's input > to be sorted by `a.asc` and the input happens to already be sorted by > `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The > converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then > `a.asc` alone will not satisfy the ordering requirements, requiring an > additional sort to be planned by Exchange. > However, the current Exchange code gets this wrong and incorrectly skips > sorting when the existing output ordering is a subset of the required > ordering. This is simple to fix, however. > This bug was introduced in https://github.com/apache/spark/pull/7458, so it > affects 1.5.0+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering
[ https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11135: --- Summary: Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering (was: Exchange sort-planning logic incorrectly avoid sorts when existing ordering is subset of required ordering) > Exchange sort-planning logic incorrectly avoid sorts when existing ordering > is non-empty subset of required ordering > > > Key: SPARK-11135 > URL: https://issues.apache.org/jira/browse/SPARK-11135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases > where the data has already been sorted by a superset of the requested sorting > columns. For instance, let's say that a query calls for an operator's input > to be sorted by `a.asc` and the input happens to already be sorted by > `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The > converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then > `a.asc` alone will not satisfy the ordering requirements, requiring an > additional sort to be planned by Exchange. > However, the current Exchange code gets this wrong and incorrectly skips > sorting when the existing output ordering is a subset of the required > ordering. This is simple to fix, however. > This bug was introduced in https://github.com/apache/spark/pull/7458, so it > affects 1.5.0+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11071) Flaky test: o.a.s.launcher.LauncherServerSuite
[ https://issues.apache.org/jira/browse/SPARK-11071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-11071. --- Resolution: Fixed Fix Version/s: 1.6.0 Target Version/s: 1.6.0 > Flaky test: o.a.s.launcher.LauncherServerSuite > -- > > Key: SPARK-11071 > URL: https://issues.apache.org/jira/browse/SPARK-11071 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Labels: flaky-test > Fix For: 1.6.0 > > > This test has failed a few times on jenkins, e.g.: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/lastCompletedBuild/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10515) When killing executor, the pending replacement executors will be lost
[ https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10515. --- Resolution: Fixed Assignee: KaiXinXIaoLei Fix Version/s: 1.6.0 1.5.2 Target Version/s: 1.5.2, 1.6.0 > When killing executor, the pending replacement executors will be lost > - > > Key: SPARK-10515 > URL: https://issues.apache.org/jira/browse/SPARK-10515 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: KaiXinXIaoLei >Assignee: KaiXinXIaoLei > Fix For: 1.5.2, 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10412) In SQL tab, show execution memory per physical operator
[ https://issues.apache.org/jira/browse/SPARK-10412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-10412. --- Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 1.6.0 > In SQL tab, show execution memory per physical operator > --- > > Key: SPARK-10412 > URL: https://issues.apache.org/jira/browse/SPARK-10412 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Wenchen Fan > Fix For: 1.6.0 > > > We already display it per task / stage. It's really useful to also display it > per operator so the user can know which one caused all the memory to be > allocated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11136) Warm-start support for ML estimator
Xusen Yin created SPARK-11136: - Summary: Warm-start support for ML estimator Key: SPARK-11136 URL: https://issues.apache.org/jira/browse/SPARK-11136 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xusen Yin Priority: Minor The current implementation of Estimator does not support warm-start fitting, i.e. estimator.fit(data, params, partialModel). This is an umbrella JIRA for adding warm-start support to the ML estimators. Possible solutions: 1. Add a warm-start fitting interface such as def fit(dataset: DataFrame, initModel: M, paramMap: ParamMap): M 2. Treat the model as a special parameter and pass it through the ParamMap, e.g. val partialModel: Param[Option[M]] = new Param(...). If a partial model is supplied, we use it to warm-start training; otherwise we start the training process from the beginning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
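As a rough sketch of option 2 above (the names HasInitialModel/initialModel are hypothetical; this is not an existing Spark API), the partial model could be exposed as an ordinary Param that an estimator consults inside fit():

{code}
import org.apache.spark.ml.param.{Param, Params}

// Illustrative trait only: an optional model parameter used to warm-start fitting.
trait HasInitialModel[M] extends Params {
  final val initialModel: Param[Option[M]] =
    new Param[Option[M]](this, "initialModel", "optional model used to warm-start fitting")
  setDefault(initialModel -> None)

  final def getInitialModel: Option[M] = $(initialModel)
}
// An estimator mixing in this trait would check getInitialModel in fit():
// if a partial model is present, continue training from it; otherwise train from scratch.
{code}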
[jira] [Assigned] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering
[ https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11135: Assignee: Apache Spark (was: Josh Rosen) > Exchange sort-planning logic incorrectly avoid sorts when existing ordering > is non-empty subset of required ordering > > > Key: SPARK-11135 > URL: https://issues.apache.org/jira/browse/SPARK-11135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Blocker > > In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases > where the data has already been sorted by a superset of the requested sorting > columns. For instance, let's say that a query calls for an operator's input > to be sorted by `a.asc` and the input happens to already be sorted by > `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The > converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then > `a.asc` alone will not satisfy the ordering requirements, requiring an > additional sort to be planned by Exchange. > However, the current Exchange code gets this wrong and incorrectly skips > sorting when the existing output ordering is a subset of the required > ordering. This is simple to fix, however. > This bug was introduced in https://github.com/apache/spark/pull/7458, so it > affects 1.5.0+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering
[ https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959735#comment-14959735 ] Apache Spark commented on SPARK-11135: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/9140 > Exchange sort-planning logic incorrectly avoid sorts when existing ordering > is non-empty subset of required ordering > > > Key: SPARK-11135 > URL: https://issues.apache.org/jira/browse/SPARK-11135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases > where the data has already been sorted by a superset of the requested sorting > columns. For instance, let's say that a query calls for an operator's input > to be sorted by `a.asc` and the input happens to already be sorted by > `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The > converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then > `a.asc` alone will not satisfy the ordering requirements, requiring an > additional sort to be planned by Exchange. > However, the current Exchange code gets this wrong and incorrectly skips > sorting when the existing output ordering is a subset of the required > ordering. This is simple to fix, however. > This bug was introduced in https://github.com/apache/spark/pull/7458, so it > affects 1.5.0+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11135) Exchange sort-planning logic incorrectly avoid sorts when existing ordering is non-empty subset of required ordering
[ https://issues.apache.org/jira/browse/SPARK-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11135: Assignee: Josh Rosen (was: Apache Spark) > Exchange sort-planning logic incorrectly avoid sorts when existing ordering > is non-empty subset of required ordering > > > Key: SPARK-11135 > URL: https://issues.apache.org/jira/browse/SPARK-11135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > In Spark SQL, the Exchange planner tries to avoid unnecessary sorts in cases > where the data has already been sorted by a superset of the requested sorting > columns. For instance, let's say that a query calls for an operator's input > to be sorted by `a.asc` and the input happens to already be sorted by > `[a.asc, b.asc]`. In this case, we do not need to re-sort the input. The > converse, however, is not true: if the query calls for `[a.asc, b.asc]`, then > `a.asc` alone will not satisfy the ordering requirements, requiring an > additional sort to be planned by Exchange. > However, the current Exchange code gets this wrong and incorrectly skips > sorting when the existing output ordering is a subset of the required > ordering. This is simple to fix, however. > This bug was introduced in https://github.com/apache/spark/pull/7458, so it > affects 1.5.0+. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work
[ https://issues.apache.org/jira/browse/SPARK-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10829: -- Assignee: Cheng Hao > Scan DataSource with predicate expression combine partition key and > attributes doesn't work > --- > > Key: SPARK-10829 > URL: https://issues.apache.org/jira/browse/SPARK-10829 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Critical > Fix For: 1.6.0 > > > To reproduce that with the code: > {code} > withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") { > withTempPath { dir => > val path = s"${dir.getCanonicalPath}/part=1" > (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path) > // If the "part = 1" filter gets pushed down, this query will throw > an exception since > // "part" is not a valid column in the actual Parquet file > checkAnswer( > sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > > 1)"), > (2 to 3).map(i => Row(i, i.toString, 1))) > } > } > {code} > We expect the result as: > {code} > 2, 1 > 3, 1 > {code} > But we got: > {code} > 1, 1 > 2, 1 > 3, 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5391) SparkSQL fails to create tables with custom JSON SerDe
[ https://issues.apache.org/jira/browse/SPARK-5391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5391: - Assignee: Davies Liu > SparkSQL fails to create tables with custom JSON SerDe > -- > > Key: SPARK-5391 > URL: https://issues.apache.org/jira/browse/SPARK-5391 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: David Ross >Assignee: Davies Liu > Fix For: 1.6.0 > > > - Using Spark built from trunk on this commit: > https://github.com/apache/spark/commit/bc20a52b34e826895d0dcc1d783c021ebd456ebd > - Build for Hive13 > - Using this JSON serde: https://github.com/rcongiu/Hive-JSON-Serde > First download jar locally: > {code} > $ curl > http://www.congiu.net/hive-json-serde/1.3/cdh5/json-serde-1.3-jar-with-dependencies.jar > > /tmp/json-serde-1.3-jar-with-dependencies.jar > {code} > Then add it in SparkSQL session: > {code} > add jar /tmp/json-serde-1.3-jar-with-dependencies.jar > {code} > Finally create table: > {code} > create table test_json (c1 boolean) ROW FORMAT SERDE > 'org.openx.data.jsonserde.JsonSerDe'; > {code} > Logs for add jar: > {code} > 15/01/23 23:48:33 INFO thriftserver.SparkExecuteStatementOperation: Running > query 'add jar /tmp/json-serde-1.3-jar-with-dependencies.jar' > 15/01/23 23:48:34 INFO session.SessionState: No Tez session required at this > point. hive.execution.engine=mr. > 15/01/23 23:48:34 INFO SessionState: Added > /tmp/json-serde-1.3-jar-with-dependencies.jar to class path > 15/01/23 23:48:34 INFO SessionState: Added resource: > /tmp/json-serde-1.3-jar-with-dependencies.jar > 15/01/23 23:48:34 INFO spark.SparkContext: Added JAR > /tmp/json-serde-1.3-jar-with-dependencies.jar at > http://192.168.99.9:51312/jars/json-serde-1.3-jar-with-dependencies.jar with > timestamp 1422056914776 > 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result > Schema: List() > 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result > Schema: List() > {code} > Logs (with error) for create table: > {code} > 15/01/23 23:49:00 INFO thriftserver.SparkExecuteStatementOperation: Running > query 'create table test_json (c1 boolean) ROW FORMAT SERDE > 'org.openx.data.jsonserde.JsonSerDe'' > 15/01/23 23:49:00 INFO parse.ParseDriver: Parsing command: create table > test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' > 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed > 15/01/23 23:49:01 INFO session.SessionState: No Tez session required at this > point. hive.execution.engine=mr. 
> 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO ql.Driver: Concurrency mode is disabled, not creating > a lock manager > 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO parse.ParseDriver: Parsing command: create table > test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' > 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed > 15/01/23 23:49:01 INFO log.PerfLogger: start=1422056941103 end=1422056941104 duration=1 > from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Starting Semantic Analysis > 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Creating table test_json > position=13 > 15/01/23 23:49:01 INFO ql.Driver: Semantic Analysis Completed > 15/01/23 23:49:01 INFO log.PerfLogger: start=1422056941104 end=1422056941240 duration=136 > from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO ql.Driver: Returning Hive schema: > Schema(fieldSchemas:null, properties:null) > 15/01/23 23:49:01 INFO log.PerfLogger: start=1422056941071 end=1422056941252 duration=181 > from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO ql.Driver: Starting command: create table test_json > (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' > 15/01/23 23:49:01 INFO log.PerfLogger: start=1422056941067 end=1422056941258 duration=191 > from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 INFO log.PerfLogger: from=org.apache.hadoop.hive.ql.Driver> > 15/01/23 23:49:01 WARN security.ShellBasedUnixGroupsMapping: got exception > trying to get groups for user anonymous > org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user > at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) >
[jira] [Updated] (SPARK-11032) Failure to resolve having correctly
[ https://issues.apache.org/jira/browse/SPARK-11032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11032: -- Assignee: Wenchen Fan > Failure to resolve having correctly > --- > > Key: SPARK-11032 > URL: https://issues.apache.org/jira/browse/SPARK-11032 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 >Reporter: Michael Armbrust >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 1.6.0 > > > This is a regression from Spark 1.4 > {code} > Seq(("michael", 30)).toDF("name", "age").registerTempTable("people") > sql("SELECT MIN(t0.age) FROM (SELECT * FROM PEOPLE WHERE age > 0) t0 > HAVING(COUNT(1) > 0)").explain(true) > == Parsed Logical Plan == > 'Filter cast(('COUNT(1) > 0) as boolean) > 'Project [unresolvedalias('MIN('t0.age))] > 'Subquery t0 >'Project [unresolvedalias(*)] > 'Filter ('age > 0) > 'UnresolvedRelation [PEOPLE], None > == Analyzed Logical Plan == > _c0: int > Filter cast((count(1) > cast(0 as bigint)) as boolean) > Aggregate [min(age#6) AS _c0#9] > Subquery t0 >Project [name#5,age#6] > Filter (age#6 > 0) > Subquery people > Project [_1#3 AS name#5,_2#4 AS age#6] >LocalRelation [_1#3,_2#4], [[michael,30]] > == Optimized Logical Plan == > Filter (count(1) > 0) > Aggregate [min(age#6) AS _c0#9] > Project [_2#4 AS age#6] >Filter (_2#4 > 0) > LocalRelation [_1#3,_2#4], [[michael,30]] > == Physical Plan == > Filter (count(1) > 0) > TungstenAggregate(key=[], > functions=[(min(age#6),mode=Final,isDistinct=false)], output=[_c0#9]) > TungstenExchange SinglePartition >TungstenAggregate(key=[], > functions=[(min(age#6),mode=Partial,isDistinct=false)], output=[min#12]) > TungstenProject [_2#4 AS age#6] > Filter (_2#4 > 0) > LocalTableScan [_1#3,_2#4], [[michael,30]] > Code Generation: true > {code} > {code} > Caused by: java.lang.UnsupportedOperationException: Cannot evaluate > expression: count(1) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:188) > at > org.apache.spark.sql.catalyst.expressions.Count.eval(aggregates.scala:156) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:327) > at > org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38) > at > org.apache.spark.sql.catalyst.expressions.InterpretedPredicate$$anonfun$create$2.apply(predicates.scala:38) > at > org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117) > at > org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11076) Decimal Support for Ceil/Floor
[ https://issues.apache.org/jira/browse/SPARK-11076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11076: -- Assignee: Cheng Hao > Decimal Support for Ceil/Floor > -- > > Key: SPARK-11076 > URL: https://issues.apache.org/jira/browse/SPARK-11076 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao > Fix For: 1.6.0 > > > Currently, Ceil & Floor don't support decimal, but Hive does. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
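For illustration, the requested behaviour looks like the following (assuming a SQLContext in scope named sqlContext, as in the other snippets in this digest):

{code}
// CEIL/FLOOR applied to a DECIMAL argument, which Hive already accepts.
sqlContext.sql(
  "SELECT ceil(cast(3.1415 as decimal(10, 4))), floor(cast(3.1415 as decimal(10, 4)))"
).show()
// expected result: 4 and 3
{code}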
[jira] [Updated] (SPARK-11068) Add callback to query execution
[ https://issues.apache.org/jira/browse/SPARK-11068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11068: -- Assignee: Wenchen Fan > Add callback to query execution > --- > > Key: SPARK-11068 > URL: https://issues.apache.org/jira/browse/SPARK-11068 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11123) Improve HistoryServer with multithreading to replay logs
[ https://issues.apache.org/jira/browse/SPARK-11123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11123. --- Resolution: Duplicate [~xietingwen] please search JIRAs before opening a new one. > Improve HistoryServer with multithreading to replay logs > > > Key: SPARK-11123 > URL: https://issues.apache.org/jira/browse/SPARK-11123 > Project: Spark > Issue Type: Improvement >Reporter: Xie Tingwen > > Now, with Spark 1.4, when I restart the HistoryServer it takes over 30 hours to > replay over 40,000 log files. What's more, once it has started, replaying a single log > may take half an hour and block the other logs from being replayed. How about > rewriting it with multiple threads to speed up log replay? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
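A rough sketch of the idea in the report (illustrative only, not the HistoryServer implementation; replayAll/replayOne are placeholder names): replay event logs on a fixed-size thread pool so that one very large log does not block all the others.

{code}
import java.util.concurrent.{Executors, TimeUnit}

// replayOne stands in for whatever parses and replays a single event log.
def replayAll(logPaths: Seq[String], numThreads: Int)(replayOne: String => Unit): Unit = {
  val pool = Executors.newFixedThreadPool(numThreads)
  logPaths.foreach { path =>
    pool.submit(new Runnable {
      override def run(): Unit = replayOne(path)
    })
  }
  pool.shutdown()                                          // no new tasks accepted
  pool.awaitTermination(Long.MaxValue, TimeUnit.SECONDS)   // wait for all replays to finish
}
{code}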
[jira] [Updated] (SPARK-11124) JsonParser/Generator should be closed for resource recycle
[ https://issues.apache.org/jira/browse/SPARK-11124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11124: -- Component/s: Spark Core > JsonParser/Generator should be closed for resource recycle > -- > > Key: SPARK-11124 > URL: https://issues.apache.org/jira/browse/SPARK-11124 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Navis >Priority: Trivial > > Some json parsers are not closed. parser in JacksonParser#parseJson, for > example. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
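For context, the leak pattern being described is the usual one with Jackson: a JsonParser holds recyclable buffers, so it should be closed in a finally block even if parsing fails part-way through. A minimal illustration (not the JacksonParser code itself):

{code}
import com.fasterxml.jackson.core.JsonFactory

def countTokens(factory: JsonFactory, json: String): Int = {
  val parser = factory.createParser(json)
  try {
    var count = 0
    while (parser.nextToken() != null) {
      count += 1
    }
    count
  } finally {
    parser.close()   // releases the parser's internal buffers for recycling
  }
}
{code}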
[jira] [Updated] (SPARK-11128) strange NPE when writing in non-existing S3 bucket
[ https://issues.apache.org/jira/browse/SPARK-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11128: -- Component/s: Input/Output > strange NPE when writing in non-existing S3 bucket > -- > > Key: SPARK-11128 > URL: https://issues.apache.org/jira/browse/SPARK-11128 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.5.1 >Reporter: mathieu despriee >Priority: Minor > > For the record, as it's relatively minor, and related to s3n (not tested with > s3a). > By mistake, we tried writing a parquet dataframe to a non-existing s3 bucket, > with a simple df.write.parquet(s3path). > We got a NPE (see stack trace below), which is very misleading. > java.lang.NullPointerException > at > org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:73) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11102) Uninformative exception when specifying non-existent input for JSON data source
[ https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-11102: --- Summary: Uninformative exception when specifing non-exist input for JSON data source (was: Unreadable exception when specifing non-exist input for JSON data source) > Uninformative exception when specifing non-exist input for JSON data source > --- > > Key: SPARK-11102 > URL: https://issues.apache.org/jira/browse/SPARK-11102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Priority: Minor > > If I specify a non-exist input path for json data source, the following > exception will be thrown, it is not readable. > {code} > 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 19.9 KB, free 251.4 KB) > 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB) > 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at > :19 > java.io.IOException: No input paths specified in job > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at > org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085) > at > org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100) > at > org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37) > at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:28) > at $iwC$$iwC$$iwC$$iwC.(:30) > at $iwC$$iwC$$iwC.(:32) > at $iwC$$iwC.(:34) > at $iwC.(:36) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
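A hypothetical sketch of the kind of up-front check that would make this failure readable (not the actual JSONRelation code): resolve each input path before building the RDD and fail with a message that names the missing path.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def requireInputPathExists(pathString: String, hadoopConf: Configuration): Unit = {
  val path = new Path(pathString)
  val fs = path.getFileSystem(hadoopConf)
  if (!fs.exists(path)) {
    // Much clearer than the "No input paths specified in job" IOException above.
    throw new IllegalArgumentException(s"Input path does not exist: $pathString")
  }
}
{code}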
[jira] [Created] (SPARK-11137) Make StreamingContext.stop() exception-safe
Felix Cheung created SPARK-11137: Summary: Make StreamingContext.stop() exception-safe Key: SPARK-11137 URL: https://issues.apache.org/jira/browse/SPARK-11137 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.5.1 Reporter: Felix Cheung Priority: Minor In StreamingContext.stop(), when an exception is thrown the rest of the stop/cleanup action is aborted. Discussed in https://github.com/apache/spark/pull/9116, srowen commented Hm, this is getting unwieldy. There are several nested try blocks here. The same argument goes for many of these methods -- if one fails should they not continue trying? A more tidy solution would be to execute a series of () -> Unit code blocks that perform some cleanup and make sure that they each fire in succession, regardless of the others. The final one to remove the shutdown hook could occur outside synchronization. I realize we're expanding the scope of the change here, but is it maybe worthwhile to go all the way here? Really, something similar could be done for SparkContext and there's an existing JIRA for it somewhere. At least, I'd prefer to either narrowly fix the deadlock here, or fix all of the finally-related issue separately and all at once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
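A minimal sketch of the approach suggested in that discussion (illustrative only, not the actual StreamingContext code; the step names are hypothetical): run each cleanup step as an independent block so that a failure in one step does not abort the remaining steps.

{code}
import scala.util.control.NonFatal

def runAllIgnoringNonFatal(steps: Seq[(String, () => Unit)]): Unit = {
  steps.foreach { case (name, step) =>
    try {
      step()
    } catch {
      case NonFatal(e) => println(s"Error while stopping ($name), continuing: $e")
    }
  }
}

// Every step still runs even though the first one throws.
runAllIgnoringNonFatal(Seq(
  "stop receivers"       -> (() => throw new RuntimeException("boom")),
  "stop job scheduler"   -> (() => println("job scheduler stopped")),
  "remove shutdown hook" -> (() => println("shutdown hook removed"))
))
{code}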
[jira] [Created] (SPARK-11138) Flaky pyspark test: test_add_py_file
Marcelo Vanzin created SPARK-11138: -- Summary: Flaky pyspark test: test_add_py_file Key: SPARK-11138 URL: https://issues.apache.org/jira/browse/SPARK-11138 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.6.0 Reporter: Marcelo Vanzin This test fails pretty often when running PR tests. For example: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43800/console {noformat} == ERROR: test_add_py_file (__main__.AddFileTests) -- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", line 396, in test_add_py_file res = self.sc.parallelize(range(2)).map(func).first() File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", line 1315, in first rs = self.take(1) File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", line 1297, in take res = self.context.runJob(self, takeUpToNumLeft, p) File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/context.py", line 923, in runJob port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ self.target_id, self.name) File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value format(target_id, '.', name), value) Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 3.0 failed 1 times, most recent failure: Lost task 2.0 in stage 3.0 (TID 7, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main process() File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream vs = list(itertools.islice(iterator, batch)) File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", line 1293, in takeUpToNumLeft yield next(iterator) File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/tests.py", line 388, in func from userlibrary import UserClass ImportError: cannot import name UserClass at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1427) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1415) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1414) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1414) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:793) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:793) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:793) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1639) at org.