[jira] [Created] (SPARK-13601) Invoke task failure callbacks before calling outputstream.close()
Davies Liu created SPARK-13601: -- Summary: Invoke task failure callbacks before calling outputstream.close() Key: SPARK-13601 URL: https://issues.apache.org/jira/browse/SPARK-13601 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Davies Liu Assignee: Davies Liu We need to submit another PR against Spark to call the task failure callbacks before Spark calls the close function on various output streams. For example, we need to intercept an exception and call TaskContext.markTaskFailed before calling close in the following code (in PairRDDFunctions.scala):
{code}
Utils.tryWithSafeFinally {
  while (iter.hasNext) {
    val record = iter.next()
    writer.write(record._1.asInstanceOf[AnyRef], record._2.asInstanceOf[AnyRef])

    // Update bytes written metric every few records
    maybeUpdateOutputMetrics(outputMetricsAndBytesWrittenCallback, recordsWritten)
    recordsWritten += 1
  }
} {
  writer.close()
}
{code}
Changes to Spark should include unit tests to make sure this always works in the future.
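For illustration, a minimal sketch of the desired behavior (the helper name tryWithSafeFinallyAndFailureCallbacks and its exact semantics are assumptions, not the merged API): on failure, notify TaskContext before the finally block runs close().
{code}
import org.apache.spark.TaskContext

// Sketch only: invoke the task failure callbacks before the finally block (e.g. close()).
def tryWithSafeFinallyAndFailureCallbacks[T](block: => T)(finallyBlock: => Unit): T = {
  var original: Throwable = null
  try {
    block
  } catch {
    case t: Throwable =>
      original = t
      // Run the task failure callbacks before close() gets a chance to throw.
      Option(TaskContext.get()).foreach(_.markTaskFailed(t))
      throw t
  } finally {
    try {
      finallyBlock
    } catch {
      case t: Throwable if original != null =>
        // Don't let an exception from close() mask the original task failure.
        original.addSuppressed(t)
    }
  }
}
{code}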
[jira] [Resolved] (SPARK-13582) Improve performance of parquet reader with dictionary encoding
[ https://issues.apache.org/jira/browse/SPARK-13582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13582. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11437 [https://github.com/apache/spark/pull/11437] > Improve performance of parquet reader with dictionary encoding > -- > > Key: SPARK-13582 > URL: https://issues.apache.org/jira/browse/SPARK-13582 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > Right now, we replace the ids with values from a dictionary before accessing a > column. We could defer that; especially when some rows are filtered out, we > would not need to look up the dictionary for those rows.
[jira] [Created] (SPARK-13598) Remove LeftSemiJoinBNL
Davies Liu created SPARK-13598: -- Summary: Remove LeftSemiJoinBNL Key: SPARK-13598 URL: https://issues.apache.org/jira/browse/SPARK-13598 Project: Spark Issue Type: Task Components: SQL Reporter: Davies Liu Broadcast left semi join without join keys is already supported by BroadcastNestedLoopJoin, which has the same implementation as LeftSemiJoinBNL, so we should remove the latter.
[jira] [Resolved] (SPARK-13511) Add wholestage codegen for limit
[ https://issues.apache.org/jira/browse/SPARK-13511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13511. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11391 [https://github.com/apache/spark/pull/11391] > Add wholestage codegen for limit > > > Key: SPARK-13511 > URL: https://issues.apache.org/jira/browse/SPARK-13511 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 2.0.0 > > > The current limit operator doesn't support whole-stage codegen. This issue > tracks adding support for it.
[jira] [Created] (SPARK-13582) Improve performance of parquet reader with dictionary encoding
Davies Liu created SPARK-13582: -- Summary: Improve performance of parquet reader with dictionary encoding Key: SPARK-13582 URL: https://issues.apache.org/jira/browse/SPARK-13582 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Right now, we replace the ids with values from a dictionary before accessing a column. We could defer that; especially when some rows are filtered out, we would not need to look up the dictionary for those rows.
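For illustration, a minimal sketch of the deferral, with plain Scala arrays standing in for the reader's column vectors (all names here are made up, not the actual vectorized reader API):
{code}
// Hypothetical sketch: eager vs. deferred dictionary decoding.
val dictionary = Array("red", "green", "blue")  // values from the dictionary page
val ids        = Array(0, 2, 1, 2, 0)           // encoded column: ids into the dictionary
val survivors  = Array(1, 3)                    // row indices left after filtering

// Eager: decode every row up front, even those the filter will drop.
val eager = ids.map(dictionary(_))

// Deferred: keep the ids and decode only the rows that survive the filter.
val deferred = survivors.map(i => dictionary(ids(i)))  // Array("blue", "blue")
{code}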
[jira] [Commented] (SPARK-13541) Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType
[ https://issues.apache.org/jira/browse/SPARK-13541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171216#comment-15171216 ] Davies Liu commented on SPARK-13541: [~nongli] Since the test failed in the vectorized parquet reader, I assigned this to you. > Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType > --- > > Key: SPARK-13541 > URL: https://issues.apache.org/jira/browse/SPARK-13541 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Nong Li > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52140/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/
[jira] [Issue Comment Deleted] (SPARK-13415) Visualize subquery in SQL web UI
[ https://issues.apache.org/jira/browse/SPARK-13415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13415: --- Comment: was deleted (was: [~nongli] Since the test failed in the vectorized parquet reader, I assigned this to you.) > Visualize subquery in SQL web UI > > > Key: SPARK-13415 > URL: https://issues.apache.org/jira/browse/SPARK-13415 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu > > Right now, uncorrelated scalar subqueries are not shown in the SQL tab.
[jira] [Commented] (SPARK-13415) Visualize subquery in SQL web UI
[ https://issues.apache.org/jira/browse/SPARK-13415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171215#comment-15171215 ] Davies Liu commented on SPARK-13415: [~nongli] Since the test failed in the vectorized parquet reader, I assigned this to you. > Visualize subquery in SQL web UI > > > Key: SPARK-13415 > URL: https://issues.apache.org/jira/browse/SPARK-13415 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu > > Right now, uncorrelated scalar subqueries are not shown in the SQL tab.
[jira] [Created] (SPARK-13541) Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType
Davies Liu created SPARK-13541: -- Summary: Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType Key: SPARK-13541 URL: https://issues.apache.org/jira/browse/SPARK-13541 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Davies Liu Assignee: Nong Li https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52140/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/
[jira] [Assigned] (SPARK-13415) Visualize subquery in SQL web UI
[ https://issues.apache.org/jira/browse/SPARK-13415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-13415: -- Assignee: Davies Liu > Visualize subquery in SQL web UI > > > Key: SPARK-13415 > URL: https://issues.apache.org/jira/browse/SPARK-13415 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, uncorrelated scalar subqueries are not shown in the SQL tab.
[jira] [Created] (SPARK-13523) Reuse the exchanges in a query
Davies Liu created SPARK-13523: -- Summary: Reuse the exchanges in a query Key: SPARK-13523 URL: https://issues.apache.org/jira/browse/SPARK-13523 Project: Spark Issue Type: New Feature Components: SQL Reporter: Davies Liu In an exchange, the RDD is materialized (shuffled or collected), which makes it a good point at which to eliminate common parts of a query. In some TPCDS queries (for example, Q64), the same exchange (ShuffleExchange or BroadcastExchange) can be used multiple times; we should reuse them to avoid duplicated work and reduce the memory needed for broadcasts.
[jira] [Resolved] (SPARK-13465) Add a task failure listener to TaskContext
[ https://issues.apache.org/jira/browse/SPARK-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13465. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11340 [https://github.com/apache/spark/pull/11340] > Add a task failure listener to TaskContext > -- > > Key: SPARK-13465 > URL: https://issues.apache.org/jira/browse/SPARK-13465 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > TaskContext supports task completion callback, which gets called regardless > of task failures. However, there is no way for the listener to know if there > is an error. This ticket proposes adding a new listener that gets called when > a task fails.
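A hedged usage sketch of the new listener (the signature follows this ticket's description and may differ in detail from the merged PR):
{code}
import org.apache.spark.{SparkContext, TaskContext}

// Log task failures from inside the task; the listener runs only when the
// task fails, unlike the completion callback.
def runWithFailureLogging(sc: SparkContext): Unit =
  sc.parallelize(1 to 100).foreachPartition { iter =>
    TaskContext.get().addTaskFailureListener { (ctx, error) =>
      System.err.println(s"Task ${ctx.taskAttemptId()} failed: ${error.getMessage}")
    }
    iter.foreach(_ => ())  // task body
  }
{code}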
[jira] [Resolved] (SPARK-13499) Optimize vectorized parquet reader for dictionary encoded data and RLE decoding
[ https://issues.apache.org/jira/browse/SPARK-13499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13499. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11375 [https://github.com/apache/spark/pull/11375] > Optimize vectorized parquet reader for dictionary encoded data and RLE > decoding > --- > > Key: SPARK-13499 > URL: https://issues.apache.org/jira/browse/SPARK-13499 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li > Fix For: 2.0.0
[jira] [Resolved] (SPARK-12313) getPartitionsByFilter doesn't handle predicates on all / multiple Partition Columns
[ https://issues.apache.org/jira/browse/SPARK-12313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12313. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11328 [https://github.com/apache/spark/pull/11328] > getPartitionsByFilter doesn't handle predicates on all / multiple Partition > Columns > -- > > Key: SPARK-12313 > URL: https://issues.apache.org/jira/browse/SPARK-12313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Gobinathan SP >Priority: Critical > Fix For: 2.0.0 > > > When spark.sql.hive.metastorePartitionPruning is enabled, getPartitionsByFilter > is used. > For a table partitioned by p1 and p2, when hc.sql("select col > from tabl1 where p1='p1V' and p2= 'p2V' ") is triggered, > the HiveShim identifies the predicates, and convertFilters returns p1='p1V' > and p2= 'p2V'. The same is passed to the getPartitionsByFilter method as the > filter string. > In these cases the partitions are not returned from Hive's > getPartitionsByFilter method. As a result, the number of rows returned for the > sql is always zero. > However, a filter on a single column always works; probably it doesn't come > through this route. > I'm using Oracle for the metastore, V0.13.1
[jira] [Resolved] (SPARK-13476) Generate does not always output UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-13476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13476. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11354 [https://github.com/apache/spark/pull/11354] > Generate does not always output UnsafeRow > - > > Key: SPARK-13476 > URL: https://issues.apache.org/jira/browse/SPARK-13476 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > Generate does not output UnsafeRow when join is true
[jira] [Resolved] (SPARK-13376) Improve column pruning
[ https://issues.apache.org/jira/browse/SPARK-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13376. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11354 [https://github.com/apache/spark/pull/11354] > Improve column pruning > -- > > Key: SPARK-13376 > URL: https://issues.apache.org/jira/browse/SPARK-13376 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > Column pruning helps to skip columns that are not used by any > following operator. > The current implementation only works with a few logical plans; we should > improve it to support all of them.
[jira] [Resolved] (SPARK-13250) Make vectorized parquet reader work as the build side of a broadcast join
[ https://issues.apache.org/jira/browse/SPARK-13250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13250. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11141 [https://github.com/apache/spark/pull/11141] > Make vectorized parquet reader work as the build side of a broadcast join > - > > Key: SPARK-13250 > URL: https://issues.apache.org/jira/browse/SPARK-13250 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Nong Li > Fix For: 2.0.0 > > > The issue is that the build side requires unsafe rows in certain > optimizations, while the vectorized parquet reader explicitly does not want to > produce unsafe rows in general.
[jira] [Created] (SPARK-13476) Generate does not always output UnsafeRow
Davies Liu created SPARK-13476: -- Summary: Generate does not always output UnsafeRow Key: SPARK-13476 URL: https://issues.apache.org/jira/browse/SPARK-13476 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu Generate does not output UnsafeRow when join is true
[jira] [Resolved] (SPARK-13467) abstract python function to simplify pyspark code
[ https://issues.apache.org/jira/browse/SPARK-13467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13467. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11342 [https://github.com/apache/spark/pull/11342] > abstract python function to simplify pyspark code > - > > Key: SPARK-13467 > URL: https://issues.apache.org/jira/browse/SPARK-13467 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Trivial > Fix For: 2.0.0
[jira] [Resolved] (SPARK-13431) Maven build fails due to: Method code too large! in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-13431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13431. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11331 [https://github.com/apache/spark/pull/11331] > Maven build fails due to: Method code too large! in Catalyst > > > Key: SPARK-13431 > URL: https://issues.apache.org/jira/browse/SPARK-13431 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Stavros Kontopoulos >Priority: Blocker > Fix For: 2.0.0 > > > Cannot build the project when running the normal build commands, e.g.: > {code} > build/mvn -Phadoop-2.6 -Dhadoop.version=2.6.0 clean package > ./make-distribution.sh --name test --tgz -Phadoop-2.6 > {code} > Integration builds are also failing: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/229/console > https://ci.typesafe.com/job/mit-docker-test-zk-ref/12/console > It looks like this is the commit that introduced the issue: > https://github.com/apache/spark/commit/7925071280bfa1570435bde3e93492eaf2167d56
[jira] [Resolved] (SPARK-13376) Improve column pruning
[ https://issues.apache.org/jira/browse/SPARK-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13376. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11256 [https://github.com/apache/spark/pull/11256] > Improve column pruning > -- > > Key: SPARK-13376 > URL: https://issues.apache.org/jira/browse/SPARK-13376 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > Column pruning helps to skip columns that are not used by any > following operator. > The current implementation only works with a few logical plans; we should > improve it to support all of them.
[jira] [Created] (SPARK-13463) Support Column pruning for Dataset logical plan
Davies Liu created SPARK-13463: -- Summary: Support Column pruning for Dataset logical plan Key: SPARK-13463 URL: https://issues.apache.org/jira/browse/SPARK-13463 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Column pruning may not work with some logical plans for Dataset; we need to check that and make sure they work.
[jira] [Resolved] (SPARK-13373) Generate code for sort merge join
[ https://issues.apache.org/jira/browse/SPARK-13373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13373. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11248 [https://github.com/apache/spark/pull/11248] > Generate code for sort merge join > - > > Key: SPARK-13373 > URL: https://issues.apache.org/jira/browse/SPARK-13373 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0
[jira] [Commented] (SPARK-13431) Maven build fails due to: Method code too large! in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-13431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159682#comment-15159682 ] Davies Liu commented on SPARK-13431: https://github.com/apache/spark/pull/11331 > Maven build fails due to: Method code too large! in Catalyst > > > Key: SPARK-13431 > URL: https://issues.apache.org/jira/browse/SPARK-13431 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Stavros Kontopoulos >Priority: Blocker > > Cannot build the project when running the normal build commands, e.g.: > {code} > build/mvn -Phadoop-2.6 -Dhadoop.version=2.6.0 clean package > ./make-distribution.sh --name test --tgz -Phadoop-2.6 > {code} > Integration builds are also failing: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/229/console > https://ci.typesafe.com/job/mit-docker-test-zk-ref/12/console > It looks like this is the commit that introduced the issue: > https://github.com/apache/spark/commit/7925071280bfa1570435bde3e93492eaf2167d56
[jira] [Commented] (SPARK-13431) Maven build fails due to: Method code too large! in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-13431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159670#comment-15159670 ] Davies Liu commented on SPARK-13431: I'd like to split ExpressionParser.g, or we can't touch it anymore (it may break the sbt build next time), unless we can switch to ANTLR4 soon. > Maven build fails due to: Method code too large! in Catalyst > > > Key: SPARK-13431 > URL: https://issues.apache.org/jira/browse/SPARK-13431 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Stavros Kontopoulos >Priority: Blocker > > Cannot build the project when running the normal build commands, e.g.: > {code} > build/mvn -Phadoop-2.6 -Dhadoop.version=2.6.0 clean package > ./make-distribution.sh --name test --tgz -Phadoop-2.6 > {code} > Integration builds are also failing: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/229/console > https://ci.typesafe.com/job/mit-docker-test-zk-ref/12/console > It looks like this is the commit that introduced the issue: > https://github.com/apache/spark/commit/7925071280bfa1570435bde3e93492eaf2167d56
[jira] [Resolved] (SPARK-13329) Considering output for statistics of logical plan
[ https://issues.apache.org/jira/browse/SPARK-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13329. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11210 [https://github.com/apache/spark/pull/11210] > Considering output for statistics of logical plan > - > > Key: SPARK-13329 > URL: https://issues.apache.org/jira/browse/SPARK-13329 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > The current implementation of statistics for UnaryNode does not consider the > output (for example, Project); we should consider it to make a better guess.
[jira] [Resolved] (SPARK-13358) Retrieve grep path when doing Benchmark
[ https://issues.apache.org/jira/browse/SPARK-13358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13358. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11231 [https://github.com/apache/spark/pull/11231] > Retrieve grep path when doing Benchmark > --- > > Key: SPARK-13358 > URL: https://issues.apache.org/jira/browse/SPARK-13358 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Minor > Fix For: 2.0.0
[jira] [Resolved] (SPARK-13306) Uncorrelated scalar subquery
[ https://issues.apache.org/jira/browse/SPARK-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13306. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11190 [https://github.com/apache/spark/pull/11190] > Uncorrelated scalar subquery > > > Key: SPARK-13306 > URL: https://issues.apache.org/jira/browse/SPARK-13306 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > A scalar subquery is a subquery that generates only a single row and a single > column, and can be used as part of an expression. > An uncorrelated scalar subquery is one that does not reference any external > table.
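For illustration (the table and column names are made up, and a SQLContext named sqlContext is assumed):
{code}
// The inner query returns exactly one row and one column and references no
// column from the outer query, so it is an uncorrelated scalar subquery.
sqlContext.sql("""
  SELECT name, salary / (SELECT max(salary) FROM employees) AS relative_salary
  FROM employees
""")
{code}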
[jira] [Resolved] (SPARK-13310) Missing Sorting Columns in Generate
[ https://issues.apache.org/jira/browse/SPARK-13310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13310. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11198 [https://github.com/apache/spark/pull/11198] > Missing Sorting Columns in Generate > --- > > Key: SPARK-13310 > URL: https://issues.apache.org/jira/browse/SPARK-13310 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > Fix For: 2.0.0 > > > {code} > // case 1: missing sort columns are resolvable if join is true > sql("SELECT explode(a) AS val, b FROM data WHERE b < 2 order by val, c") > // case 2: missing sort columns are not resolvable if join is false. > Thus, issue a message in this case > sql("SELECT explode(a) AS val FROM data order by val, c") > {code}
[jira] [Created] (SPARK-13415) Visualize subquery on SQL tab
Davies Liu created SPARK-13415: -- Summary: Visualize subquery on SQL tab Key: SPARK-13415 URL: https://issues.apache.org/jira/browse/SPARK-13415 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Right now, uncorrelated scalar subqueries are not shown in the SQL tab.
[jira] [Resolved] (SPARK-13408) Exception in resultHandler will shutdown SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13408. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11280 [https://github.com/apache/spark/pull/11280] > Exception in resultHandler will shutdown SparkContext > - > > Key: SPARK-13408 > URL: https://issues.apache.org/jira/browse/SPARK-13408 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > {code} > davies@localhost:~/work/spark$ bin/spark-submit > python/pyspark/sql/dataframe.py > NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes > ahead of assembly. > 16/02/19 12:46:24 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/02/19 12:46:24 WARN Utils: Your hostname, localhost resolves to a loopback > address: 127.0.0.1; using 192.168.0.143 instead (on interface en0) > 16/02/19 12:46:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > ** > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", > line 554, in pyspark.sql.dataframe.DataFrame.alias > Failed example: > joined_df.select(col("df_as1.name"), col("df_as2.name"), > col("df_as2.age")).collect() > Differences (ndiff with -expected +actual): > - [Row(name=u'Bob', name=u'Bob', age=5), Row(name=u'Alice', > name=u'Alice', age=2)] > + [Row(name=u'Alice', name=u'Alice', age=2), Row(name=u'Bob', > name=u'Bob', age=5)] > org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:627) > at java.util.PriorityQueue.offer(PriorityQueue.java:329) > at > org.apache.spark.util.BoundedPriorityQueue.$plus$eq(BoundedPriorityQueue.scala:47) > at > org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41) > at > org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.util.BoundedPriorityQueue.foreach(BoundedPriorityQueue.scala:31) > at > org.apache.spark.util.BoundedPriorityQueue.$plus$plus$eq(BoundedPriorityQueue.scala:41) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1319) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1318) > at > 
org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:932) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:929) > at > org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:57) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1185) > ... 4 more > org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGene
[jira] [Resolved] (SPARK-13304) Broadcast join with two ints could be very slow
[ https://issues.apache.org/jira/browse/SPARK-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13304. Resolution: Fixed Assignee: Davies Liu Fix Version/s: 2.0.0 Fixed by https://github.com/apache/spark/pull/11130 > Broadcast join with two ints could be very slow > --- > > Key: SPARK-13304 > URL: https://issues.apache.org/jira/browse/SPARK-13304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > If the two join columns have the same value, their combined hash code will be (a > ^ b), which is 0, so the HashMap becomes very, very slow.
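A quick self-contained illustration of the collision (not Spark's actual hashing code):
{code}
// XOR-combining the two join columns sends every pair with a == b to bucket 0.
val keys = Seq((1, 1), (42, 42), (7, 7), (3, 5))
println(keys.map { case (a, b) => a ^ b })  // List(0, 0, 0, 6)
{code}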
[jira] [Commented] (SPARK-13409) Log the stacktrace when stopping a SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155439#comment-15155439 ] Davies Liu commented on SPARK-13409: [~rxin] I think we should remember the stacktrace and put it in the message raised when someone tries to access the stopped SparkContext; this would be much more useful than just logging it (the log is hard to find). We already do something similar when creating a SparkContext. > Log the stacktrace when stopping a SparkContext > --- > > Key: SPARK-13409 > URL: https://issues.apache.org/jira/browse/SPARK-13409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu > > Sometimes we see a stopped SparkContext and have no idea what stopped it; we > should log that for troubleshooting.
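A hedged sketch of the idea (class, field, and message names here are made up, not the eventual Spark change): capture the caller's stacktrace in stop() and replay it in the error raised on later use.
{code}
// Sketch only: names and structure are illustrative, not Spark's actual code.
class StoppableContext {
  @volatile private var stopSite: Option[Throwable] = None

  def stop(): Unit = {
    // Record where stop() was called from, for later diagnostics.
    stopSite = Some(new Exception("SparkContext was stopped here:"))
  }

  def assertNotStopped(): Unit = stopSite.foreach { site =>
    throw new IllegalStateException(
      "Cannot call methods on a stopped SparkContext. It was stopped at:\n" +
        site.getStackTrace.mkString("\n"), site)
  }
}
{code}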
[jira] [Commented] (SPARK-13213) BroadcastNestedLoopJoin is very slow
[ https://issues.apache.org/jira/browse/SPARK-13213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155438#comment-15155438 ] Davies Liu commented on SPARK-13213: It depends, I'm open to any reasonable solution. > BroadcastNestedLoopJoin is very slow > > > Key: SPARK-13213 > URL: https://issues.apache.org/jira/browse/SPARK-13213 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > Since we have improved the performance of CartesianProduct, which should be > faster and more robust than BroadcastNestedLoopJoin, we should use > CartesianProduct instead of BroadcastNestedLoopJoin, especially when the > broadcasted table is not that small. > Today we hit a query that ran for a very long time without finishing; once we > decreased the broadcast threshold (disabling BroadcastNestedLoopJoin), it > finished in seconds.
[jira] [Resolved] (SPARK-12567) Add aes_{encrypt,decrypt} UDFs
[ https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12567. Resolution: Fixed Assignee: Kai Jiang Fix Version/s: 2.0.0 > Add aes_{encrypt,decrypt} UDFs > -- > > Key: SPARK-12567 > URL: https://issues.apache.org/jira/browse/SPARK-12567 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kai Jiang >Assignee: Kai Jiang > Fix For: 2.0.0 > > > AES (Advanced Encryption Standard) algorithm. > Add aes_encrypt and aes_decrypt UDFs. > Ref: > [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions] > [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt]
[jira] [Resolved] (SPARK-12594) Outer Join Elimination by Filter Condition
[ https://issues.apache.org/jira/browse/SPARK-12594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12594. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10567 [https://github.com/apache/spark/pull/10567] > Outer Join Elimination by Filter Condition > -- > > Key: SPARK-12594 > URL: https://issues.apache.org/jira/browse/SPARK-12594 > Project: Spark > Issue Type: Improvement > Components: Optimizer, SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Priority: Critical > Fix For: 2.0.0 > > > Eliminate outer joins when the predicates in the filter condition can > restrict the result set so that all null-supplying rows are eliminated: > - full outer -> inner if both sides have such predicates > - left outer -> inner if the right side has such predicates > - right outer -> inner if the left side has such predicates > - full outer -> left outer if only the left side has such predicates > - full outer -> right outer if only the right side has such predicates > If applicable, this can greatly improve performance; an example is sketched > below.
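For illustration (tables a and b as in the sibling ticket SPARK-13354; this shows the rewrite, not the optimizer code, and assumes a SQLContext named sqlContext):
{code}
// The filter b.b > 10 rejects every null-supplying row from the right side,
// so the left outer join can be converted to an inner join:
sqlContext.sql("SELECT * FROM a LEFT OUTER JOIN b ON a.a = b.a WHERE b.b > 10")
// produces the same result as
sqlContext.sql("SELECT * FROM a JOIN b ON a.a = b.a WHERE b.b > 10")
{code}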
[jira] [Updated] (SPARK-13409) Remember the stacktrace when stopping a SparkContext
[ https://issues.apache.org/jira/browse/SPARK-13409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13409: --- Component/s: Spark Core > Remember the stacktrace when stopping a SparkContext > > > Key: SPARK-13409 > URL: https://issues.apache.org/jira/browse/SPARK-13409 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu > > Sometimes we see a stopped SparkContext and have no idea what stopped it; we > should remember that for troubleshooting.
[jira] [Created] (SPARK-13409) Remember the stacktrace when stopping a SparkContext
Davies Liu created SPARK-13409: -- Summary: Remember the stacktrace when stopping a SparkContext Key: SPARK-13409 URL: https://issues.apache.org/jira/browse/SPARK-13409 Project: Spark Issue Type: Bug Reporter: Davies Liu Sometimes we see a stopped SparkContext and have no idea what stopped it; we should remember that for troubleshooting.
[jira] [Created] (SPARK-13408) Exception in resultHandler will shutdown SparkContext
Davies Liu created SPARK-13408: -- Summary: Exception in resultHandler will shutdown SparkContext Key: SPARK-13408 URL: https://issues.apache.org/jira/browse/SPARK-13408 Project: Spark Issue Type: Bug Reporter: Davies Liu Assignee: Shixiong Zhu {code} davies@localhost:~/work/spark$ bin/spark-submit python/pyspark/sql/dataframe.py NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. 16/02/19 12:46:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/02/19 12:46:24 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 192.168.0.143 instead (on interface en0) 16/02/19 12:46:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address ** File "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 554, in pyspark.sql.dataframe.DataFrame.alias Failed example: joined_df.select(col("df_as1.name"), col("df_as2.name"), col("df_as2.age")).collect() Differences (ndiff with -expected +actual): - [Row(name=u'Bob', name=u'Bob', age=5), Row(name=u'Alice', name=u'Alice', age=2)] + [Row(name=u'Alice', name=u'Alice', age=2), Row(name=u'Bob', name=u'Bob', age=5)] org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649) at java.util.PriorityQueue.siftUp(PriorityQueue.java:627) at java.util.PriorityQueue.offer(PriorityQueue.java:329) at org.apache.spark.util.BoundedPriorityQueue.$plus$eq(BoundedPriorityQueue.scala:47) at org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41) at org.apache.spark.util.BoundedPriorityQueue$$anonfun$$plus$plus$eq$1.apply(BoundedPriorityQueue.scala:41) at scala.collection.Iterator$class.foreach(Iterator.scala:742) at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.util.BoundedPriorityQueue.foreach(BoundedPriorityQueue.scala:31) at org.apache.spark.util.BoundedPriorityQueue.$plus$plus$eq(BoundedPriorityQueue.scala:41) at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1319) at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$apply$46.apply(RDD.scala:1318) at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:932) at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:929) at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:57) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1185) ... 
4 more org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Caused by: java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:649) at java.util.PriorityQueue.siftUp(PriorityQueue.java:627) at java.util.PriorityQueue.offer(Priority
[jira] [Created] (SPARK-13406) NPE in LazilyGeneratedOrdering
Davies Liu created SPARK-13406: -- Summary: NPE in LazilyGeneratedOrdering Key: SPARK-13406 URL: https://issues.apache.org/jira/browse/SPARK-13406 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Josh Rosen {code} File "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line ?, in pyspark.sql.dataframe.DataFrameStatFunctions.sampleBy Failed example: sampled.groupBy("key").count().orderBy("key").show() Exception raised: Traceback (most recent call last): File "//anaconda/lib/python2.7/doctest.py", line 1315, in __run compileflags, 1) in test.globs File "", line 1, in sampled.groupBy("key").count().orderBy("key").show() File "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 217, in show print(self._jdf.showString(n, truncate)) File "/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py", line 835, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco return f(*a, **kw) File "/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py", line 310, in get_return_value format(target_id, ".", name), value) Py4JJavaError: An error occurred while calling o681.showString. : org.apache.spark.SparkDriverExecutionException: Execution error at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1782) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:937) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) at org.apache.spark.rdd.RDD.reduce(RDD.scala:919) at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1318) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1305) at org.apache.spark.sql.execution.TakeOrderedAndProject.executeCollect(limit.scala:94) at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:157) at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520) at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1769) at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1519) at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1526) at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1396) at 
org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1395) at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:1782) at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1395) at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1477) at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:167) at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) at py4j.Gateway.invoke(Gateway.java:290) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.Gatew
[jira] [Created] (SPARK-13404) Create the variables for input when it's used
Davies Liu created SPARK-13404: -- Summary: Create the variables for input when it's used Key: SPARK-13404 URL: https://issues.apache.org/jira/browse/SPARK-13404 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Right now, we create the variables in the first operator (usually InputAdapter); that work is wasted if most rows are filtered out immediately. We should defer creating them until they are used by the following operators, as sketched below.
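A hand-written analogue of the two codegen shapes (not actual generated code; rows are simplified to (Int, String) pairs and consume is a stand-in):
{code}
def consume(a: Int, b: String): Unit = println(s"$a,$b")

// Eager: every column variable is created before the filter.
def eager(iter: Iterator[(Int, String)]): Unit =
  iter.foreach { row =>
    val a = row._1
    val b = row._2        // wasted work if the row is filtered out just below
    if (a > 10) consume(a, b)
  }

// Deferred: a column variable is created only at its first use, after the filter.
def deferred(iter: Iterator[(Int, String)]): Unit =
  iter.foreach { row =>
    val a = row._1
    if (a > 10) {
      val b = row._2      // created only for rows that survive the filter
      consume(a, b)
    }
  }
{code}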
[jira] [Resolved] (SPARK-13237) Generate broadcast outer join
[ https://issues.apache.org/jira/browse/SPARK-13237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13237. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11130 [https://github.com/apache/spark/pull/11130] > Generate broadcast outer join > - > > Key: SPARK-13237 > URL: https://issues.apache.org/jira/browse/SPARK-13237 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0
[jira] [Created] (SPARK-13376) Improve column pruning
Davies Liu created SPARK-13376: -- Summary: Improve column pruning Key: SPARK-13376 URL: https://issues.apache.org/jira/browse/SPARK-13376 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu Column pruning helps to skip columns that are not used by any following operator. The current implementation only works with a few logical plans; we should improve it to support all of them.
[jira] [Created] (SPARK-13373) Generate code for sort merge join
Davies Liu created SPARK-13373: -- Summary: Generate code for sort merge join Key: SPARK-13373 URL: https://issues.apache.org/jira/browse/SPARK-13373 Project: Spark Issue Type: New Feature Components: SQL Reporter: Davies Liu Assignee: Davies Liu
[jira] [Closed] (SPARK-13354) Push filter throughout outer join when the condition can filter out empty row
[ https://issues.apache.org/jira/browse/SPARK-13354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-13354. -- Resolution: Duplicate > Push filter throughout outer join when the condition can filter out empty row > -- > > Key: SPARK-13354 > URL: https://issues.apache.org/jira/browse/SPARK-13354 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > For a query > {code} > select * from a left outer join b on a.a = b.a where b.b > 10 > {code} > the condition `b.b > 10` filters out every row whose b part is empty. > In this case, we should use an inner join and push the filter down into b.
[jira] [Created] (SPARK-13354) Push filter throughout outer join when the condition can filter out empty row
Davies Liu created SPARK-13354: -- Summary: Push filter throughout outer join when the condition can filter out empty row Key: SPARK-13354 URL: https://issues.apache.org/jira/browse/SPARK-13354 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu For a query {code} select * from a left outer join b on a.a = b.a where b.b > 10 {code} the condition `b.b > 10` filters out every row whose b part is empty. In this case, we should use an inner join and push the filter down into b.
[jira] [Created] (SPARK-13353) Use UnsafeRowSerializer to collect DataFrame
Davies Liu created SPARK-13353: -- Summary: Use UnsafeRowSerializer to collect DataFrame Key: SPARK-13353 URL: https://issues.apache.org/jira/browse/SPARK-13353 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu UnsafeRowSerializer should be more efficient than JavaSerializer or KryoSerializer for DataFrame.
[jira] [Created] (SPARK-13352) BlockFetch does not scale well on large block
Davies Liu created SPARK-13352: -- Summary: BlockFetch does not scale well on large block Key: SPARK-13352 URL: https://issues.apache.org/jira/browse/SPARK-13352 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu BlockManager.getRemoteBytes() performs poorly on large blocks:
{code}
test("block manager") {
  val N = 500 << 20
  val bm = sc.env.blockManager
  val blockId = TaskResultBlockId(0)
  val buffer = ByteBuffer.allocate(N)
  buffer.limit(N)
  bm.putBytes(blockId, buffer, StorageLevel.MEMORY_AND_DISK_SER)
  val result = bm.getRemoteBytes(blockId)
  assert(result.isDefined)
  assert(result.get.limit() === (N))
}
{code}
Here are the runtimes for different block sizes:
{code}
 50M    3 seconds
100M    7 seconds
250M   33 seconds
500M    2 min
{code}
[jira] [Created] (SPARK-13351) Column pruning fails on expand
Davies Liu created SPARK-13351: -- Summary: Column pruning fails on expand Key: SPARK-13351 URL: https://issues.apache.org/jira/browse/SPARK-13351 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu The optimizer can't prune the columns in Expand.
[jira] [Created] (SPARK-13348) Avoid duplicated broadcasts
Davies Liu created SPARK-13348: -- Summary: Avoid duplicated broadcasts Key: SPARK-13348 URL: https://issues.apache.org/jira/browse/SPARK-13348 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu A broadcasted table can be used multiple times in a query; we should cache it.
[jira] [Created] (SPARK-13347) Reuse the shuffle for duplicated exchange
Davies Liu created SPARK-13347: -- Summary: Reuse the shuffle for duplicated exchange Key: SPARK-13347 URL: https://issues.apache.org/jira/browse/SPARK-13347 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu In TPCDS query 47, the same exchange is used three times; we should reuse the ShuffledRowRDD to skip the duplicated stages.
{code}
with v1 as(
  select i_category, i_brand, s_store_name, s_company_name, d_year, d_moy,
    sum(ss_sales_price) sum_sales,
    avg(sum(ss_sales_price)) over
      (partition by i_category, i_brand, s_store_name, s_company_name, d_year) avg_monthly_sales,
    rank() over
      (partition by i_category, i_brand, s_store_name, s_company_name order by d_year, d_moy) rn
  from item, store_sales, date_dim, store
  where ss_item_sk = i_item_sk and
    ss_sold_date_sk = d_date_sk and
    ss_store_sk = s_store_sk and
    ( d_year = 1999 or
      ( d_year = 1999-1 and d_moy =12) or
      ( d_year = 1999+1 and d_moy =1) )
  group by i_category, i_brand, s_store_name, s_company_name, d_year, d_moy),
v2 as(
  select v1.i_category, v1.i_brand, v1.s_store_name, v1.s_company_name, v1.d_year, v1.d_moy,
    v1.avg_monthly_sales, v1.sum_sales, v1_lag.sum_sales psum, v1_lead.sum_sales nsum
  from v1, v1 v1_lag, v1 v1_lead
  where v1.i_category = v1_lag.i_category and
    v1.i_category = v1_lead.i_category and
    v1.i_brand = v1_lag.i_brand and
    v1.i_brand = v1_lead.i_brand and
    v1.s_store_name = v1_lag.s_store_name and
    v1.s_store_name = v1_lead.s_store_name and
    v1.s_company_name = v1_lag.s_company_name and
    v1.s_company_name = v1_lead.s_company_name and
    v1.rn = v1_lag.rn + 1 and
    v1.rn = v1_lead.rn - 1)
select * from v2
where d_year = 1999 and
  avg_monthly_sales > 0 and
  case when avg_monthly_sales > 0 then abs(sum_sales - avg_monthly_sales) / avg_monthly_sales else null end > 0.1
order by sum_sales - avg_monthly_sales, 3
limit 100
{code}
Since the SparkPlan is just a tree (not a DAG), we can only do this in SparkPlan.execute() or in a final rule. We also need a way to compare whether two SparkPlans produce the same result (they may have different exprIds, so we should compare them after binding). A quick experiment showed that we could get a 2X improvement on this query; a sketch of the reuse idea follows below.
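A hedged sketch of the dedup idea over a toy plan tree (these case classes are stand-ins, not Spark's SparkPlan/Exchange, and "same result" is approximated by structural equality of the rewritten child):
{code}
import scala.collection.mutable

sealed trait Plan
case class Scan(table: String) extends Plan
case class Exchange(child: Plan) extends Plan
case class Reused(original: Exchange) extends Plan
case class Join(left: Plan, right: Plan) extends Plan

// Bottom-up pass: the first occurrence of an exchange is kept; later
// occurrences with a structurally equal child become a reference to it.
def reuseExchanges(plan: Plan, seen: mutable.Map[Plan, Exchange] = mutable.Map()): Plan =
  plan match {
    case Exchange(child) =>
      val rewritten = Exchange(reuseExchanges(child, seen))
      seen.get(rewritten.child) match {
        case Some(first) => Reused(first)
        case None        => seen(rewritten.child) = rewritten; rewritten
      }
    case Join(l, r) => Join(reuseExchanges(l, seen), reuseExchanges(r, seen))
    case leaf       => leaf
  }

// Example: the same shuffle of store_sales appears on both sides of a join.
println(reuseExchanges(Join(Exchange(Scan("store_sales")), Exchange(Scan("store_sales")))))
// Join(Exchange(Scan(store_sales)),Reused(Exchange(Scan(store_sales))))
{code}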
[jira] [Created] (SPARK-13329) Consider the output for statistics of logical plan
Davies Liu created SPARK-13329: -- Summary: Consider the output for statistics of logical plan Key: SPARK-13329 URL: https://issues.apache.org/jira/browse/SPARK-13329 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu The current implementation of statistics for UnaryNode does not consider the output (for example, in Project); we should consider it to make a better guess. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
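One way to picture the intended estimate (my illustration; the row widths would come from the schema, and the helper name is hypothetical):
{code}
// Scale the child's size estimate by the ratio of this node's output row
// width to the child's row width, so a Project that drops columns is
// estimated smaller than its child.
def estimateSizeInBytes(childSizeInBytes: BigInt,
                        childRowWidthInBytes: Int,
                        outputRowWidthInBytes: Int): BigInt =
  childSizeInBytes * outputRowWidthInBytes / childRowWidthInBytes
{code}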
[jira] [Commented] (SPARK-13323) Type cast support in type inference during merging types.
[ https://issues.apache.org/jira/browse/SPARK-13323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147591#comment-15147591 ] Davies Liu commented on SPARK-13323: HiveTypeCoercion is pretty complicated; we may not want to duplicate that in Python. What's the problem right now? Or is this just because of the TODO? > Type cast support in type inference during merging types. > - > > Key: SPARK-13323 > URL: https://issues.apache.org/jira/browse/SPARK-13323 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > As described in {{types.py}}, there is a todo {{TODO: type cast (such as int > -> long)}}. > Currently, PySpark infers types but does not try to find compatible types > when the given types are different during merging schemas. > I think this can be done by resembling > {{HiveTypeCoercion.findTightestCommonTypeOfTwo}} for numbers, and when one of > the two is compared to {{StringType}}, just converting them into strings. > It looks like the possible leaf data types are below: > {code} > # Mapping Python types to Spark SQL DataType > _type_mappings = { > type(None): NullType, > bool: BooleanType, > int: LongType, > float: DoubleType, > str: StringType, > bytearray: BinaryType, > decimal.Decimal: DecimalType, > datetime.date: DateType, > datetime.datetime: TimestampType, > datetime.time: TimestampType, > } > {code} > and they are converted pretty well to string as below: > {code} > >>> print str(None) > None > >>> print str(True) > True > >>> print str(float(0.1)) > 0.1 > >>> str(bytearray([255])) > '\xff' > >>> str(decimal.Decimal()) > '0' > >>> str(datetime.date(1,1,1)) > '0001-01-01' > >>> str(datetime.datetime(1,1,1)) > '0001-01-01 00:00:00' > >>> str(datetime.time(1,1,1)) > '01:01:01' > {code} > First, I tried to find the relevant issue for this but I couldn't. Please > mark this as a duplicate if there is one already. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
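For illustration, a self-contained Scala sketch of the kind of "tightest common type" merge being discussed, loosely modeled on HiveTypeCoercion.findTightestCommonTypeOfTwo (the string type names are simplified stand-ins, not Spark's DataType classes):
{code}
// Merge two inferred leaf types into the tightest common type, widening
// numbers and falling back to string when one side is already a string.
def tightestCommonType(a: String, b: String): Option[String] = (a, b) match {
  case (x, y) if x == y                        => Some(x)
  case ("null", y)                             => Some(y)
  case (x, "null")                             => Some(x)
  case ("int", "long") | ("long", "int")       => Some("long")
  case ("int", "double") | ("double", "int")   => Some("double")
  case ("long", "double") | ("double", "long") => Some("double")
  case ("string", _) | (_, "string")           => Some("string")
  case _                                       => None
}
{code}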
[jira] [Created] (SPARK-13311) prettyString of IN is not good
Davies Liu created SPARK-13311: -- Summary: prettyString of IN is not good Key: SPARK-13311 URL: https://issues.apache.org/jira/browse/SPARK-13311 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu In(i_class,[Ljava.lang.Object;@1a575883)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
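The unreadable output is just the default toString of a Java object array. A stand-alone illustration of the failure mode and an obvious fix:
{code}
// Interpolating an Array[AnyRef] directly yields "[Ljava.lang.Object;@...",
// which is what prettyString currently prints for the IN value list.
val values = Array[AnyRef]("a", "b", "c")
println(s"In(i_class,$values)")                   // In(i_class,[Ljava.lang.Object;@...)
println(s"In(i_class,(${values.mkString(",")}))") // In(i_class,(a,b,c))
{code}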
[jira] [Commented] (SPARK-12544) Support window functions in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-12544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145581#comment-15145581 ] Davies Liu commented on SPARK-12544: We are retiring HiveContext in 2.0; we may update the docs together with that change. > Support window functions in SQLContext > -- > > Key: SPARK-12544 > URL: https://issues.apache.org/jira/browse/SPARK-12544 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Herman van Hovell > Labels: releasenotes > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13304) Broadcast join with two ints could be very slow
[ https://issues.apache.org/jira/browse/SPARK-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13304: --- Component/s: SQL > Broadcast join with two ints could be very slow > --- > > Key: SPARK-13304 > URL: https://issues.apache.org/jira/browse/SPARK-13304 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu > > If the two join columns have the same value, their combined hash code will be > (a ^ b), which is 0, and then the HashMap will be very, very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13306) Uncorrelated scalar subquery
[ https://issues.apache.org/jira/browse/SPARK-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13306: --- Component/s: SQL > Uncorrelated scalar subquery > > > Key: SPARK-13306 > URL: https://issues.apache.org/jira/browse/SPARK-13306 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > A scalar subquery is a subquery that generates only a single row and a single > column, and can be used as part of an expression. > An uncorrelated scalar subquery is one that does not reference any table from > the outer query. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13306) Uncorrelated scalar subquery
Davies Liu created SPARK-13306: -- Summary: Uncorrelated scalar subquery Key: SPARK-13306 URL: https://issues.apache.org/jira/browse/SPARK-13306 Project: Spark Issue Type: New Feature Reporter: Davies Liu Assignee: Davies Liu A scalar subquery is a subquery that generates only a single row and a single column, and can be used as part of an expression. An uncorrelated scalar subquery is one that does not reference any table from the outer query. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
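For illustration, a query containing an uncorrelated scalar subquery (the table and column names t1/t2/a/b are hypothetical):
{code}
// The inner query returns exactly one row and one column and does not
// reference t1, so it can be evaluated once and treated as a constant.
sqlContext.sql("SELECT a + (SELECT max(b) FROM t2) FROM t1")
{code}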
[jira] [Commented] (SPARK-12544) Support window functions in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-12544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145442#comment-15145442 ] Davies Liu commented on SPARK-12544: [~hvanhovell] Do window functions still require HiveContext? Or should we update the docs/comments for window functions? > Support window functions in SQLContext > -- > > Key: SPARK-12544 > URL: https://issues.apache.org/jira/browse/SPARK-12544 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Herman van Hovell > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12544) Support window functions in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-12544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12544. Resolution: Fixed Assignee: Herman van Hovell Fix Version/s: 2.0.0 Since we updated the parser, window functions now work in SQLContext. > Support window functions in SQLContext > -- > > Key: SPARK-12544 > URL: https://issues.apache.org/jira/browse/SPARK-12544 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Herman van Hovell > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13304) Broadcast join with two ints could be very slow
Davies Liu created SPARK-13304: -- Summary: Broadcast join with two ints could be very slow Key: SPARK-13304 URL: https://issues.apache.org/jira/browse/SPARK-13304 Project: Spark Issue Type: Bug Reporter: Davies Liu If the two join columns have the same value, their combined hash code will be (a ^ b), which is 0, and then the HashMap will be very, very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
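A self-contained illustration of the degenerate case (the multiplicative combine shown as an alternative is one common fix, not necessarily the one Spark adopted):
{code}
// XOR-combining two equal key columns hashes every such row to 0, so
// they all collide in a single bucket of the hash map.
val keys = Seq((1, 1), (42, 42), (123456, 123456))
keys.foreach { case (a, b) => println(a ^ b) }  // prints 0 for every pair
// Combining with multiplication (as java.util.Arrays.hashCode does)
// keeps equal pairs spread across buckets:
keys.foreach { case (a, b) => println(31 * (31 + a) + b) }  // distinct values
{code}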
[jira] [Resolved] (SPARK-12962) PySpark support covar_samp and covar_pop
[ https://issues.apache.org/jira/browse/SPARK-12962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12962. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10876 [https://github.com/apache/spark/pull/10876] > PySpark support covar_samp and covar_pop > > > Key: SPARK-12962 > URL: https://issues.apache.org/jira/browse/SPARK-12962 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > > PySpark support covar_samp and covar_pop -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12705) Sorting column can't be resolved if it's not in projection
[ https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12705. Resolution: Fixed Issue resolved by pull request 11153 [https://github.com/apache/spark/pull/11153] > Sorting column can't be resolved if it's not in projection > -- > > Key: SPARK-12705 > URL: https://issues.apache.org/jira/browse/SPARK-12705 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > The following query can't be resolved: > {code} > scala> sqlContext.sql("select sum(a) over () from (select 1 as a, 2 as b) t > order by b").explain() > org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input > columns: [_c0]; line 1 pos 63 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12949) Support common expression elimination
[ https://issues.apache.org/jira/browse/SPARK-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143787#comment-15143787 ] Davies Liu commented on SPARK-12949: After some prototyping, enabling common expression elimination gave a 10+% improvement on stddev, but a 50% regression on kurtosis. I have not figured out why; maybe the JIT can already eliminate the common expressions (given that kurtosis is only 20% slower than stddev)? If so, we may not want to do this. > Support common expression elimination > - > > Key: SPARK-12949 > URL: https://issues.apache.org/jira/browse/SPARK-12949 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
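For readers unfamiliar with the term, a tiny illustration of what common expression elimination does (my example, not Spark's actual generated code):
{code}
// Before: the shared subexpression (a + b) is evaluated twice per call.
def before(a: Double, b: Double): Double = (a + b) * (a + b)

// After: it is evaluated once and reused; a JIT compiler may well perform
// this transformation on its own, which would explain the mixed results.
def after(a: Double, b: Double): Double = { val t = a + b; t * t }
{code}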
[jira] [Created] (SPARK-13293) Generate code for Expand
Davies Liu created SPARK-13293: -- Summary: Generate code for Expand Key: SPARK-13293 URL: https://issues.apache.org/jira/browse/SPARK-13293 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12915) SQL metrics for generated operators
[ https://issues.apache.org/jira/browse/SPARK-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-12915: -- Assignee: Davies Liu > SQL metrics for generated operators > --- > > Key: SPARK-12915 > URL: https://issues.apache.org/jira/browse/SPARK-12915 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > The metrics should be very efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13234) Remove duplicated SQL metrics
[ https://issues.apache.org/jira/browse/SPARK-13234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13234. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11163 [https://github.com/apache/spark/pull/11163] > Remove duplicated SQL metrics > - > > Key: SPARK-13234 > URL: https://issues.apache.org/jira/browse/SPARK-13234 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > Fix For: 2.0.0 > > > For lots of SQL operators, we have metrics for both input and output; since > the number of input rows should be exactly the number of output rows of the > child, we could keep only the metrics for output rows. > Now that we have improved performance using whole-stage codegen, the overhead > of SQL metrics is not trivial anymore, so we should avoid it when it's not > necessary. > Some operators do not have SQL metrics; we should add them. > For operators that have the same number of input and output rows (for > example, Projection), we may not need both. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12706) support grouping/grouping_id function together group set
[ https://issues.apache.org/jira/browse/SPARK-12706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12706. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10677 [https://github.com/apache/spark/pull/10677] > support grouping/grouping_id function together group set > > > Key: SPARK-12706 > URL: https://issues.apache.org/jira/browse/SPARK-12706 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup#EnhancedAggregation,Cube,GroupingandRollup-Grouping__IDfunction > http://etutorials.org/SQL/Mastering+Oracle+SQL/Chapter+13.+Advanced+Group+Operations/13.3+The+GROUPING_ID+and+GROUP_ID+Functions/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13174) Add API and options for csv data sources
[ https://issues.apache.org/jira/browse/SPARK-13174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15141438#comment-15141438 ] Davies Liu commented on SPARK-13174: [~GayathriMurali] Yes, there is a way, but it's not as good as the other built-in data sources (like parquet, json, jdbc). > Add API and options for csv data sources > > > Key: SPARK-13174 > URL: https://issues.apache.org/jira/browse/SPARK-13174 > Project: Spark > Issue Type: New Feature > Components: Input/Output >Reporter: Davies Liu > > We should have an API to load the csv data source (with some options as > arguments), similar to json() and jdbc() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-12705) Sorting column can't be resolved if it's not in projection
[ https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reopened SPARK-12705: Assignee: Davies Liu (was: Xiao Li) Q98 still can't be analyzed; I will send a PR to fix that. > Sorting column can't be resolved if it's not in projection > -- > > Key: SPARK-12705 > URL: https://issues.apache.org/jira/browse/SPARK-12705 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > The following query can't be resolved: > {code} > scala> sqlContext.sql("select sum(a) over () from (select 1 as a, 2 as b) t > order by b").explain() > org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input > columns: [_c0]; line 1 pos 63 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12950) Improve performance of BytesToBytesMap
[ https://issues.apache.org/jira/browse/SPARK-12950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12950. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11010 [https://github.com/apache/spark/pull/11010] > Improve performance of BytesToBytesMap > -- > > Key: SPARK-12950 > URL: https://issues.apache.org/jira/browse/SPARK-12950 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > Fix For: 2.0.0 > > > When benchmarking generated aggregates with grouping keys, profiling showed > that lookups in BytesToBytesMap took about 90% of the CPU time; we should > optimize them. > After profiling with jvisualvm, here are the things that take most of the > time: > 1. decode address from Long to baseObject and offset > 2. calculate hash code > 3. compare the bytes (equality check) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
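As one illustration of item 3, a sketch of comparing bytes a word at a time instead of byte-by-byte (my simplification over heap byte arrays; the real code works on raw memory addresses):
{code}
import java.nio.{ByteBuffer, ByteOrder}

// Compare 8 bytes per iteration, falling back to single bytes for the tail.
def wordWiseEquals(a: Array[Byte], b: Array[Byte]): Boolean = {
  if (a.length != b.length) return false
  val ba = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN)
  val bb = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN)
  var i = 0
  while (i + 8 <= a.length) {
    if (ba.getLong(i) != bb.getLong(i)) return false
    i += 8
  }
  while (i < a.length) {
    if (a(i) != b(i)) return false
    i += 1
  }
  true
}
{code}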
[jira] [Created] (SPARK-13249) Filter null keys for inner join
Davies Liu created SPARK-13249: -- Summary: Filter null keys for inner join Key: SPARK-13249 URL: https://issues.apache.org/jira/browse/SPARK-13249 Project: Spark Issue Type: Improvement Reporter: Davies Liu For an inner join, join keys containing null will never match, so we could insert a Filter below the inner join (which could then be pushed down); then we don't need to check the nullability of the keys while joining. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
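Schematically, with the public DataFrame API (df1/df2 and the key column "k" are hypothetical; the proposal would apply this rewrite automatically as an optimizer rule):
{code}
// Rows with a null key can never satisfy an inner equi-join, so the
// IsNotNull filters below the join are safe and can be pushed into the scans.
val joined = df1.filter(df1("k").isNotNull)
  .join(df2.filter(df2("k").isNotNull), df1("k") === df2("k"))
{code}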
[jira] [Updated] (SPARK-13173) Fail to load CSV file with NPE
[ https://issues.apache.org/jira/browse/SPARK-13173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13173: --- Description: {code} id|end_date|start_date|location 1|2015-10-14 00:00:00|2015-09-14 00:00:00|CA-SF 2|2015-10-15 01:00:20|2015-08-14 00:00:00|CA-SD 3|2015-10-16 02:30:00|2015-01-14 00:00:00|NY-NY 4|2015-10-17 03:00:20|2015-02-14 00:00:00|NY-NY 5|2015-10-18 04:30:00|2014-04-14 00:00:00|CA-SD {code} {code} adult_df = sqlContext.read.\ format("org.apache.spark.sql.execution.datasources.csv").\ option("header", "true").option("delimiter", "|").\ option("inferSchema", "true").load("/tmp/dataframe_sample.csv") {code} {code} Py4JJavaError: An error occurred while calling o239.load. : java.lang.NullPointerException at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114) at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114) at scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:93) at scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:108) at org.apache.spark.sql.execution.datasources.csv.CSVRelation.inferSchema(CSVRelation.scala:137) at org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema$lzycompute(CSVRelation.scala:50) at org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema(CSVRelation.scala:48) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:666) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:665) at org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:39) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:115) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:136) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) at py4j.Gateway.invoke(Gateway.java:290) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:209) at java.lang.Thread.run(Thread.java:745) {code} was: {code} id|end_date|start_date|location 1|2015-10-14 00:00:00|2015-09-14 00:00:00|CA-SF 2|2015-10-15 01:00:20|2015-08-14 00:00:00|CA-SD 3|2015-10-16 02:30:00|2015-01-14 00:00:00|NY-NY 4|2015-10-17 03:00:20|2015-02-14 00:00:00|NY-NY 5|2015-10-18 04:30:00|2014-04-14 00:00:00|CA-SD {code} {code} adult_df = sqlContext.read.\ format("org.apache.spark.sql.execution.datasources.csv").\ option("header", "false").option("delimiter", "|").\ option("inferSchema", "true").load("/tmp/dataframe_sample.csv") {code} {code} Py4JJavaError: An error occurred while calling o239.load. 
: java.lang.NullPointerException at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114) at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114) at scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:93) at scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:108) at org.apache.spark.sql.execution.datasources.csv.CSVRelation.inferSchema(CSVRelation.scala:137) at org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema$lzycompute(CSVRelation.scala:50) at org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema(CSVRelation.scala:48) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:666) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:665) at org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:39) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:115) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:136) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) at py4j.Gateway.invoke(Gateway.java:290)
[jira] [Closed] (SPARK-13173) Fail to load CSV file with NPE
[ https://issues.apache.org/jira/browse/SPARK-13173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-13173. -- Resolution: Cannot Reproduce > Fail to load CSV file with NPE > -- > > Key: SPARK-13173 > URL: https://issues.apache.org/jira/browse/SPARK-13173 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Reporter: Davies Liu > > {code} > id|end_date|start_date|location > 1|2015-10-14 00:00:00|2015-09-14 00:00:00|CA-SF > 2|2015-10-15 01:00:20|2015-08-14 00:00:00|CA-SD > 3|2015-10-16 02:30:00|2015-01-14 00:00:00|NY-NY > 4|2015-10-17 03:00:20|2015-02-14 00:00:00|NY-NY > 5|2015-10-18 04:30:00|2014-04-14 00:00:00|CA-SD > {code} > {code} > adult_df = sqlContext.read.\ > format("org.apache.spark.sql.execution.datasources.csv").\ > option("header", "false").option("delimiter", "|").\ > option("inferSchema", "true").load("/tmp/dataframe_sample.csv") > {code} > {code} > Py4JJavaError: An error occurred while calling o239.load. > : java.lang.NullPointerException > at > scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114) > at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114) > at > scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:93) > at > scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:108) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation.inferSchema(CSVRelation.scala:137) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema$lzycompute(CSVRelation.scala:50) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema(CSVRelation.scala:48) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:666) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:665) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:39) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:115) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:136) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) > at py4j.Gateway.invoke(Gateway.java:290) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:209) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12992) Vectorize parquet decoding using ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-12992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12992. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11055 [https://github.com/apache/spark/pull/11055] > Vectorize parquet decoding using ColumnarBatch > -- > > Key: SPARK-12992 > URL: https://issues.apache.org/jira/browse/SPARK-12992 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li >Assignee: Apache Spark > Fix For: 2.0.0 > > > Parquet files benefit from vectorized decoding. ColumnarBatches have been > designed to support this. This means that a single encoded parquet column is > decoded to a single ColumnVector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13237) Generate broadcast outer join
Davies Liu created SPARK-13237: -- Summary: Generate broadcast outer join Key: SPARK-13237 URL: https://issues.apache.org/jira/browse/SPARK-13237 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13095) improve performance of hash join with dimension table
[ https://issues.apache.org/jira/browse/SPARK-13095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13095. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11065 [https://github.com/apache/spark/pull/11065] > improve performance of hash join with dimension table > - > > Key: SPARK-13095 > URL: https://issues.apache.org/jira/browse/SPARK-13095 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > The join key is usually an integer or a long (a unique primary key); we could > have a specialized HashRelation for those. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
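A minimal sketch of the specialization (hypothetical names; the point is that a map keyed by a primitive avoids hashing and boxing a generic row key):
{code}
import scala.collection.mutable

// When the join key is a single int/long, index the build side directly
// by the primitive key instead of by a generic key row.
class LongKeyedRelation[T] {
  private val map = mutable.LongMap.empty[T]
  def put(key: Long, row: T): Unit = map.update(key, row)
  def get(key: Long): Option[T] = map.get(key)
}
{code}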
[jira] [Updated] (SPARK-13215) Remove fallback in codegen
[ https://issues.apache.org/jira/browse/SPARK-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13215: --- Component/s: SQL > Remove fallback in codegen > -- > > Key: SPARK-13215 > URL: https://issues.apache.org/jira/browse/SPARK-13215 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > In newMutableProjection, we fall back to InterpretedMutableProjection if > compilation fails. > Since we removed the configuration for codegen, we rely heavily on codegen > (also, TungstenAggregate requires the generated MutableProjection to update > UnsafeRow), so we should remove the fallback, which could confuse users; see > the discussion in SPARK-13116. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12840) Support passing arbitrary objects (not just expressions) into code generated classes
[ https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12840: --- Component/s: SQL > Support passing arbitrary objects (not just expressions) into code generated > classes > > > Key: SPARK-12840 > URL: https://issues.apache.org/jira/browse/SPARK-12840 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > As of now, our code generator only allows passing Expression objects into the > generated class as arguments. In order to support whole-stage codegen (e.g. > for broadcast joins), the generated classes need to accept other types of > objects such as hash tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12585) The numFields of UnsafeRow should not be changed by pointTo()
[ https://issues.apache.org/jira/browse/SPARK-12585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12585: --- Component/s: SQL > The numFields of UnsafeRow should not be changed by pointTo() > -- > > Key: SPARK-12585 > URL: https://issues.apache.org/jira/browse/SPARK-12585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > Fix For: 2.0.0 > > > Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes > is calculated, making pointTo() a little bit heavy. > It should be part of the constructor of UnsafeRow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13213) BroadcastNestedLoopJoin is very slow
[ https://issues.apache.org/jira/browse/SPARK-13213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137551#comment-15137551 ] Davies Liu commented on SPARK-13213: [~sowen] Thanks very much for updating these; I try to remember to add that, but may still miss it sometimes. Can we mark the field as required (or remember the last choice as the default value)? > BroadcastNestedLoopJoin is very slow > > > Key: SPARK-13213 > URL: https://issues.apache.org/jira/browse/SPARK-13213 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > Since we have improved the performance of CartesianProduct, which should be > faster and more robust than BroadcastNestedLoopJoin, we should do a > CartesianProduct instead of a BroadcastNestedLoopJoin, especially when the > broadcasted table is not that small. > Today, we hit a query that ran for a very long time without finishing; once > we decreased the threshold for broadcast (disabling BroadcastNestedLoopJoin), > it finished in seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8964) Use Exchange in limit operations (per partition limit -> exchange to one partition -> per partition limit)
[ https://issues.apache.org/jira/browse/SPARK-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8964. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 7334 [https://github.com/apache/spark/pull/7334] > Use Exchange in limit operations (per partition limit -> exchange to one > partition -> per partition limit) > -- > > Key: SPARK-8964 > URL: https://issues.apache.org/jira/browse/SPARK-8964 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen > Fix For: 2.0.0 > > > Spark SQL's physical Limit operator currently performs its own shuffle rather > than using Exchange to perform the shuffling. This is less efficient since > this non-exchange shuffle path won't be able to benefit from SQL-specific > shuffling optimizations, such as SQLSerializer2. It also involves additional > unnecessary row copying. > Instead, I think that we should rewrite Limit to expand into three physical > operators: > PerPartitionLimit -> Exchange to one partition -> PerPartitionLimit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13234) Remove duplicated SQL metrics
Davies Liu created SPARK-13234: -- Summary: Remove duplicated SQL metrics Key: SPARK-13234 URL: https://issues.apache.org/jira/browse/SPARK-13234 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu For lots of SQL operators, we have metrics for both input and output; since the number of input rows should be exactly the number of output rows of the child, we could keep only the metrics for output rows. Now that we have improved performance using whole-stage codegen, the overhead of SQL metrics is not trivial anymore, so we should avoid it when it's not necessary. Some operators do not have SQL metrics; we should add them. For operators that have the same number of input and output rows (for example, Projection), we may not need both. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13215) Remove fallback in codegen
[ https://issues.apache.org/jira/browse/SPARK-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13215. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11097 [https://github.com/apache/spark/pull/11097] > Remove fallback in codegen > -- > > Key: SPARK-13215 > URL: https://issues.apache.org/jira/browse/SPARK-13215 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu > Fix For: 2.0.0 > > > In newMutableProjection, we fall back to InterpretedMutableProjection if > compilation fails. > Since we removed the configuration for codegen, we rely heavily on codegen > (also, TungstenAggregate requires the generated MutableProjection to update > UnsafeRow), so we should remove the fallback, which could confuse users; see > the discussion in SPARK-13116. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-13217) UI of visualization of stage
[ https://issues.apache.org/jira/browse/SPARK-13217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-13217. -- Resolution: Not A Problem It works after restarting the driver. > UI of visualization of stage > - > > Key: SPARK-13217 > URL: https://issues.apache.org/jira/browse/SPARK-13217 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Davies Liu > > http://localhost:4041/stages/stage/?id=36&attempt=0&expandDagViz=true > {code} > HTTP ERROR 500 > Problem accessing /stages/stage/. Reason: > Server Error > Caused by: > java.lang.IncompatibleClassChangeError: vtable stub > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:102) > at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79) > at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79) > at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:79) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:735) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:848) > at > org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) > at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) > at > org.eclipse.jetty.server.handler.GzipHandler.handle(GzipHandler.java:264) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) > at org.eclipse.jetty.server.Server.handle(Server.java:370) > at > org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) > at > org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) > at > org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) > at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) > at > org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) > at > org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) > at > org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) > at java.lang.Thread.run(Thread.java:745) > Powered by Jetty:// > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13217) UI of visualization of stage
Davies Liu created SPARK-13217: -- Summary: UI of visualization of stage Key: SPARK-13217 URL: https://issues.apache.org/jira/browse/SPARK-13217 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.0 Reporter: Davies Liu http://localhost:4041/stages/stage/?id=36&attempt=0&expandDagViz=true {code} HTTP ERROR 500 Problem accessing /stages/stage/. Reason: Server Error Caused by: java.lang.IncompatibleClassChangeError: vtable stub at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:102) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79) at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:79) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:79) at javax.servlet.http.HttpServlet.service(HttpServlet.java:735) at javax.servlet.http.HttpServlet.service(HttpServlet.java:848) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.GzipHandler.handle(GzipHandler.java:264) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:745) Powered by Jetty:// {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13215) Remove fallback in codegen
Davies Liu created SPARK-13215: -- Summary: Remove fallback in codegen Key: SPARK-13215 URL: https://issues.apache.org/jira/browse/SPARK-13215 Project: Spark Issue Type: Improvement Reporter: Davies Liu In newMutableProjection, we fall back to InterpretedMutableProjection if compilation fails. Since we removed the configuration for codegen, we rely heavily on codegen (also, TungstenAggregate requires the generated MutableProjection to update UnsafeRow), so we should remove the fallback, which could confuse users; see the discussion in SPARK-13116. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13116) TungstenAggregate, though it is supposedly capable of processing all unsafe & safe rows, fails if the input is safe rows
[ https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134979#comment-15134979 ] Davies Liu commented on SPARK-13116: I tried the test you attached; it works well on master. > TungstenAggregate, though it is supposedly capable of processing all unsafe & > safe rows, fails if the input is safe rows > --- > > Key: SPARK-13116 > URL: https://issues.apache.org/jira/browse/SPARK-13116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Asif Hussain Shahid > Attachments: SPARK_13116_Test.scala > > > TungstenAggregate, though it is supposedly capable of processing all unsafe & > safe rows, fails if the input is safe rows. > If the input to TungstenAggregateIterator is a SafeRow, while the target is > an UnsafeRow, the current code will try to set the fields in the UnsafeRow > using the update method in UnsafeRow. > This method is called via TungstenAggregateIterator on the > InterpretedMutableProjection. The target row in the > InterpretedMutableProjection is an UnsafeRow, while the current row is a > SafeRow. > In the InterpretedMutableProjection's apply method, it invokes > mutableRow(i) = exprArray(i).eval(input) > Now for UnsafeRow, the update method throws UnsupportedOperationException. > The proposed fix I did for our forked branch, on the class > InterpretedProjection is: > + private var targetUnsafe = false > + type UnsafeSetter = (UnsafeRow, Any ) => Unit > + private var setters : Array[UnsafeSetter] = _ > private[this] val exprArray = expressions.toArray > private[this] var mutableRow: MutableRow = new > GenericMutableRow(exprArray.length) > def currentValue: InternalRow = mutableRow > > + > override def target(row: MutableRow): MutableProjection = { > mutableRow = row > +targetUnsafe = row match { > + case _:UnsafeRow =>{ > +if(setters == null) { > + setters = Array.ofDim[UnsafeSetter](exprArray.length) > + for(i <- 0 until exprArray.length) { > +setters(i) = exprArray(i).dataType match { > + case IntegerType => (target: UnsafeRow, value: Any ) => > +target.setInt(i,value.asInstanceOf[Int]) > + case LongType => (target: UnsafeRow, value: Any ) => > +target.setLong(i,value.asInstanceOf[Long]) > + case DoubleType => (target: UnsafeRow, value: Any ) => > +target.setDouble(i,value.asInstanceOf[Double]) > + case FloatType => (target: UnsafeRow, value: Any ) => > +target.setFloat(i,value.asInstanceOf[Float]) > + > + case NullType => (target: UnsafeRow, value: Any ) => > +target.setNullAt(i) > + > + case BooleanType => (target: UnsafeRow, value: Any ) => > +target.setBoolean(i,value.asInstanceOf[Boolean]) > + > + case ByteType => (target: UnsafeRow, value: Any ) => > +target.setByte(i,value.asInstanceOf[Byte]) > + case ShortType => (target: UnsafeRow, value: Any ) => > +target.setShort(i,value.asInstanceOf[Short]) > + > +} > + } > +} > +true > + } > + case _ => false > +} > + > this > } > > override def apply(input: InternalRow): InternalRow = { > var i = 0 > while (i < exprArray.length) { > - mutableRow(i) = exprArray(i).eval(input) > + if(targetUnsafe) { > +setters(i)(mutableRow.asInstanceOf[UnsafeRow], > exprArray(i).eval(input)) > + }else { > +mutableRow(i) = exprArray(i).eval(input) > + } > i += 1 > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13116) TungstenAggregate, though it is supposedly capable of processing all unsafe & safe rows, fails if the input is safe rows
[ https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134959#comment-15134959 ] Davies Liu commented on SPARK-13116: Which UnaryExpression generates incorrect code? I think we may just remove the fallback. > TungstenAggregate, though it is supposedly capable of processing all unsafe & > safe rows, fails if the input is safe rows > --- > > Key: SPARK-13116 > URL: https://issues.apache.org/jira/browse/SPARK-13116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Asif Hussain Shahid > Attachments: SPARK_13116_Test.scala > > > TungstenAggregate, though it is supposedly capable of processing all unsafe & > safe rows, fails if the input is safe rows. > If the input to TungstenAggregateIterator is a SafeRow, while the target is > an UnsafeRow, the current code will try to set the fields in the UnsafeRow > using the update method in UnsafeRow. > This method is called via TungstenAggregateIterator on the > InterpretedMutableProjection. The target row in the > InterpretedMutableProjection is an UnsafeRow, while the current row is a > SafeRow. > In the InterpretedMutableProjection's apply method, it invokes > mutableRow(i) = exprArray(i).eval(input) > Now for UnsafeRow, the update method throws UnsupportedOperationException. > The proposed fix I did for our forked branch, on the class > InterpretedProjection is: > + private var targetUnsafe = false > + type UnsafeSetter = (UnsafeRow, Any ) => Unit > + private var setters : Array[UnsafeSetter] = _ > private[this] val exprArray = expressions.toArray > private[this] var mutableRow: MutableRow = new > GenericMutableRow(exprArray.length) > def currentValue: InternalRow = mutableRow > > + > override def target(row: MutableRow): MutableProjection = { > mutableRow = row > +targetUnsafe = row match { > + case _:UnsafeRow =>{ > +if(setters == null) { > + setters = Array.ofDim[UnsafeSetter](exprArray.length) > + for(i <- 0 until exprArray.length) { > +setters(i) = exprArray(i).dataType match { > + case IntegerType => (target: UnsafeRow, value: Any ) => > +target.setInt(i,value.asInstanceOf[Int]) > + case LongType => (target: UnsafeRow, value: Any ) => > +target.setLong(i,value.asInstanceOf[Long]) > + case DoubleType => (target: UnsafeRow, value: Any ) => > +target.setDouble(i,value.asInstanceOf[Double]) > + case FloatType => (target: UnsafeRow, value: Any ) => > +target.setFloat(i,value.asInstanceOf[Float]) > + > + case NullType => (target: UnsafeRow, value: Any ) => > +target.setNullAt(i) > + > + case BooleanType => (target: UnsafeRow, value: Any ) => > +target.setBoolean(i,value.asInstanceOf[Boolean]) > + > + case ByteType => (target: UnsafeRow, value: Any ) => > +target.setByte(i,value.asInstanceOf[Byte]) > + case ShortType => (target: UnsafeRow, value: Any ) => > +target.setShort(i,value.asInstanceOf[Short]) > + > +} > + } > +} > +true > + } > + case _ => false > +} > + > this > } > > override def apply(input: InternalRow): InternalRow = { > var i = 0 > while (i < exprArray.length) { > - mutableRow(i) = exprArray(i).eval(input) > + if(targetUnsafe) { > +setters(i)(mutableRow.asInstanceOf[UnsafeRow], > exprArray(i).eval(input)) > + }else { > +mutableRow(i) = exprArray(i).eval(input) > + } > i += 1 > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13213) BroadcastNestedLoopJoin is very slow
Davies Liu created SPARK-13213: -- Summary: BroadcastNestedLoopJoin is very slow Key: SPARK-13213 URL: https://issues.apache.org/jira/browse/SPARK-13213 Project: Spark Issue Type: Improvement Reporter: Davies Liu Since we have improved the performance of CartesianProduct, which should be faster and more robust than BroadcastNestedLoopJoin, we should do a CartesianProduct instead of a BroadcastNestedLoopJoin, especially when the broadcasted table is not that small. Today, we hit a query that ran for a very long time without finishing; once we decreased the threshold for broadcast (disabling BroadcastNestedLoopJoin), it finished in seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
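For reference, the workaround described above can be applied through the broadcast threshold setting (a real Spark SQL config; -1 disables broadcast joins entirely):
{code}
// Force the planner away from broadcast-based joins for this session.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}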
[jira] [Assigned] (SPARK-13210) NPE in Sort
[ https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-13210: -- Assignee: Davies Liu > NPE in Sort > --- > > Key: SPARK-13210 > URL: https://issues.apache.org/jira/browse/SPARK-13210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > When running TPCDS query Q78 with scale 10: > {code} > 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = > 268435456 bytes, TID = 143 > 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID > 143) > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45) > at org.apache.spark.scheduler.Task.run(Task.scala:81) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13210) NPE in Sort
Davies Liu created SPARK-13210: -- Summary: NPE in Sort Key: SPARK-13210 URL: https://issues.apache.org/jira/browse/SPARK-13210 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Davies Liu Priority: Critical When running TPCDS query Q78 at scale factor 10: {code} 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = 268435456 bytes, TID = 143 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 143) java.lang.NullPointerException at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39) at org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45) at org.apache.spark.scheduler.Task.run(Task.scala:81) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
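For context, here is a minimal sketch of how that getPage lookup can turn into an NPE (the class and names below are illustrative, not Spark's actual TaskMemoryManager): record pointers pack a page number into their upper bits, and the sort comparator re-resolves the page on every comparison, so a page-table slot that has already been nulled out (for example by an early free during a spill, which would also fit the "managed memory leak" message above) gets dereferenced deep inside TimSort.
{code}
// Illustrative sketch only; not the actual Spark internals.
class PageTableSketch(slots: Int) {
  // One slot per allocated memory page; freed pages leave a null behind.
  private val pageTable = new Array[Array[Byte]](slots)

  // Pack a 13-bit page number and a 51-bit offset into one Long pointer.
  def encode(pageNumber: Int, offset: Long): Long =
    (pageNumber.toLong << 51) | offset

  def getPage(recordPointer: Long): Array[Byte] = {
    val pageNumber = (recordPointer >>> 51).toInt
    val page = pageTable(pageNumber)
    // Without this check, a freed page surfaces as an NPE in the comparator.
    require(page != null, s"page $pageNumber was already freed")
    page
  }
}
{code}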
[jira] [Resolved] (SPARK-13113) Remove unnecessary bit operation when decoding page number
[ https://issues.apache.org/jira/browse/SPARK-13113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13113. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11002 [https://github.com/apache/spark/pull/11002] > Remove unnecessary bit operation when decoding page number > -- > > Key: SPARK-13113 > URL: https://issues.apache.org/jira/browse/SPARK-13113 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Liang-Chi Hsieh >Priority: Minor > Fix For: 2.0.0 > > > Since we shift the bits right when decoding the page number, the bitwise AND > operation looks unnecessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
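For readers unfamiliar with the encoding, a sketch of the redundancy being removed (the constants follow TaskMemoryManager's 13-bit page number / 51-bit offset layout, but the code itself is illustrative): an unsigned right shift by 51 already discards the lower 51 bits, so masking out everything but the upper 13 bits first is a no-op.
{code}
// Illustrative sketch; not the actual TaskMemoryManager source.
object PageAddressSketch {
  val OFFSET_BITS = 51                          // 64 - 13 bits of page number
  val MASK_UPPER_13_BITS = 0xFFF8000000000000L  // top 13 bits set

  // Before: mask the upper 13 bits, then shift them down.
  def decodePageNumberOld(address: Long): Int =
    ((address & MASK_UPPER_13_BITS) >>> OFFSET_BITS).toInt

  // After: >>> zero-fills from the left, so the mask is redundant.
  def decodePageNumberNew(address: Long): Int =
    (address >>> OFFSET_BITS).toInt
}
{code}
Both variants return the same page number for every possible 64-bit address, which is why the AND can be dropped without changing behavior.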
[jira] [Resolved] (SPARK-13131) Use best time and average time in micro benchmark
[ https://issues.apache.org/jira/browse/SPARK-13131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13131. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11018 [https://github.com/apache/spark/pull/11018] > Use best time and average time in micro benchmark > -- > > Key: SPARK-13131 > URL: https://issues.apache.org/jira/browse/SPARK-13131 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > Best time should be more stable than average time in benchmarks; reported > together with average time, the two convey more information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13131) Use best time and average time in micro benchmark
[ https://issues.apache.org/jira/browse/SPARK-13131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131413#comment-15131413 ] Davies Liu commented on SPARK-13131: [~srowen] Fully agreed with you; that was my first thought. Some people are concerned that best time alone loses information: some algorithms or implementations have larger variance than others, for example because they generate lots of garbage without triggering a GC in every run. So we also include the mean time for reference. > Use best time and average time in micro benchmark > -- > > Key: SPARK-13131 > URL: https://issues.apache.org/jira/browse/SPARK-13131 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Best time should be more stable than average time in benchmarks; reported > together with average time, the two convey more information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
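A minimal sketch of a harness that reports both numbers (illustrative, not Spark's Benchmark utility): best time is the least noisy estimate of the achievable cost, while the gap between best and average exposes the run-to-run variance described above, such as GC pauses.
{code}
// Run the body several times; report best and average wall-clock time in ms.
def measure(iters: Int)(body: => Unit): (Double, Double) = {
  val times = (1 to iters).map { _ =>
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1e6
  }
  (times.min, times.sum / times.size)
}

val (best, avg) = measure(5) { (1 to 1000000).sum }
println(f"best: $best%.2f ms, avg: $avg%.2f ms")
{code}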
[jira] [Created] (SPARK-13174) Add API and options for csv data sources
Davies Liu created SPARK-13174: -- Summary: Add API and options for csv data sources Key: SPARK-13174 URL: https://issues.apache.org/jira/browse/SPARK-13174 Project: Spark Issue Type: New Feature Reporter: Davies Liu We should have an API to load CSV data sources (with options passed as arguments), similar to json() and jdbc(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
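A sketch of the kind of call this asks for, written in the style of json() and jdbc(); the method name and option keys below are assumptions rather than a settled API, and the snippet presumes a spark-shell session with sqlContext in scope.
{code}
// Hypothetical reader API; names and options are illustrative.
val df = sqlContext.read
  .option("header", "true")       // treat the first line as column names
  .option("delimiter", "|")       // support non-comma separators
  .option("inferSchema", "true")  // sample the data to pick column types
  .csv("/path/to/data.csv")
{code}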
[jira] [Created] (SPARK-13173) Fail to load CSV file with NPE
Davies Liu created SPARK-13173: -- Summary: Fail to load CSV file with NPE Key: SPARK-13173 URL: https://issues.apache.org/jira/browse/SPARK-13173 Project: Spark Issue Type: Bug Reporter: Davies Liu Loading the following pipe-delimited file with schema inference fails with an NPE: {code} id|end_date|start_date|location 1|2015-10-14 00:00:00|2015-09-14 00:00:00|CA-SF 2|2015-10-15 01:00:20|2015-08-14 00:00:00|CA-SD 3|2015-10-16 02:30:00|2015-01-14 00:00:00|NY-NY 4|2015-10-17 03:00:20|2015-02-14 00:00:00|NY-NY 5|2015-10-18 04:30:00|2014-04-14 00:00:00|CA-SD {code} {code} adult_df = sqlContext.read.\ format("org.apache.spark.sql.execution.datasources.csv").\ option("header", "false").option("delimiter", "|").\ option("inferSchema", "true").load("/tmp/dataframe_sample.csv") {code} The load fails with: {code} Py4JJavaError: An error occurred while calling o239.load. : java.lang.NullPointerException at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114) at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114) at scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:93) at scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:108) at org.apache.spark.sql.execution.datasources.csv.CSVRelation.inferSchema(CSVRelation.scala:137) at org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema$lzycompute(CSVRelation.scala:50) at org.apache.spark.sql.execution.datasources.csv.CSVRelation.dataSchema(CSVRelation.scala:48) at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:666) at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:665) at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:39) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:115) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:136) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) at py4j.Gateway.invoke(Gateway.java:290) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:209) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
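A hedged reading of the stack trace (the helper below is illustrative, not the actual CSVRelation code): inferSchema calls zipWithIndex on the parsed first row to synthesize column names, so if that row comes back null, for example when sampling happens before the "|" delimiter takes effect, the NPE above is exactly what you would see.
{code}
// Hypothetical reconstruction of the failing step; names are assumptions.
def inferColumnNames(firstRow: Array[String], useHeader: Boolean): Array[String] = {
  // Guard against the null that triggers the NPE in the report above.
  require(firstRow != null, "could not parse the first row; check the delimiter")
  if (useHeader) firstRow
  else firstRow.zipWithIndex.map { case (_, i) => s"C$i" }  // C0, C1, ...
}
{code}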
[jira] [Commented] (SPARK-12992) Vectorize parquet decoding using ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-12992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131336#comment-15131336 ] Davies Liu commented on SPARK-12992: [~nongli] We usually have one PR per JIRA; the JIRA will be closed as resolved by the merge tool automatically. > Vectorize parquet decoding using ColumnarBatch > -- > > Key: SPARK-12992 > URL: https://issues.apache.org/jira/browse/SPARK-12992 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li >Assignee: Apache Spark > Fix For: 2.0.0 > > > Parquet files benefit from vectorized decoding. ColumnarBatches have been > designed to support this. This means that a single encoded parquet column is > decoded to a single ColumnVector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
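To make the quoted description concrete, a minimal sketch of what "one encoded parquet column decodes to one ColumnVector" looks like (the vector class is illustrative, not Spark's ColumnarBatch API): the decoder fills a whole batch of values for a single column in a tight, branch-light loop instead of materializing one row at a time.
{code}
// Illustrative stand-in for a ColumnVector; not the real API.
class IntColumnVectorSketch(capacity: Int) {
  private val data = new Array[Int](capacity)
  def putInt(row: Int, value: Int): Unit = data(row) = value
  def getInt(row: Int): Int = data(row)
}

// Decode a dictionary-encoded column: ids index into the dictionary values.
def decodeDictionaryColumn(ids: Array[Int], dict: Array[Int],
                           out: IntColumnVectorSketch): Unit = {
  var i = 0
  while (i < ids.length) {
    out.putInt(i, dict(ids(i)))
    i += 1
  }
}
{code}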
[jira] [Updated] (SPARK-13131) Use best time and average time in micro benchmark
[ https://issues.apache.org/jira/browse/SPARK-13131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13131: --- Description: Best time should be more stable than average time in benchmarks; reported together with average time, the two convey more information. (was: Median time should be more stable than average time in benchmark.) > Use best time and average time in micro benchmark > -- > > Key: SPARK-13131 > URL: https://issues.apache.org/jira/browse/SPARK-13131 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Best time should be more stable than average time in benchmarks; reported > together with average time, the two convey more information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13131) Use best time and average time in micro benchmark
[ https://issues.apache.org/jira/browse/SPARK-13131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13131: --- Summary: Use best time and average time in micro benchmark (was: Use median time in benchmark) > Use best time and average time in micro benchmark > -- > > Key: SPARK-13131 > URL: https://issues.apache.org/jira/browse/SPARK-13131 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Median time should be more stable than average time in benchmark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org