[jira] [Resolved] (SPARK-25635) Support selective direct encoding in native ORC write
[ https://issues.apache.org/jira/browse/SPARK-25635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25635. - Resolution: Fixed Fix Version/s: 3.0.0 > Support selective direct encoding in native ORC write > - > > Key: SPARK-25635 > URL: https://issues.apache.org/jira/browse/SPARK-25635 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > Before ORC 1.5.3, `orc.dictionary.key.threshold` and > `hive.exec.orc.dictionary.key.size.threshold` are applied to all columns. > This is a big hurdle to controlling dictionary encoding. > In ORC 1.5.3, `orc.column.encoding.direct` was added to enforce direct > encoding selectively in a column-wise manner. This issue aims to add that > feature by upgrading ORC from 1.5.2 to 1.5.3. > The following are the patches in ORC 1.5.3; this feature (ORC-397) is the only > one directly related to Spark. > {code} > ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts > multi-byte data (gopalv) > ORC-403: [C++] Add checks to avoid invalid offsets in InputStream > ORC-405. Remove calcite as a dependency from the benchmarks. > ORC-375: Fix libhdfs on gcc7 by adding #include two places. > ORC-383: Parallel builds fails with ConcurrentModificationException > ORC-382: Apache rat exclusions + add rat check to travis > ORC-401: Fix incorrect quoting in specification. > ORC-385. Change RecordReader to extend Closeable. > ORC-384: [C++] fix memory leak when loading non-ORC files > ORC-391: [c++] parseType does not accept underscore in the field name > ORC-397. Allow selective disabling of dictionary encoding. Original patch was > by Mithun Radhakrishnan. 
> ORC-389: Add ability to not decode Acid metadata columns > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
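As a rough sketch of how the new knob might look once the upgrade lands: ORC-397's `orc.column.encoding.direct` takes a comma-separated list of column names whose writers should use direct encoding instead of dictionary encoding. The column names below are illustrative, and the exact way Spark's native writer forwards this entry to the ORC library is an assumption, not taken from the issue:

```
orc.column.encoding.direct=id,name
```

With this in place, the global dictionary thresholds keep their defaults while only the listed columns opt out of dictionary encoding.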
[jira] [Resolved] (SPARK-25626) HiveClientSuites: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 46 sec
[ https://issues.apache.org/jira/browse/SPARK-25626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25626. - Resolution: Fixed Assignee: Dilip Biswal Fix Version/s: 3.0.0 > HiveClientSuites: getPartitionsByFilter returns all partitions when > hive.metastore.try.direct.sql=false 46 sec > -- > > Key: SPARK-25626 > URL: https://issues.apache.org/jira/browse/SPARK-25626 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Dilip Biswal >Priority: Major > Fix For: 3.0.0 > > > HiveClientSuite.2.3: getPartitionsByFilter returns all partitions when > hive.metastore.try.direct.sql=false 46 sec Passed > HiveClientSuite.2.2: getPartitionsByFilter returns all partitions when > hive.metastore.try.direct.sql=false 45 sec Passed > HiveClientSuite.2.1: getPartitionsByFilter returns all partitions when > hive.metastore.try.direct.sql=false 42 sec Passed > HiveClientSuite.2.0: getPartitionsByFilter returns all partitions when > hive.metastore.try.direct.sql=false 39 sec Passed > HiveClientSuite.1.2: getPartitionsByFilter returns all partitions when > hive.metastore.try.direct.sql=false 37 sec Passed > HiveClientSuite.1.1: getPartitionsByFilter returns all partitions when > hive.metastore.try.direct.sql=false 36 sec Passed
[jira] [Updated] (SPARK-20845) Support specification of column names in INSERT INTO
[ https://issues.apache.org/jira/browse/SPARK-20845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20845: Target Version/s: 3.0.0 > Support specification of column names in INSERT INTO > > > Key: SPARK-20845 > URL: https://issues.apache.org/jira/browse/SPARK-20845 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Priority: Minor > > Some databases allow you to specify column names for the target > of an INSERT INTO. For example, in SQLite: > {code} > sqlite> CREATE TABLE twocolumn (x INT, y INT); INSERT INTO twocolumn(x, y) > VALUES (44,51), (NULL,52), (42,53), (45,45) >...> ; > sqlite> select * from twocolumn; > 44|51 > |52 > 42|53 > 45|45 > {code} > I have a corpus of existing queries of this form which I would like to run on > Spark SQL, so I think we should extend our dialect to support this syntax. > When implementing this, we should make sure to test the following behaviors > and corner-cases: > - Number of columns specified is greater than or less than the number of > columns in the table. > - Specification of repeated columns. > - Specification of columns which do not exist in the target table. > - Permutation of the column order instead of the default order in the table. > For each of these, we should check how SQLite behaves and should also compare > against another database. It looks like T-SQL supports this; see > https://technet.microsoft.com/en-us/library/dd776381(v=sql.105).aspx under > the "Inserting data that is not in the same order as the table columns" > header.
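The SQLite behaviors to compare against can be reproduced with Python's built-in `sqlite3` module; a small sketch of the session above plus two of the listed corner cases (what Spark SQL should do for each is exactly what the issue leaves open):

```python
import sqlite3

# Reproduce the SQLite session from the issue description.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE twocolumn (x INT, y INT)")
conn.execute(
    "INSERT INTO twocolumn(x, y) VALUES (44,51), (NULL,52), (42,53), (45,45)"
)
all_rows = conn.execute("SELECT * FROM twocolumn").fetchall()
# [(44, 51), (None, 52), (42, 53), (45, 45)]

# Corner case: permuted column order. Values bind to the named columns,
# not to the table's declared column order.
conn.execute("INSERT INTO twocolumn(y, x) VALUES (1, 2)")
permuted = conn.execute("SELECT x, y FROM twocolumn WHERE y = 1").fetchall()
# [(2, 1)]

# Corner case: fewer columns than the table declares. The unmentioned
# column is filled with its default (NULL here).
conn.execute("INSERT INTO twocolumn(x) VALUES (99)")
partial = conn.execute("SELECT x, y FROM twocolumn WHERE x = 99").fetchall()
# [(99, None)]
```

SQLite rejects the remaining two corner cases (repeated columns and columns that do not exist in the target table) with an error, which is one plausible baseline for the Spark SQL behavior.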
[jira] [Created] (SPARK-25655) Add Pspark-ganglia-lgpl to the scala style check
Xiao Li created SPARK-25655: --- Summary: Add Pspark-ganglia-lgpl to the scala style check Key: SPARK-25655 URL: https://issues.apache.org/jira/browse/SPARK-25655 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.0 Reporter: Xiao Li Assignee: Xiao Li Our lint failed due to the following errors: {code} [INFO] --- scalastyle-maven-plugin:1.0.0:check (default) @ spark-ganglia-lgpl_2.11 --- error file=/home/jenkins/workspace/spark-master-maven-snapshots/spark/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala message= Are you sure that you want to use toUpperCase or toLowerCase without the root locale? In most cases, you should use toUpperCase(Locale.ROOT) or toLowerCase(Locale.ROOT) instead. If you must use toUpperCase or toLowerCase without the root locale, wrap the code block with // scalastyle:off caselocale .toUpperCase .toLowerCase // scalastyle:on caselocale line=67 column=49 error file=/home/jenkins/workspace/spark-master-maven-snapshots/spark/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala message= Are you sure that you want to use toUpperCase or toLowerCase without the root locale? In most cases, you should use toUpperCase(Locale.ROOT) or toLowerCase(Locale.ROOT) instead. If you must use toUpperCase or toLowerCase without the root locale, wrap the code block with // scalastyle:off caselocale .toUpperCase .toLowerCase // scalastyle:on caselocale line=71 column=32 Saving to outputFile=/home/jenkins/workspace/spark-master-maven-snapshots/spark/external/spark-ganglia-lgpl/target/scalastyle-output.xml {code} See https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/8890/
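The pitfall the `caselocale` check guards against is the Turkish dotted/dotless `i`: in Java, `"i".toUpperCase()` under a Turkish default locale yields U+0130 rather than `I`, which is why `Locale.ROOT` is recommended. The underlying Unicode case mappings can be illustrated outside the JVM; a sketch in Python, whose `str.upper()`/`str.lower()` use the locale-independent default mappings (the behavior `Locale.ROOT` requests):

```python
# Locale-independent (Unicode default) casing, as Locale.ROOT gives in Java:
assert "i".upper() == "I"

# Turkish LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130) does not
# round-trip to a plain "i": its default lowercase form is "i" followed
# by U+0307 COMBINING DOT ABOVE, i.e. two code points.
dotted_capital_i = "\u0130"
lowered = dotted_capital_i.lower()
assert lowered != "i"
assert len(lowered) == 2

# The dotless counterpart (U+0131) uppercases to plain ASCII "I".
assert "\u0131".upper() == "I"
```

Metric and property names built with locale-sensitive casing can therefore change spelling depending on the host locale, which is what the scalastyle rule flags in GangliaSink.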
[jira] [Resolved] (SPARK-25606) DateExpressionsSuite: Hour 1 min
[ https://issues.apache.org/jira/browse/SPARK-25606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25606. - Resolution: Fixed Fix Version/s: 3.0.0 > DateExpressionsSuite: Hour 1 min > > > Key: SPARK-25606 > URL: https://issues.apache.org/jira/browse/SPARK-25606 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite.Hour 1 min
[jira] [Resolved] (SPARK-25609) DataFrameSuite: SPARK-22226: splitExpressions should not generate codes beyond 64KB 49 seconds
[ https://issues.apache.org/jira/browse/SPARK-25609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25609. - Resolution: Fixed Assignee: Marco Gaido Fix Version/s: 3.0.0 > DataFrameSuite: SPARK-22226: splitExpressions should not generate codes > beyond 64KB 49 seconds > -- > > Key: SPARK-25609 > URL: https://issues.apache.org/jira/browse/SPARK-25609 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 3.0.0 > > > org.apache.spark.sql.DataFrameSuite.SPARK-22226: splitExpressions should not > generate codes beyond 64KB 49 seconds
[jira] [Resolved] (SPARK-25605) CastSuite: cast string to timestamp 2 mins 31 sec
[ https://issues.apache.org/jira/browse/SPARK-25605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25605. - Resolution: Fixed Assignee: Marco Gaido Fix Version/s: 3.0.0 > CastSuite: cast string to timestamp 2 mins 31 sec > - > > Key: SPARK-25605 > URL: https://issues.apache.org/jira/browse/SPARK-25605 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 3.0.0 > > > org.apache.spark.sql.catalyst.expressions.CastSuite.cast string to timestamp > took 2 min 31 secs
[jira] [Assigned] (SPARK-25606) DateExpressionsSuite: Hour 1 min
[ https://issues.apache.org/jira/browse/SPARK-25606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-25606: --- Assignee: Yuming Wang > DateExpressionsSuite: Hour 1 min > > > Key: SPARK-25606 > URL: https://issues.apache.org/jira/browse/SPARK-25606 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Yuming Wang >Priority: Major > > org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite.Hour 1 min
[jira] [Created] (SPARK-25632) KafkaRDDSuite: compacted topic 2 min 5 sec.
Xiao Li created SPARK-25632: --- Summary: KafkaRDDSuite: compacted topic 2 min 5 sec. Key: SPARK-25632 URL: https://issues.apache.org/jira/browse/SPARK-25632 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.streaming.kafka010.KafkaRDDSuite.compacted topic Took 2 min 5 sec.
[jira] [Created] (SPARK-25631) KafkaRDDSuite: basic usage 2 min 4 sec
Xiao Li created SPARK-25631: --- Summary: KafkaRDDSuite: basic usage 2 min 4 sec Key: SPARK-25631 URL: https://issues.apache.org/jira/browse/SPARK-25631 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.streaming.kafka010.KafkaRDDSuite.basic usage Took 2 min 4 sec.
[jira] [Created] (SPARK-25630) HiveOrcHadoopFsRelationSuite: SPARK-8406: Avoids name collision while writing files 21 sec
Xiao Li created SPARK-25630: --- Summary: HiveOrcHadoopFsRelationSuite: SPARK-8406: Avoids name collision while writing files 21 sec Key: SPARK-25630 URL: https://issues.apache.org/jira/browse/SPARK-25630 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.hive.orc.HiveOrcHadoopFsRelationSuite.SPARK-8406: Avoids name collision while writing files Took 21 sec.
[jira] [Created] (SPARK-25629) ParquetFilterSuite: filter pushdown - decimal 16 sec
Xiao Li created SPARK-25629: --- Summary: ParquetFilterSuite: filter pushdown - decimal 16 sec Key: SPARK-25629 URL: https://issues.apache.org/jira/browse/SPARK-25629 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.execution.datasources.parquet.ParquetFilterSuite.filter pushdown - decimal Took 16 sec.
[jira] [Created] (SPARK-25628) DistributedSuite: recover from repeated node failures during shuffle-reduce 40 seconds
Xiao Li created SPARK-25628: --- Summary: DistributedSuite: recover from repeated node failures during shuffle-reduce 40 seconds Key: SPARK-25628 URL: https://issues.apache.org/jira/browse/SPARK-25628 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.DistributedSuite.recover from repeated node failures during shuffle-reduce 40 seconds
[jira] [Created] (SPARK-25627) ContinuousStressSuite - 8 mins 13 sec
Xiao Li created SPARK-25627: --- Summary: ContinuousStressSuite - 8 mins 13 sec Key: SPARK-25627 URL: https://issues.apache.org/jira/browse/SPARK-25627 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li ContinuousStressSuite - 8 mins 13 sec
[jira] [Created] (SPARK-25626) HiveClientSuites: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 46 sec
Xiao Li created SPARK-25626: --- Summary: HiveClientSuites: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 46 sec Key: SPARK-25626 URL: https://issues.apache.org/jira/browse/SPARK-25626 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li HiveClientSuite.2.3: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 46 sec Passed HiveClientSuite.2.2: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 45 sec Passed HiveClientSuite.2.1: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 42 sec Passed HiveClientSuite.2.0: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 39 sec Passed HiveClientSuite.1.2: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 37 sec Passed HiveClientSuite.1.1: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 36 sec Passed
[jira] [Created] (SPARK-25625) LogisticRegressionSuite.binary logistic regression with intercept with ElasticNet regularization - 33 sec
Xiao Li created SPARK-25625: --- Summary: LogisticRegressionSuite.binary logistic regression with intercept with ElasticNet regularization - 33 sec Key: SPARK-25625 URL: https://issues.apache.org/jira/browse/SPARK-25625 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li LogisticRegressionSuite.binary logistic regression with intercept with ElasticNet regularization Took 33 sec.
[jira] [Created] (SPARK-25624) LogisticRegressionSuite.multinomial logistic regression with intercept with elasticnet regularization 56 seconds
Xiao Li created SPARK-25624: --- Summary: LogisticRegressionSuite.multinomial logistic regression with intercept with elasticnet regularization 56 seconds Key: SPARK-25624 URL: https://issues.apache.org/jira/browse/SPARK-25624 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.ml.classification.LogisticRegressionSuite.multinomial logistic regression with intercept with elasticnet regularization Took 56 sec.
[jira] [Created] (SPARK-25623) LogisticRegressionSuite: multinomial logistic regression with intercept with L1 regularization 1 min 10 sec
Xiao Li created SPARK-25623: --- Summary: LogisticRegressionSuite: multinomial logistic regression with intercept with L1 regularization 1 min 10 sec Key: SPARK-25623 URL: https://issues.apache.org/jira/browse/SPARK-25623 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.ml.classification.LogisticRegressionSuite.multinomial logistic regression with intercept with L1 regularization Took 1 min 10 sec.
[jira] [Created] (SPARK-25622) BucketedReadWithHiveSupportSuite: read partitioning bucketed tables with bucket pruning filters - 42 seconds
Xiao Li created SPARK-25622: --- Summary: BucketedReadWithHiveSupportSuite: read partitioning bucketed tables with bucket pruning filters - 42 seconds Key: SPARK-25622 URL: https://issues.apache.org/jira/browse/SPARK-25622 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.sources.BucketedReadWithHiveSupportSuite.read partitioning bucketed tables with bucket pruning filters Took 42 sec.
[jira] [Created] (SPARK-25621) BucketedReadWithHiveSupportSuite: read partitioning bucketed tables having composite filters 45 sec
Xiao Li created SPARK-25621: --- Summary: BucketedReadWithHiveSupportSuite: read partitioning bucketed tables having composite filters 45 sec Key: SPARK-25621 URL: https://issues.apache.org/jira/browse/SPARK-25621 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.sources.BucketedReadWithHiveSupportSuite.read partitioning bucketed tables having composite filters Took 45 sec.
[jira] [Updated] (SPARK-25620) WithAggregationKinesisStreamSuite: failure recovery 1 min 36 seconds
[ https://issues.apache.org/jira/browse/SPARK-25620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25620: Description: org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure recovery Took 1 min 36 sec. org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.failure recovery Took 1 min 24 sec. was: org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure recovery Took 1 min 36 sec. > WithAggregationKinesisStreamSuite: failure recovery 1 min 36 seconds > > > Key: SPARK-25620 > URL: https://issues.apache.org/jira/browse/SPARK-25620 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure > recovery > Took 1 min 36 sec. > org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.failure > recovery > Took 1 min 24 sec.
[jira] [Updated] (SPARK-25619) WithAggregationKinesisStreamSuite: split and merge shards in a stream 2 min 15 sec
[ https://issues.apache.org/jira/browse/SPARK-25619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25619: Description: org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split and merge shards in a stream 2 min 15 sec org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.split and merge shards in a stream 1 min 52 sec. was: org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split and merge shards in a stream 2 min 15 sec > WithAggregationKinesisStreamSuite: split and merge shards in a stream 2 min > 15 sec > -- > > Key: SPARK-25619 > URL: https://issues.apache.org/jira/browse/SPARK-25619 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > > org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split > and merge shards in a stream 2 min 15 sec > org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.split > and merge shards in a stream 1 min 52 sec.
[jira] [Created] (SPARK-25620) WithAggregationKinesisStreamSuite: failure recovery 1 min 36 seconds
Xiao Li created SPARK-25620: --- Summary: WithAggregationKinesisStreamSuite: failure recovery 1 min 36 seconds Key: SPARK-25620 URL: https://issues.apache.org/jira/browse/SPARK-25620 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure recovery Took 1 min 36 sec.
[jira] [Created] (SPARK-25619) WithAggregationKinesisStreamSuite: split and merge shards in a stream 2 min 15 sec
Xiao Li created SPARK-25619: --- Summary: WithAggregationKinesisStreamSuite: split and merge shards in a stream 2 min 15 sec Key: SPARK-25619 URL: https://issues.apache.org/jira/browse/SPARK-25619 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split and merge shards in a stream 2 min 15 sec
[jira] [Created] (SPARK-25618) KafkaContinuousSourceStressForDontFailOnDataLossSuite: stress test for failOnDataLoss=false 1 min 1 sec
Xiao Li created SPARK-25618: --- Summary: KafkaContinuousSourceStressForDontFailOnDataLossSuite: stress test for failOnDataLoss=false 1 min 1 sec Key: SPARK-25618 URL: https://issues.apache.org/jira/browse/SPARK-25618 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.kafka010.KafkaContinuousSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false 1 min 1 sec
[jira] [Created] (SPARK-25617) KafkaContinuousSinkSuite: generic - write big data with small producer buffer 56 secs
Xiao Li created SPARK-25617: --- Summary: KafkaContinuousSinkSuite: generic - write big data with small producer buffer 56 secs Key: SPARK-25617 URL: https://issues.apache.org/jira/browse/SPARK-25617 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.kafka010.KafkaContinuousSinkSuite.generic - write big data with small producer buffer 56 seconds
[jira] [Created] (SPARK-25616) KafkaSinkSuite: generic - write big data with small producer buffer 57 secs
Xiao Li created SPARK-25616: --- Summary: KafkaSinkSuite: generic - write big data with small producer buffer 57 secs Key: SPARK-25616 URL: https://issues.apache.org/jira/browse/SPARK-25616 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.kafka010.KafkaSinkSuite.generic - write big data with small producer buffer 57 secs
[jira] [Created] (SPARK-25615) KafkaSinkSuite: streaming - write to non-existing topic 1 min
Xiao Li created SPARK-25615: --- Summary: KafkaSinkSuite: streaming - write to non-existing topic 1 min Key: SPARK-25615 URL: https://issues.apache.org/jira/browse/SPARK-25615 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.kafka010.KafkaSinkSuite.streaming - write to non-existing topic 1 min
[jira] [Created] (SPARK-25614) HiveSparkSubmitSuite: SPARK-18989: DESC TABLE should not fail with format class not found 38 seconds
Xiao Li created SPARK-25614: --- Summary: HiveSparkSubmitSuite: SPARK-18989: DESC TABLE should not fail with format class not found 38 seconds Key: SPARK-25614 URL: https://issues.apache.org/jira/browse/SPARK-25614 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-18989: DESC TABLE should not fail with format class not found 38 seconds
[jira] [Created] (SPARK-25613) HiveSparkSubmitSuite: dir 1 min 3 seconds
Xiao Li created SPARK-25613: --- Summary: HiveSparkSubmitSuite: dir 1 min 3 seconds Key: SPARK-25613 URL: https://issues.apache.org/jira/browse/SPARK-25613 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.hive.HiveSparkSubmitSuite.dir 1 min 3 sec
[jira] [Created] (SPARK-25612) CompressionCodecSuite: table-level compression is not set but session-level compressions 47 seconds
Xiao Li created SPARK-25612: --- Summary: CompressionCodecSuite: table-level compression is not set but session-level compressions 47 seconds Key: SPARK-25612 URL: https://issues.apache.org/jira/browse/SPARK-25612 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.hive.CompressionCodecSuite.table-level compression is not set but session-level compressions is set 47 seconds
[jira] [Created] (SPARK-25611) CompressionCodecSuite: both table-level and session-level compression are set 2 min 20 sec
Xiao Li created SPARK-25611: --- Summary: CompressionCodecSuite: both table-level and session-level compression are set 2 min 20 sec Key: SPARK-25611 URL: https://issues.apache.org/jira/browse/SPARK-25611 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.hive.CompressionCodecSuite.both table-level and session-level compression are set: 2 min 20 sec
[jira] [Created] (SPARK-25610) DatasetCacheSuite: cache UDF result correctly 25 seconds
Xiao Li created SPARK-25610: --- Summary: DatasetCacheSuite: cache UDF result correctly 25 seconds Key: SPARK-25610 URL: https://issues.apache.org/jira/browse/SPARK-25610 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.DatasetCacheSuite.cache UDF result correctly 25 seconds
[jira] [Created] (SPARK-25609) DataFrameSuite: SPARK-22226: splitExpressions should not generate codes beyond 64KB 49 seconds
Xiao Li created SPARK-25609: --- Summary: DataFrameSuite: SPARK-22226: splitExpressions should not generate codes beyond 64KB 49 seconds Key: SPARK-25609 URL: https://issues.apache.org/jira/browse/SPARK-25609 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.DataFrameSuite.SPARK-22226: splitExpressions should not generate codes beyond 64KB 49 seconds
[jira] [Created] (SPARK-25608) HashAggregationQueryWithControlledFallbackSuite: multiple distinct multiple columns sets 38 seconds
Xiao Li created SPARK-25608: --- Summary: HashAggregationQueryWithControlledFallbackSuite: multiple distinct multiple columns sets 38 seconds Key: SPARK-25608 URL: https://issues.apache.org/jira/browse/SPARK-25608 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.hive.execution.HashAggregationQueryWithControlledFallbackSuite.multiple distinct multiple columns sets 38 seconds
[jira] [Created] (SPARK-25607) HashAggregationQueryWithControlledFallbackSuite: single distinct column set 42 seconds
Xiao Li created SPARK-25607: --- Summary: HashAggregationQueryWithControlledFallbackSuite: single distinct column set 42 seconds Key: SPARK-25607 URL: https://issues.apache.org/jira/browse/SPARK-25607 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.hive.execution.HashAggregationQueryWithControlledFallbackSuite.single distinct column set 42 seconds
[jira] [Created] (SPARK-25606) DateExpressionsSuite: Hour 1 min
Xiao Li created SPARK-25606: --- Summary: DateExpressionsSuite: Hour 1 min Key: SPARK-25606 URL: https://issues.apache.org/jira/browse/SPARK-25606 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite.Hour 1 min -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25605) CastSuite: cast string to timestamp 2 mins 31 sec
Xiao Li created SPARK-25605: --- Summary: CastSuite: cast string to timestamp 2 mins 31 sec Key: SPARK-25605 URL: https://issues.apache.org/jira/browse/SPARK-25605 Project: Spark Issue Type: Sub-task Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li org.apache.spark.sql.catalyst.expressions.CastSuite.cast string to timestamp took 2 min 31 secs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25604) Reduce the overall time costs in Jenkins tests
Xiao Li created SPARK-25604: --- Summary: Reduce the overall time costs in Jenkins tests Key: SPARK-25604 URL: https://issues.apache.org/jira/browse/SPARK-25604 Project: Spark Issue Type: Umbrella Components: Tests Affects Versions: 3.0.0 Reporter: Xiao Li Currently, our Jenkins tests take almost 5 hours. To reduce the test time, below are my suggestions: * split the tests into multiple individual Jenkins jobs; * tune the confs in the test framework; * for the slow test cases, rewrite them or even optimize the source code to speed them up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24499) Documentation improvement of Spark core and SQL
[ https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24499: Target Version/s: 3.0.0 > Documentation improvement of Spark core and SQL > --- > > Key: SPARK-24499 > URL: https://issues.apache.org/jira/browse/SPARK-24499 > Project: Spark > Issue Type: New Feature > Components: Documentation, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > The current documentation in Apache Spark lacks enough code examples and > tips. If needed, we should also split the page of > https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple > separate pages like what we did for > https://spark.apache.org/docs/latest/ml-guide.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24499) Documentation improvement of Spark core and SQL
[ https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635836#comment-16635836 ] Xiao Li commented on SPARK-24499: - [~XuanYuan] Yeah. Let us do the split first, and then discuss how to enrich our doc. We need to add a lot of material if we compare our doc with other popular OSS DBMS projects, e.g., Postgres, MySQL, and so on. > Documentation improvement of Spark core and SQL > --- > > Key: SPARK-24499 > URL: https://issues.apache.org/jira/browse/SPARK-24499 > Project: Spark > Issue Type: New Feature > Components: Documentation, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > The current documentation in Apache Spark lacks enough code examples and > tips. If needed, we should also split the page of > https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple > separate pages like what we did for > https://spark.apache.org/docs/latest/ml-guide.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25414) make it clear that the numRows metrics should be counted for each scan of the source
[ https://issues.apache.org/jira/browse/SPARK-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25414: Fix Version/s: (was: 2.5.0) > make it clear that the numRows metrics should be counted for each scan of the > source > > > Key: SPARK-25414 > URL: https://issues.apache.org/jira/browse/SPARK-25414 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25426) Remove the duplicate fallback logic in UnsafeProjection
[ https://issues.apache.org/jira/browse/SPARK-25426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25426: Fix Version/s: (was: 2.5.0) > Remove the duplicate fallback logic in UnsafeProjection > --- > > Key: SPARK-25426 > URL: https://issues.apache.org/jira/browse/SPARK-25426 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25381) Stratified sampling by Column argument
[ https://issues.apache.org/jira/browse/SPARK-25381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-25381: --- Assignee: Maxim Gekk > Stratified sampling by Column argument > -- > > Key: SPARK-25381 > URL: https://issues.apache.org/jira/browse/SPARK-25381 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Currently the sampleBy method accepts a first argument of String type only. > We need to provide an overloaded method that accepts the Column type too, > which will allow sampling by multiple columns, for example: > {code:scala} > import org.apache.spark.sql.Row > import org.apache.spark.sql.functions.struct > val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), > ("Bob", 17), > ("Alice", 10))).toDF("name", "age") > val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0) > df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show() > +-----+---+ > | name|age| > +-----+---+ > | Nico|  8| > |Alice| 10| > +-----+---+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf
[ https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25415: Fix Version/s: (was: 2.5.0) > Make plan change log in RuleExecutor configurable by SQLConf > > > Key: SPARK-25415 > URL: https://issues.apache.org/jira/browse/SPARK-25415 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Major > Fix For: 3.0.0 > > > In RuleExecutor, after applying a rule, if the plan has changed, the before > and after plan will be logged using level "trace". At times, however, such > information can be very helpful for debugging, so making the log level > configurable in SQLConf would allow users to turn on the plan change log > independently and save the trouble of tweaking log4j settings. > Meanwhile, filtering plan change log for specific rules can also be very > useful. > So I propose adding two confs: > 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for > logging plan changes after a rule is applied. > 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only > for a set of specified rules, separated by commas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
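The two confs proposed above could be exercised as follows. This is only a sketch: it assumes the conf names land exactly as written in the ticket, and the optimizer rule named below is purely illustrative.
{code:scala}
// Sketch: surface plan-change logging at warn level so it appears without
// tweaking log4j, restricted to a single (illustrative) optimizer rule.
spark.conf.set("spark.sql.optimizer.planChangeLog.level", "warn")
spark.conf.set("spark.sql.optimizer.planChangeLog.rules",
  "org.apache.spark.sql.catalyst.optimizer.ColumnPruning")

// Any query that triggers optimization would then log before/after plans
// whenever that rule changes the plan.
spark.range(10).selectExpr("id").collect()
{code}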
[jira] [Updated] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata
[ https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25423: Fix Version/s: (was: 2.5.0) > Output "dataFilters" in DataSourceScanExec.metadata > --- > > Key: SPARK-25423 > URL: https://issues.apache.org/jira/browse/SPARK-25423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Maryann Xue >Assignee: Yuming Wang >Priority: Trivial > Labels: starter > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25449) Don't send zero accumulators in heartbeats
[ https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25449: Fix Version/s: (was: 2.5.0) 3.0.0 > Don't send zero accumulators in heartbeats > -- > > Key: SPARK-25449 > URL: https://issues.apache.org/jira/browse/SPARK-25449 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Mukul Murthy >Assignee: Mukul Murthy >Priority: Major > Fix For: 3.0.0 > > > Heartbeats sent from executors to the driver every 10 seconds contain metrics > and are generally on the order of a few KBs. However, for large jobs with > lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks > to die with heartbeat failures. We can mitigate this by not sending zero > metrics to the driver. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25472) Structured Streaming query.stop() doesn't always stop gracefully
[ https://issues.apache.org/jira/browse/SPARK-25472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25472: Fix Version/s: (was: 2.5.0) 3.0.0 > Structured Streaming query.stop() doesn't always stop gracefully > > > Key: SPARK-25472 > URL: https://issues.apache.org/jira/browse/SPARK-25472 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > We can have race conditions where the cancelling of Spark jobs will throw a > SparkException when stopping a streaming query. This SparkException specifies > that the job was cancelled. We can use this error message to swallow the > error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25458) Support FOR ALL COLUMNS in ANALYZE TABLE
[ https://issues.apache.org/jira/browse/SPARK-25458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25458: Fix Version/s: (was: 2.5.0) 3.0.0 > Support FOR ALL COLUMNS in ANALYZE TABLE > - > > Key: SPARK-25458 > URL: https://issues.apache.org/jira/browse/SPARK-25458 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Xiao Li >Assignee: Dilip Biswal >Priority: Major > Fix For: 3.0.0 > > > Currently, to collect the statistics of all the columns, users need to > specify the names of all the columns when calling the command "ANALYZE TABLE > ... FOR COLUMNS...". This is not user friendly. Instead, we can introduce the > following SQL command to achieve it without specifying the column names. > {code:java} >ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR ALL COLUMNS; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25457) IntegralDivide (div) should not always return long
[ https://issues.apache.org/jira/browse/SPARK-25457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25457: Fix Version/s: (was: 2.5.0) 3.0.0 > IntegralDivide (div) should not always return long > -- > > Key: SPARK-25457 > URL: https://issues.apache.org/jira/browse/SPARK-25457 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Major > Fix For: 3.0.0 > > > The operation {{div}} always returns long. This came from Hive's behavior, > which differs from that of most other DBMSs (e.g. MySQL, Postgres), which > return the same datatype as the operands. > This JIRA tracks changing our return type and allowing users to re-enable > the old behavior using {{spark.sql.legacy.integralDivide.returnBigint}}. > I'll submit a PR for this soon. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
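A sketch of the proposed semantics (hedged: this illustrates the behavior the ticket proposes, not released behavior; the legacy conf name is taken from the ticket):
{code:scala}
// Under the proposal, div would return the type of its operands:
spark.sql("SELECT 7 div 2").printSchema()  // expected: int rather than bigint

// Re-enable the old Hive-compatible behavior:
spark.conf.set("spark.sql.legacy.integralDivide.returnBigint", "true")
spark.sql("SELECT 7 div 2").printSchema()  // bigint again
{code}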
[jira] [Updated] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
[ https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25429: Fix Version/s: (was: 2.5.0) > SparkListenerBus inefficient due to > 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure > > > Key: SPARK-25429 > URL: https://issues.apache.org/jira/browse/SPARK-25429 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: DENG FEI >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > {code:java} > private def updateStageMetrics( > stageId: Int, > attemptId: Int, > taskId: Long, > accumUpdates: Seq[AccumulableInfo], > succeeded: Boolean): Unit = { > Option(stageMetrics.get(stageId)).foreach { metrics => > if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) { > return > } > val oldTaskMetrics = metrics.taskMetrics.get(taskId) > if (oldTaskMetrics != null && oldTaskMetrics.succeeded) { > return > } > val updates = accumUpdates > .filter { acc => acc.update.isDefined && > metrics.accumulatorIds.contains(acc.id) } > .sortBy(_.id) > if (updates.isEmpty) { > return > } > val ids = new Array[Long](updates.size) > val values = new Array[Long](updates.size) > updates.zipWithIndex.foreach { case (acc, idx) => > ids(idx) = acc.id > // In a live application, accumulators have Long values, but when > reading from event > // logs, they have String values. For now, assume all accumulators > are Long and covert > // accordingly. > values(idx) = acc.update.get match { > case s: String => s.toLong > case l: Long => l > case o => throw new IllegalArgumentException(s"Unexpected: $o") > } > } > // TODO: storing metrics by task ID can cause metrics for the same task > index to be > // counted multiple times, for example due to speculation or > re-attempts. 
> metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, > succeeded)) > } > } > {code} > In {{metrics.accumulatorIds.contains(acc.id)}}, if a large SQL application generates > many accumulators, using Array#contains is inefficient. > As a result, the application may time out while quitting and be killed by the RM in YARN > mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
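One possible fix for the lookup cost called out above (a sketch, not the actual patch): materialize the accumulator ids into a hash set once per stage, so each membership test is an expected O(1) lookup instead of an O(n) array scan.
{code:scala}
import scala.collection.immutable.HashSet

// Stand-in for LiveStageMetrics#accumulatorIds (names are illustrative).
val accumulatorIds: Array[Long] = Array(1L, 5L, 42L)

// Build the set once per stage...
val accumulatorIdSet: Set[Long] = HashSet(accumulatorIds: _*)

// ...then the filter predicate becomes an O(1) expected-time lookup:
accumulatorIdSet.contains(42L)  // instead of accumulatorIds.contains(42L)
{code}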
[jira] [Updated] (SPARK-25444) Refactor GenArrayData.genCodeToCreateArrayData() method
[ https://issues.apache.org/jira/browse/SPARK-25444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25444: Fix Version/s: (was: 2.5.0) 3.0.0 > Refactor GenArrayData.genCodeToCreateArrayData() method > --- > > Key: SPARK-25444 > URL: https://issues.apache.org/jira/browse/SPARK-25444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 3.0.0 > > > {{GenArrayData.genCodeToCreateArrayData()}} generated Java code to create a > temporary Java array to create {{ArrayData}}. It can be eliminated by using > {{ArrayData.createArrayData}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25447) Support JSON options by schema_of_json
[ https://issues.apache.org/jira/browse/SPARK-25447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25447: Fix Version/s: (was: 2.5.0) 3.0.0 > Support JSON options by schema_of_json > -- > > Key: SPARK-25447 > URL: https://issues.apache.org/jira/browse/SPARK-25447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The function schema_of_json doesn't currently accept any options, but the > options can impact schema inference. We need to support the same options that > from_json() can use for schema inference. Here are examples of options that > could impact schema inference: > * primitivesAsString > * prefersDecimal > * allowComments > * allowUnquotedFieldNames > * allowSingleQuotes > * allowNumericLeadingZeros > * allowNonNumericNumbers > * allowBackslashEscapingAnyCharacter > * allowUnquotedControlChars > Below is a possible signature: > {code:scala} > def schema_of_json(e: Column, options: java.util.Map[String, String]): Column > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
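Assuming the overload proposed above, a call site might look like the following (a sketch: the two-argument form is the ticket's proposal, not a released signature, and the sample JSON is illustrative):
{code:scala}
import org.apache.spark.sql.functions.{lit, schema_of_json}
import scala.collection.JavaConverters._

// With allowNumericLeadingZeros enabled, a literal like 007 can be parsed,
// which could change the inferred schema compared to the default options.
val options = Map("allowNumericLeadingZeros" -> "true").asJava
spark.range(1)
  .select(schema_of_json(lit("""{"id": 007}"""), options))
  .show(truncate = false)
{code}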
[jira] [Updated] (SPARK-25465) Refactor Parquet test suites in project Hive
[ https://issues.apache.org/jira/browse/SPARK-25465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25465: Fix Version/s: (was: 2.5.0) 3.0.0 > Refactor Parquet test suites in project Hive > > > Key: SPARK-25465 > URL: https://issues.apache.org/jira/browse/SPARK-25465 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > Currently the file > parquetSuites.scala (https://github.com/apache/spark/blob/f29c2b5287563c0d6f55f936bd5a75707d7b2b1f/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala) > is not easy to recognize. > When I tried to find the test suites for built-in Parquet conversions for the Hive > serde, I could only find > HiveParquetSuite (https://github.com/apache/spark/blob/f29c2b5287563c0d6f55f936bd5a75707d7b2b1f/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala) > within the first few minutes. > The file name and test suite naming should be revised. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25476) Refactor AggregateBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25476: Fix Version/s: (was: 2.5.0) 3.0.0 > Refactor AggregateBenchmark to use main method > -- > > Key: SPARK-25476 > URL: https://issues.apache.org/jira/browse/SPARK-25476 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25473) PySpark ForeachWriter test fails on Python 3.6 and macOS High Sierra
[ https://issues.apache.org/jira/browse/SPARK-25473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25473: Fix Version/s: (was: 2.5.0) 3.0.0 > PySpark ForeachWriter test fails on Python 3.6 and macOS High Serria > > > Key: SPARK-25473 > URL: https://issues.apache.org/jira/browse/SPARK-25473 > Project: Spark > Issue Type: Bug > Components: PySpark, Structured Streaming >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > {code} > PYSPARK_PYTHON=python3.6 SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests > SQLTests > {code} > {code} > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > /usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py:766: > ResourceWarning: subprocess 27563 is still running > ResourceWarning, source=self) > [Stage 0:> (0 + 1) / > 1]objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in > progress in another thread when fork() was called. > objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in > progress in another thread when fork() was called. We cannot safely call it > or ignore it in the fork() child process. Crashing instead. Set a breakpoint > on objc_initializeAfterForkError to debug. > ERROR > == > ERROR: test_streaming_foreach_with_simple_function > (pyspark.sql.tests.SQLTests) > -- > Traceback (most recent call last): > File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling > o54.processAllAvailable. > : org.apache.spark.sql.streaming.StreamingQueryException: Writing job aborted. 
> === Streaming Query === > Identifier: [id = f508d634-407c-4232-806b-70e54b055c42, runId = > 08d1435b-5358-4fb6-b167-811584a3163e] > Current Committed Offsets: {} > Current Available Offsets: > {FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s]: > {"logOffset":0}} > Current State: ACTIVE > Thread State: RUNNABLE > Logical Plan: > FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s] > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: org.apache.spark.SparkException: Writing job aborted. > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:91) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) > at > 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) > at org.apache.spark.sql.Dataset.collect(Dataset.scala:2783) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$
[jira] [Updated] (SPARK-25486) Refactor SortBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25486: Fix Version/s: (was: 2.5.0) 3.0.0 > Refactor SortBenchmark to use main method > - > > Key: SPARK-25486 > URL: https://issues.apache.org/jira/browse/SPARK-25486 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.5.0 >Reporter: yucai >Assignee: yucai >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25499) Refactor BenchmarkBase and Benchmark
[ https://issues.apache.org/jira/browse/SPARK-25499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25499: Fix Version/s: (was: 2.5.0) 3.0.0 > Refactor BenchmarkBase and Benchmark > > > Key: SPARK-25499 > URL: https://issues.apache.org/jira/browse/SPARK-25499 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > Currently there are two classes with the same naming BenchmarkBase: > 1. org.apache.spark.util.BenchmarkBase > 2. org.apache.spark.sql.execution.benchmark.BenchmarkBase > Here I propose: > 1. the package org.apache.spark.util.BenchmarkBase should be in test package, > move to org.apache.spark.sql.execution.benchmark . > 2. Rename the org.apache.spark.sql.execution.benchmark.BenchmarkBase as > BenchmarkWithCodegen > 3. Move org.apache.spark.util.Benchmark to test package of > org.apache.spark.sql.execution.benchmark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25489) Refactor UDTSerializationBenchmark
[ https://issues.apache.org/jira/browse/SPARK-25489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25489: Fix Version/s: (was: 2.5.0) 3.0.0 > Refactor UDTSerializationBenchmark > -- > > Key: SPARK-25489 > URL: https://issues.apache.org/jira/browse/SPARK-25489 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Fix For: 3.0.0 > > > Refactor UDTSerializationBenchmark to use main method and print the output as > a separate file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25481) Refactor ColumnarBatchBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25481: Fix Version/s: (was: 2.5.0) 3.0.0 > Refactor ColumnarBatchBenchmark to use main method > -- > > Key: SPARK-25481 > URL: https://issues.apache.org/jira/browse/SPARK-25481 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.5.0 >Reporter: yucai >Assignee: yucai >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25485) Refactor UnsafeProjectionBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25485: Fix Version/s: (was: 2.5.0) 3.0.0 > Refactor UnsafeProjectionBenchmark to use main method > - > > Key: SPARK-25485 > URL: https://issues.apache.org/jira/browse/SPARK-25485 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.5.0 >Reporter: yucai >Assignee: yucai >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25478) Refactor CompressionSchemeBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25478: Fix Version/s: (was: 2.5.0) 3.0.0 > Refactor CompressionSchemeBenchmark to use main method > -- > > Key: SPARK-25478 > URL: https://issues.apache.org/jira/browse/SPARK-25478 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25494) Upgrade Spark's use of Janino to 3.0.10
[ https://issues.apache.org/jira/browse/SPARK-25494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25494: Fix Version/s: (was: 2.5.0) 3.0.0 > Upgrade Spark's use of Janino to 3.0.10 > --- > > Key: SPARK-25494 > URL: https://issues.apache.org/jira/browse/SPARK-25494 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Major > Fix For: 3.0.0 > > > This ticket proposes to upgrade Spark's use of Janino from 3.0.9 to 3.0.10. > Note that 3.0.10 is an out-of-band release specifically for fixing an integer > overflow issue in Janino's {{ClassFile}} reader. It is otherwise exactly the > same as 3.0.9, so it's a low-risk, compatible upgrade. > The integer overflow issue affects Spark SQL's codegen stats collection: when > a generated Class file is huge, especially when the constant pool size is > above {{Short.MAX_VALUE}}, Janino's {{ClassFile}} reader will throw an > exception when Spark wants to parse the generated Class file to collect > stats. So we'll miss the stats of some huge Class files. > The Janino fix is tracked by this issue: > https://github.com/janino-compiler/janino/issues/58 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25487) Refactor PrimitiveArrayBenchmark
[ https://issues.apache.org/jira/browse/SPARK-25487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25487: Fix Version/s: (was: 2.5.0) 3.0.0 > Refactor PrimitiveArrayBenchmark > > > Key: SPARK-25487 > URL: https://issues.apache.org/jira/browse/SPARK-25487 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Fix For: 3.0.0 > > > Refactor PrimitiveArrayBenchmark to use main method and print the output as a > separate file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25508) Refactor OrcReadBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25508: Fix Version/s: (was: 2.5.0) 3.0.0 > Refactor OrcReadBenchmark to use main method > > > Key: SPARK-25508 > URL: https://issues.apache.org/jira/browse/SPARK-25508 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.5.0 >Reporter: yucai >Assignee: yucai >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25510) Create a new trait SqlBasedBenchmark
[ https://issues.apache.org/jira/browse/SPARK-25510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25510: Fix Version/s: (was: 2.5.0) 3.0.0 > Create a new trait SqlBasedBenchmark > - > > Key: SPARK-25510 > URL: https://issues.apache.org/jira/browse/SPARK-25510 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25534) Make `SQLHelper` trait
[ https://issues.apache.org/jira/browse/SPARK-25534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25534: Fix Version/s: (was: 2.5.0) 3.0.0 > Make `SQLHelper` trait > -- > > Key: SPARK-25534 > URL: https://issues.apache.org/jira/browse/SPARK-25534 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > Currently, Spark has 7 `withTempPath` and 6 `withSQLConf` functions. This PR > aims to remove duplicated and inconsistent code and reduce them to the > following meaningful implementations. > *withTempPath* > - `SQLHelper.withTempPath`: The one which was used in `SQLTestUtils`. > *withSQLConf* > - `SQLHelper.withSQLConf`: The one which was used in `PlanTest`. > - `ExecutorSideSQLConfSuite.withSQLConf`: The one which doesn't throw > `AnalysisException` on StaticConf changes. > - `SQLTestUtils.withSQLConf`: The one which overrides intentionally to change > the active session. > {code} > protected override def withSQLConf(pairs: (String, String)*)(f: => Unit): > Unit = { > SparkSession.setActiveSession(spark) > super.withSQLConf(pairs: _*)(f) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
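The set-then-restore contract shared by the several {{withSQLConf}} variants above can be sketched outside Spark. This is a hypothetical plain-Python analogue over a plain dict — not Spark's implementation — showing what a single consolidated helper must guarantee: set the pairs, run the body, and restore prior values even when the body fails.

```python
from contextlib import contextmanager

conf = {}  # stands in for the session's SQL conf store

@contextmanager
def with_sql_conf(pairs):
    """Temporarily set config pairs, restoring prior values on exit."""
    saved = {k: conf.get(k) for k in pairs}  # remember previous state
    conf.update(pairs)
    try:
        yield
    finally:
        # Restore even if the body raised, deleting keys that were unset before
        for k, old in saved.items():
            if old is None:
                conf.pop(k, None)
            else:
                conf[k] = old

conf["spark.sql.shuffle.partitions"] = "200"
with with_sql_conf({"spark.sql.shuffle.partitions": "5"}):
    assert conf["spark.sql.shuffle.partitions"] == "5"
assert conf["spark.sql.shuffle.partitions"] == "200"  # restored
```

The variants the ticket keeps differ only in what happens around this core (static-conf checks, setting the active session), which is what makes a shared `SQLHelper` trait feasible.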
[jira] [Updated] (SPARK-25541) CaseInsensitiveMap should be serializable after '-' operator
[ https://issues.apache.org/jira/browse/SPARK-25541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25541: Fix Version/s: (was: 2.5.0) 3.0.0 > CaseInsensitiveMap should be serializable after '-' operator > > > Key: SPARK-25541 > URL: https://issues.apache.org/jira/browse/SPARK-25541 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25540) Make HiveContext in PySpark behave the same as in Scala.
[ https://issues.apache.org/jira/browse/SPARK-25540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25540: Fix Version/s: (was: 2.5.0) 3.0.0 > Make HiveContext in PySpark behave the same as in Scala. > > > Key: SPARK-25540 > URL: https://issues.apache.org/jira/browse/SPARK-25540 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.0.0 > > > In Scala, {{HiveContext}} sets a config {{spark.sql.catalogImplementation}} > of the given {{SparkContext}} and then passes it to {{SparkSession.builder}}. > The {{HiveContext}} in PySpark should behave the same as it does in Scala. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25525) Do not update conf for existing SparkContext in SparkSession.getOrCreate.
[ https://issues.apache.org/jira/browse/SPARK-25525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25525: Fix Version/s: (was: 2.5.0) 3.0.0 > Do not update conf for existing SparkContext in SparkSession.getOrCreate. > - > > Key: SPARK-25525 > URL: https://issues.apache.org/jira/browse/SPARK-25525 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.0.0 > > > In SPARK-20946, we modified {{SparkSession.getOrCreate}} to not update conf > for existing {{SparkContext}} because {{SparkContext}} is shared by all > sessions. > We should not update it in PySpark side as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25551) Remove unused InSubquery expression
[ https://issues.apache.org/jira/browse/SPARK-25551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25551: Fix Version/s: (was: 2.5.0) 3.0.0 > Remove unused InSubquery expression > --- > > Key: SPARK-25551 > URL: https://issues.apache.org/jira/browse/SPARK-25551 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Trivial > Fix For: 3.0.0 > > > SPARK-16958 introduced an {{InSubquery}} expression. Its only usage was > removed in SPARK-18874. Hence it is no longer used and can be > removed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25514) Generating pretty JSON by to_json
[ https://issues.apache.org/jira/browse/SPARK-25514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25514: Fix Version/s: (was: 2.5.0) 3.0.0 > Generating pretty JSON by to_json > - > > Key: SPARK-25514 > URL: https://issues.apache.org/jira/browse/SPARK-25514 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > It would be nice to have an option, for example *"pretty"*, which enables > a special output mode for the to_json function. In this mode, the produced JSON > string will have an easily readable representation. For example: > {code:scala} > val json = > """[{"book":{"publisher":[{"country":"NL","year":[1981,1986,1999]}]}}]""" > to_json(from_json('col, ...), Map("pretty" -> "true")) > [ { > "book" : { > "publisher" : [ { > "country" : "NL", > "year" : [ 1981, 1986, 1999 ] > } ] > } > } ] > {code} > There are at least two use cases: > # Exploring the content of nested columns. For example, the result of your query is > a few rows, and some columns have a deeply nested structure. And you want to > analyze and find the value of one of the nested fields. > # You already have JSON in one of the columns, and want to explore the JSON > records. The new option will allow doing that easily, without copy-pasting JSON > content into an editor, by combining the from_json and to_json functions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
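The effect the proposed *"pretty"* option aims for is the familiar difference between compact and indented serialization. As an analogy only — Python's standard {{json}} module, not Spark's {{to_json}} — using the ticket's sample record:

```python
import json

# The nested record from the ticket's example
record = [{"book": {"publisher": [{"country": "NL", "year": [1981, 1986, 1999]}]}}]

# Compact form: what to_json produces today
compact = json.dumps(record, separators=(",", ":"))
# Indented form: the readable representation the "pretty" option would enable
pretty = json.dumps(record, indent=2)

print(compact)
print(pretty)
```

The second form is what makes eyeballing deeply nested fields practical without pasting the JSON into an external editor.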
[jira] [Updated] (SPARK-25559) Just remove the unsupported predicates in Parquet
[ https://issues.apache.org/jira/browse/SPARK-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25559: Fix Version/s: (was: 2.5.0) 3.0.0 > Just remove the unsupported predicates in Parquet > - > > Key: SPARK-25559 > URL: https://issues.apache.org/jira/browse/SPARK-25559 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 3.0.0 > > > Currently, in *ParquetFilters*, if one of the child predicates is not > supported by Parquet, the entire predicate will be thrown away. In fact, if > the unsupported predicate is in the top-level *And* condition, or in a child > reached before hitting a *Not* or *Or* condition, it's safe to just remove the > unsupported one and report it as an unhandled filter. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
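The safety argument above — dropping a conjunct only *weakens* a pushed-down filter (Spark re-applies the full predicate anyway), while dropping anything under *Or* or *Not* could change its meaning — can be sketched on a toy predicate tree. This is a hedged illustration in Python; the tuple encoding and names are invented, not *ParquetFilters* code.

```python
# Toy predicate trees: ("and", l, r), ("or", l, r), ("not", p), ("leaf", name).
# Leaves whose name is in UNSUPPORTED stand for predicates the source
# cannot evaluate (the set is a hypothetical stand-in).
UNSUPPORTED = {"udf_pred"}

def prune(p):
    """Return a weaker predicate that is safe to push down, or None."""
    kind = p[0]
    if kind == "leaf":
        return None if p[1] in UNSUPPORTED else p
    if kind == "and":
        l, r = prune(p[1]), prune(p[2])
        if l and r:
            return ("and", l, r)
        return l or r  # dropping one conjunct only weakens the filter: safe
    # Under Or/Not, dropping a child would strengthen or flip the predicate,
    # so the subtree survives only if it is entirely supported.
    if kind == "or":
        l, r = prune(p[1]), prune(p[2])
        return ("or", l, r) if (l == p[1] and r == p[2]) else None
    if kind == "not":
        c = prune(p[1])
        return ("not", c) if c == p[1] else None

pred = ("and", ("leaf", "a > 1"), ("leaf", "udf_pred"))
print(prune(pred))  # ('leaf', 'a > 1') — the supported conjunct is kept
```

The same tree under an *Or* root yields `None`: neither disjunct can be pushed alone without changing which rows the filter admits.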
[jira] [Updated] (SPARK-25565) Add a Scala style checker to enforce adding Locale.ROOT to .toLowerCase and .toUpperCase for internal calls
[ https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25565: Fix Version/s: (was: 2.5.0) 3.0.0 > Add scala style checker to check add Locale.ROOT to .toLowerCase and > .toUpperCase for internal calls > > > Key: SPARK-25565 > URL: https://issues.apache.org/jira/browse/SPARK-25565 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.5.0 >Reporter: Yuming Wang >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25575) The SQL tab in the Spark UI doesn't have the option of hiding tables, even though other UI tabs have it.
[ https://issues.apache.org/jira/browse/SPARK-25575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25575: Fix Version/s: (was: 2.5.0) 3.0.0 > The SQL tab in the Spark UI doesn't have the option of hiding tables, even though > other UI tabs have it. > - > > Key: SPARK-25575 > URL: https://issues.apache.org/jira/browse/SPARK-25575 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.1 >Reporter: shahid >Assignee: shahid >Priority: Minor > Fix For: 3.0.0 > > Attachments: Screenshot from 2018-09-29 23-26-45.png, Screenshot from > 2018-09-29 23-26-57.png > > > Test steps: > 1) bin/spark-shell > {code:java} > sql("create table a (id int)") > for(i <- 1 to 100) sql(s"insert into a values ($i)") > {code} > Open the SQL tab in the web UI, > !Screenshot from 2018-09-29 23-26-45.png! > Open the Jobs tab, > !Screenshot from 2018-09-29 23-26-57.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25592) Bump master branch version to 3.0.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-25592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25592. - Resolution: Fixed Fix Version/s: 3.0.0 > Bump master branch version to 3.0.0-SNAPSHOT > > > Key: SPARK-25592 > URL: https://issues.apache.org/jira/browse/SPARK-25592 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 3.0.0 > > > This patch bumps the master branch version to `3.0.0-SNAPSHOT`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25592) Bump master branch version to 3.0.0-SNAPSHOT
Xiao Li created SPARK-25592: --- Summary: Bump master branch version to 3.0.0-SNAPSHOT Key: SPARK-25592 URL: https://issues.apache.org/jira/browse/SPARK-25592 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Xiao Li Assignee: Xiao Li This patch bumps the master branch version to `3.0.0-SNAPSHOT`. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23210) Introduce the concept of default value to schema
[ https://issues.apache.org/jira/browse/SPARK-23210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23210: Target Version/s: 3.0.0 > Introduce the concept of default value to schema > > > Key: SPARK-23210 > URL: https://issues.apache.org/jira/browse/SPARK-23210 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: LvDongrong >Priority: Major > > There is no concept of a DEFAULT VALUE for schemas in Spark now. > Our team wants to support inserting into a subset of a table's columns, like "insert > into (a, c) values ("value1", "value2")" for our use case, but the default > value of a column is not defined. In Hive, the default value of a column is > NULL if we don't specify one. > So I think it may be necessary to introduce the concept of a default value to > schemas in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25453) OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
[ https://issues.apache.org/jira/browse/SPARK-25453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25453. - Resolution: Fixed Fix Version/s: 2.4.0 > OracleIntegrationSuite IllegalArgumentException: Timestamp format must be > yyyy-mm-dd hh:mm:ss[.fffffffff] > - > > Key: SPARK-25453 > URL: https://issues.apache.org/jira/browse/SPARK-25453 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Chenxiao Mao >Priority: Major > Fix For: 2.4.0 > > > {noformat} > - SPARK-22814 support date/timestamp types in partitionColumn *** FAILED *** > java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd > hh:mm:ss[.fffffffff] > at java.sql.Timestamp.valueOf(Timestamp.java:204) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.toInternalBoundValue(JDBCRelation.scala:183) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:88) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) > at > org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:445) > at > org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:427) > ...{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25453) OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
[ https://issues.apache.org/jira/browse/SPARK-25453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-25453: --- Assignee: Chenxiao Mao > OracleIntegrationSuite IllegalArgumentException: Timestamp format must be > yyyy-mm-dd hh:mm:ss[.fffffffff] > - > > Key: SPARK-25453 > URL: https://issues.apache.org/jira/browse/SPARK-25453 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Chenxiao Mao >Priority: Major > > {noformat} > - SPARK-22814 support date/timestamp types in partitionColumn *** FAILED *** > java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd > hh:mm:ss[.fffffffff] > at java.sql.Timestamp.valueOf(Timestamp.java:204) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.toInternalBoundValue(JDBCRelation.scala:183) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:88) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) > at > org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:445) > at > org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:427) > ...{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25576) Fix lint failure in 2.2
Xiao Li created SPARK-25576: --- Summary: Fix lint failure in 2.2 Key: SPARK-25576 URL: https://issues.apache.org/jira/browse/SPARK-25576 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.2.2 Reporter: Xiao Li See the errors: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.2-lint/913/console -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25568) Continue to update the remaining accumulators when failing to update one accumulator
[ https://issues.apache.org/jira/browse/SPARK-25568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25568. - Resolution: Fixed Fix Version/s: 2.4.0 2.3.3 2.2.3 > Continue to update the remaining accumulators when failing to update one > accumulator > > > Key: SPARK-25568 > URL: https://issues.apache.org/jira/browse/SPARK-25568 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.2.3, 2.3.3, 2.4.0 > > > Currently when failing to update an accumulator, > DAGScheduler.updateAccumulators will skip the remaining accumulators. We > should try to update the remaining accumulators if possible so that they can > still report correct values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
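The fix pattern described above is per-item exception isolation: catch and log the failure for one accumulator instead of letting it abort the whole update loop. A hedged Python sketch of that pattern — not {{DAGScheduler}} code; all names here are illustrative.

```python
def update_accumulators(updates, apply_update, log):
    """Apply each accumulator update independently; one failure no longer
    skips the remaining updates (sketch of the fix pattern, not Spark code)."""
    applied = []
    for acc_id, value in updates:
        try:
            apply_update(acc_id, value)
            applied.append(acc_id)
        except Exception as e:  # isolate the per-accumulator failure
            log(f"failed to update accumulator {acc_id}: {e}")
    return applied

store = {}

def apply_update(acc_id, value):
    if acc_id == 2:
        raise ValueError("bad metadata")  # simulated failing accumulator
    store[acc_id] = store.get(acc_id, 0) + value

applied = update_accumulators([(1, 10), (2, 5), (3, 7)], apply_update, print)
print(applied)  # [1, 3] -> accumulator 3 is still updated after 2 fails
```

Before the fix, the equivalent loop would have returned after the failure on accumulator 2, leaving accumulator 3 reporting a stale value.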
[jira] [Created] (SPARK-25573) Combine resolveExpression and resolve in the rule ResolveReferences
Xiao Li created SPARK-25573: --- Summary: Combine resolveExpression and resolve in the rule ResolveReferences Key: SPARK-25573 URL: https://issues.apache.org/jira/browse/SPARK-25573 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Xiao Li In the rule ResolveReferences, two private functions `resolve` and `resolveExpression` should be combined. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
[ https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25429. - Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 2.5.0 > SparkListenerBus inefficient due to > 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure > > > Key: SPARK-25429 > URL: https://issues.apache.org/jira/browse/SPARK-25429 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: DENG FEI >Assignee: Yuming Wang >Priority: Major > Fix For: 2.5.0 > > > {code:java} > private def updateStageMetrics( > stageId: Int, > attemptId: Int, > taskId: Long, > accumUpdates: Seq[AccumulableInfo], > succeeded: Boolean): Unit = { > Option(stageMetrics.get(stageId)).foreach { metrics => > if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) { > return > } > val oldTaskMetrics = metrics.taskMetrics.get(taskId) > if (oldTaskMetrics != null && oldTaskMetrics.succeeded) { > return > } > val updates = accumUpdates > .filter { acc => acc.update.isDefined && > metrics.accumulatorIds.contains(acc.id) } > .sortBy(_.id) > if (updates.isEmpty) { > return > } > val ids = new Array[Long](updates.size) > val values = new Array[Long](updates.size) > updates.zipWithIndex.foreach { case (acc, idx) => > ids(idx) = acc.id > // In a live application, accumulators have Long values, but when > reading from event > // logs, they have String values. For now, assume all accumulators > are Long and covert > // accordingly. > values(idx) = acc.update.get match { > case s: String => s.toLong > case l: Long => l > case o => throw new IllegalArgumentException(s"Unexpected: $o") > } > } > // TODO: storing metrics by task ID can cause metrics for the same task > index to be > // counted multiple times, for example due to speculation or > re-attempts. 
> metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, > succeeded)) > } > } > {code} > Regarding 'metrics.accumulatorIds.contains(acc.id)': if a large SQL application generates > many accumulators, it is inefficient to use Array#contains. > In practice, the application may time out while quitting and be killed by the RM in YARN > mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
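The cost the report points at is membership testing: `Array#contains` scans linearly on every accumulator lookup, while a hash set answers in expected constant time. A plain-Python sketch of the same contrast (the variable names are illustrative, not the patched Spark code):

```python
# accumulator_ids_* stand in for LiveStageMetrics#accumulatorIds.
accumulator_ids_list = list(range(100_000))      # Array-like: O(n) per `contains`
accumulator_ids_set = set(accumulator_ids_list)  # HashSet-like: O(1) expected

# Both structures give the same answers...
assert (99_999 in accumulator_ids_list) == (99_999 in accumulator_ids_set)
assert (-1 in accumulator_ids_list) == (-1 in accumulator_ids_set)

# ...but filtering many accumulator updates against the list costs
# O(updates * ids), while against the set it costs O(updates).
updates = [0, 50_000, 99_999, 123_456]
handled = [u for u in updates if u in accumulator_ids_set]
print(handled)  # [0, 50000, 99999]
```

With thousands of accumulators per stage and an update per task, the O(updates * ids) behavior is what made the listener bus fall behind.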
[jira] [Resolved] (SPARK-25458) Support FOR ALL COLUMNS in ANALYZE TABLE
[ https://issues.apache.org/jira/browse/SPARK-25458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25458. - Resolution: Fixed Assignee: Dilip Biswal Fix Version/s: 2.5.0 > Support FOR ALL COLUMNS in ANALYZE TABLE > - > > Key: SPARK-25458 > URL: https://issues.apache.org/jira/browse/SPARK-25458 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Xiao Li >Assignee: Dilip Biswal >Priority: Major > Fix For: 2.5.0 > > > Currently, to collect the statistics of all the columns, users need to > specify the names of all the columns when calling the command "ANALYZE TABLE > ... FOR COLUMNS...". This is not user friendly. Instead, we can introduce the > following SQL command to achieve it without specifying the column names. > {code:java} >ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR ALL COLUMNS; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order
[ https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-25505: --- Assignee: Maryann Xue > The output order of grouping columns in Pivot is different from the input > order > --- > > Key: SPARK-25505 > URL: https://issues.apache.org/jira/browse/SPARK-25505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > Fix For: 2.4.0 > > > For example, > {code} > SELECT * FROM ( > SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, > "x" as x, "d" as d, "w" as w > FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ('dotNET', 'Java') > ) > {code} > The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, > b, c, d, w, x, y, z, ..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order
[ https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25505: Fix Version/s: 2.4.0 > The output order of grouping columns in Pivot is different from the input > order > --- > > Key: SPARK-25505 > URL: https://issues.apache.org/jira/browse/SPARK-25505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > Fix For: 2.4.0 > > > For example, > {code} > SELECT * FROM ( > SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, > "x" as x, "d" as d, "w" as w > FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ('dotNET', 'Java') > ) > {code} > The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, > b, c, d, w, x, y, z, ..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order
[ https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25505. - Resolution: Fixed > The output order of grouping columns in Pivot is different from the input > order > --- > > Key: SPARK-25505 > URL: https://issues.apache.org/jira/browse/SPARK-25505 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > Fix For: 2.4.0 > > > For example, > {code} > SELECT * FROM ( > SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, > "x" as x, "d" as d, "w" as w > FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ('dotNET', 'Java') > ) > {code} > The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, > b, c, d, w, x, y, z, ..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25454) Division between operands with negative scale can cause precision loss
[ https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25454. - Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 2.4.0 2.3.3 > Division between operands with negative scale can cause precision loss > -- > > Key: SPARK-25454 > URL: https://issues.apache.org/jira/browse/SPARK-25454 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Marco Gaido >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.3, 2.4.0 > > > The issue was originally reported by [~bersprockets] here: > https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104. > The problem consists in a precision loss when the second operand of the > division is a decimal with a negative scale. It was also present before 2.3 > but it was harder to reproduce: you had to do something like > {{lit(BigDecimal(100e6))}}, while now this can happen more frequently with > SQL constants. > The problem is that our logic is taken from Hive and SQLServer, where decimals > with negative scales are not allowed. We might also consider enforcing this > too in 3.0 eventually. Meanwhile we can fix the logic for computing the > result type for a division. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
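What a "negative scale" means can be made concrete with Python's {{Decimal}}, whose tuple form exposes digits and exponent (SQL scale is the negated exponent). This only illustrates the notion of a value like {{100e6}} carrying a negative scale; Spark's result-type rule for division is a separate formula not reproduced here.

```python
from decimal import Decimal

# 100e6 can be represented as the single digit 1 with exponent 8,
# i.e. precision 1 and scale -8 in SQL terms - the "negative scale"
# case that Hive/SQLServer-derived typing rules do not anticipate.
d = Decimal("1E+8")
print(d.as_tuple())  # DecimalTuple(sign=0, digits=(1,), exponent=8)
print(int(d))        # 100000000
```

When such a value appears as a division's second operand, a result-type formula that assumes scale >= 0 can allocate too few fractional digits, which is the precision loss the ticket fixes.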
[jira] [Commented] (SPARK-23839) consider bucket join in cost-based JoinReorder rule
[ https://issues.apache.org/jira/browse/SPARK-23839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628080#comment-16628080 ] Xiao Li commented on SPARK-23839: - To implement CBO in the planner, we need a major change in our planner. The stats-based JoinReorder rule is just the current workaround until we build the actual cost-based optimizer. > consider bucket join in cost-based JoinReorder rule > --- > > Key: SPARK-23839 > URL: https://issues.apache.org/jira/browse/SPARK-23839 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiaoju Wu >Priority: Minor > > The cost-based JoinReorder rule was implemented in Spark 2.2 and, in the Spark > 2.3 release, improved with histograms. However, it doesn't take into account the cost > of the different join implementations. For example: > TableA JOIN TableB JOIN TableC > TableA will output 10,000 rows after filter and projection. > TableB will output 10,000 rows after filter and projection. > TableC will output 8,000 rows after filter and projection. > The current JoinReorder rule will possibly optimize the plan to join TableC > with TableA first and then TableB. But if TableA and TableB are bucket > tables and BucketJoin can be applied, it could be a different story. > > Also, to support bucket join of more than 2 tables when one table's bucket number > is a multiple of another's (SPARK-17570), whether bucket join can take effect > depends on the result of JoinReorder. For example, for "A join B join C" with > bucket numbers like 8, 4, 12, the JoinReorder rule should optimize the order > to "A join B join C" to make the bucket join take effect, instead of "C join A > join B". > > Based on the current CBO JoinReorder, there are possibly 2 parts to be changed: > # The CostBasedJoinReorder rule is applied in the optimizer phase, while we do join > selection in the planner phase and bucket join optimization in EnsureRequirements, > which is in the preparation phase. 
Both are after the optimizer. > # The current statistics and join cost formula are based on data selectivity and > cardinality; we need to add statistics to represent the cost of join methods like > shuffle, sort, hash, etc. Also, we need to add these statistics to the > formula to estimate the join cost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25465) Refactor Parquet test suites in project Hive
[ https://issues.apache.org/jira/browse/SPARK-25465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25465. - Resolution: Fixed Assignee: Gengliang Wang Fix Version/s: 2.5.0 > Refactor Parquet test suites in project Hive > > > Key: SPARK-25465 > URL: https://issues.apache.org/jira/browse/SPARK-25465 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.5.0 > > > Currently, the file > parquetSuites.scala (https://github.com/apache/spark/blob/f29c2b5287563c0d6f55f936bd5a75707d7b2b1f/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala) > is hard to recognize from its name. > When I tried to find the test suites for the built-in Parquet conversions for Hive > serde, it took me a few minutes to find > HiveParquetSuite (https://github.com/apache/spark/blob/f29c2b5287563c0d6f55f936bd5a75707d7b2b1f/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala). > The file name and test suite naming can be revised. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24499) Documentation improvement of Spark core and SQL
[ https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16623799#comment-16623799 ] Xiao Li commented on SPARK-24499: - ping [~XuanYuan] Any update? > Documentation improvement of Spark core and SQL > --- > > Key: SPARK-24499 > URL: https://issues.apache.org/jira/browse/SPARK-24499 > Project: Spark > Issue Type: New Feature > Components: Documentation, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > The current documentation in Apache Spark does not have enough code examples and > tips. If needed, we should also split the page > https://spark.apache.org/docs/latest/sql-programming-guide.html into multiple > separate pages, as we did for > https://spark.apache.org/docs/latest/ml-guide.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25179) Document the features that require Pyarrow 0.10
[ https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25179: Issue Type: Sub-task (was: Documentation) Parent: SPARK-25507 > Document the features that require Pyarrow 0.10 > --- > > Key: SPARK-25179 > URL: https://issues.apache.org/jira/browse/SPARK-25179 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 > Environment: Document the features that require Pyarrow 0.10 . For > example, https://github.com/apache/spark/pull/20725 >Reporter: Xiao Li >Assignee: Bryan Cutler >Priority: Major > > binary type support requires pyarrow 0.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25507) Update documents for the new features in 2.4 release
Xiao Li created SPARK-25507: --- Summary: Update documents for the new features in 2.4 release Key: SPARK-25507 URL: https://issues.apache.org/jira/browse/SPARK-25507 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.4.0 Reporter: Xiao Li -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25494) Upgrade Spark's use of Janino to 3.0.10
[ https://issues.apache.org/jira/browse/SPARK-25494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25494. - Resolution: Fixed Assignee: Kris Mok Fix Version/s: 2.5.0 > Upgrade Spark's use of Janino to 3.0.10 > --- > > Key: SPARK-25494 > URL: https://issues.apache.org/jira/browse/SPARK-25494 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Major > Fix For: 2.5.0 > > > This ticket proposes to upgrade Spark's use of Janino from 3.0.9 to 3.0.10. > Note that 3.0.10 is an out-of-band release specifically for fixing an integer > overflow issue in Janino's {{ClassFile}} reader. It is otherwise exactly the > same as 3.0.9, so it's a low-risk and compatible upgrade. > The integer overflow issue affects Spark SQL's codegen stats collection: when > a generated Class file is huge, especially when the constant pool size is > above {{Short.MAX_VALUE}}, Janino's {{ClassFile}} reader throws an > exception when Spark tries to parse the generated Class file to collect > stats, so we miss the stats of some huge Class files. > The Janino fix is tracked by this issue: > https://github.com/janino-compiler/janino/issues/58 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
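The class of bug being fixed can be illustrated with a small sketch (hypothetical code, not Janino's actual reader): the JVM class-file format defines constant_pool_count as an unsigned 16-bit field, so a reader that decodes it as a signed short sees a negative count once the pool grows past Short.MAX_VALUE.

```python
import struct

def read_pool_count_signed(data: bytes) -> int:
    # Buggy variant: decodes the u2 field as a *signed* big-endian 16-bit int
    return struct.unpack(">h", data)[0]

def read_pool_count_unsigned(data: bytes) -> int:
    # Correct variant: the class-file format defines the field as unsigned (u2)
    return struct.unpack(">H", data)[0]

count = 40000  # a constant pool larger than Short.MAX_VALUE (32767)
raw = struct.pack(">H", count)
print(read_pool_count_signed(raw))    # -25536: overflowed, reader bails out
print(read_pool_count_unsigned(raw))  # 40000
```

The negative count is why the reader rejected huge generated class files and Spark lost their codegen stats.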
[jira] [Resolved] (SPARK-24777) Add write benchmark for AVRO
[ https://issues.apache.org/jira/browse/SPARK-24777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24777. - Resolution: Fixed Assignee: Gengliang Wang Fix Version/s: 2.4.0 > Add write benchmark for AVRO > > > Key: SPARK-24777 > URL: https://issues.apache.org/jira/browse/SPARK-24777 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25450) PushProjectThroughUnion rule uses the same exprId for project expressions in each Union child, causing mistakes in constant propagation
[ https://issues.apache.org/jira/browse/SPARK-25450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25450. - Resolution: Fixed Assignee: Maryann Xue Fix Version/s: 2.4.0 2.3.3 > PushProjectThroughUnion rule uses the same exprId for project expressions in > each Union child, causing mistakes in constant propagation > --- > > Key: SPARK-25450 > URL: https://issues.apache.org/jira/browse/SPARK-25450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Major > Fix For: 2.3.3, 2.4.0 > > > The problem was caused by the PushProjectThroughUnion rule, which, when > creating a new Project for each child of Union, uses the same exprId for > expressions at the same position. This is wrong because, for each child of > Union, the expressions are all independent, and it can lead to a wrong result > if other rules like FoldablePropagation kick in, treating two different > expressions as the same. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
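A minimal model of the failure mode, with invented helper names (this is not Spark's FoldablePropagation implementation): if two Union children reuse the same exprId for independent expressions, a naive constant-propagation pass that substitutes by id corrupts the second child, while fresh ids keep the branches independent.

```python
import itertools

_ids = itertools.count(1)

def fresh_id():
    # Hypothetical stand-in for minting a new, unique expression id per child
    return next(_ids)

def propagate_constants(union_children):
    # Naive folding pass: the first constant seen for an exprId wins and is
    # substituted everywhere that id appears -- sound only if equal ids
    # always denote the same expression.
    seen = {}
    for child in union_children:
        for expr_id, value in child:
            seen.setdefault(expr_id, value)
    return [[(expr_id, seen[expr_id]) for expr_id, _ in child]
            for child in union_children]

# Bug: both Union children reuse exprId 1 for independent expressions.
buggy_union = [[(1, "lit(1)")], [(1, "lit(2)")]]
buggy_result = propagate_constants(buggy_union)
print(buggy_result[1])  # child 2's lit(2) was wrongly replaced by lit(1)

# Fix: mint a fresh exprId for each child's expressions.
fixed_union = [[(fresh_id(), "lit(1)")], [(fresh_id(), "lit(2)")]]
fixed_result = propagate_constants(fixed_union)
print(fixed_result[1])  # lit(2) is preserved
```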
[jira] [Assigned] (SPARK-25419) Parquet predicate pushdown improvement
[ https://issues.apache.org/jira/browse/SPARK-25419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-25419: --- Assignee: Yuming Wang > Parquet predicate pushdown improvement > -- > > Key: SPARK-25419 > URL: https://issues.apache.org/jira/browse/SPARK-25419 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > > Parquet predicate pushdown support: ByteType, ShortType, DecimalType, > DateType, TimestampType. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25458) Support FOR ALL COLUMNS in ANALYZE TABLE
Xiao Li created SPARK-25458: --- Summary: Support FOR ALL COLUMNS in ANALYZE TABLE Key: SPARK-25458 URL: https://issues.apache.org/jira/browse/SPARK-25458 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.5.0 Reporter: Xiao Li Currently, to collect the statistics of all the columns, users need to specify the names of all the columns when calling the command "ANALYZE TABLE ... FOR COLUMNS...". This is not user-friendly. Instead, we can introduce the following SQL command to achieve it without specifying the column names. {code:java} ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR ALL COLUMNS; {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24151) CURRENT_DATE, CURRENT_TIMESTAMP incorrectly resolved as column names when caseSensitive is enabled
[ https://issues.apache.org/jira/browse/SPARK-24151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24151. - Resolution: Fixed Assignee: James Thompson Fix Version/s: 2.4.0 > CURRENT_DATE, CURRENT_TIMESTAMP incorrectly resolved as column names when > caseSensitive is enabled > -- > > Key: SPARK-24151 > URL: https://issues.apache.org/jira/browse/SPARK-24151 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: James Thompson >Assignee: James Thompson >Priority: Major > Fix For: 2.4.0 > > > After this change: https://issues.apache.org/jira/browse/SPARK-22333 > Running SQL such as "CURRENT_TIMESTAMP" can fail when spark.sql.caseSensitive has > been enabled: > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve '`CURRENT_TIMESTAMP`' > given input columns: [col1]{code} > This is because the analyzer incorrectly uses a case-sensitive > resolver to resolve the function. I will submit a PR with a fix and a test for > this. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
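A toy resolver sketch with assumed names (not Spark's analyzer code) showing why built-in function names such as CURRENT_TIMESTAMP must be matched case-insensitively even when column resolution honors spark.sql.caseSensitive.

```python
# Hypothetical registry of builtin niladic functions, stored lowercase
BUILTINS = {"current_timestamp", "current_date"}

def resolve(name: str, case_sensitive: bool) -> bool:
    if case_sensitive:
        # Buggy: applies the column-resolution resolver to function names,
        # so the SQL-standard uppercase spelling is not found
        return name in BUILTINS
    # Correct for builtins: function-name lookup is case-insensitive
    return name.lower() in BUILTINS

print(resolve("CURRENT_TIMESTAMP", case_sensitive=True))   # False -- the bug
print(resolve("CURRENT_TIMESTAMP", case_sensitive=False))  # True
```

The fix in the ticket amounts to always using the case-insensitive path for these function names, regardless of the column-resolution setting.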