[jira] [Resolved] (SPARK-25635) Support selective direct encoding in native ORC write

2018-10-05 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25635.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Support selective direct encoding in native ORC write
> -
>
> Key: SPARK-25635
> URL: https://issues.apache.org/jira/browse/SPARK-25635
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Before ORC 1.5.3, `orc.dictionary.key.threshold` and 
> `hive.exec.orc.dictionary.key.size.threshold` are applied to all columns. 
> This is a big hurdle to enabling dictionary encoding.
> In ORC 1.5.3, `orc.column.encoding.direct` was added to enforce direct 
> encoding selectively, in a column-wise manner. This issue aims to add that 
> feature by upgrading ORC from 1.5.2 to 1.5.3.
> The following are the patches in ORC 1.5.3; this feature is the only one 
> directly related to Spark.
> {code}
> ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts 
> multi-byte data (gopalv)
> ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
> ORC-405. Remove calcite as a dependency from the benchmarks.
> ORC-375: Fix libhdfs on gcc7 by adding #include  two places.
> ORC-383: Parallel builds fails with ConcurrentModificationException
> ORC-382: Apache rat exclusions + add rat check to travis
> ORC-401: Fix incorrect quoting in specification.
> ORC-385. Change RecordReader to extend Closeable.
> ORC-384: [C++] fix memory leak when loading non-ORC files
> ORC-391: [c++] parseType does not accept underscore in the field name
> ORC-397. Allow selective disabling of dictionary encoding. Original patch was 
> by Mithun Radhakrishnan.
> ORC-389: Add ability to not decode Acid metadata columns
> {code}
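> As a minimal sketch (not from the JIRA itself) of how the new option could be 
> used once Spark passes ORC writer options through; the column names and output 
> path are illustrative, and `df` stands for an existing DataFrame:
> {code:scala}
> // Hypothetical usage: force direct (non-dictionary) encoding for two columns.
> df.write
>   .option("orc.column.encoding.direct", "uuid,raw_payload")
>   .orc("/tmp/orc_direct_encoded")
> {code}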



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25626) HiveClientSuites: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 46 sec

2018-10-05 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25626.
-
   Resolution: Fixed
 Assignee: Dilip Biswal
Fix Version/s: 3.0.0

> HiveClientSuites: getPartitionsByFilter returns all partitions when 
> hive.metastore.try.direct.sql=false 46 sec
> --
>
> Key: SPARK-25626
> URL: https://issues.apache.org/jira/browse/SPARK-25626
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>
> HiveClientSuite.2.3: getPartitionsByFilter returns all partitions when 
> hive.metastore.try.direct.sql=false  46 sec  Passed
> HiveClientSuite.2.2: getPartitionsByFilter returns all partitions when 
> hive.metastore.try.direct.sql=false  45 sec  Passed
> HiveClientSuite.2.1: getPartitionsByFilter returns all partitions when 
> hive.metastore.try.direct.sql=false  42 sec  Passed
> HiveClientSuite.2.0: getPartitionsByFilter returns all partitions when 
> hive.metastore.try.direct.sql=false  39 sec  Passed
> HiveClientSuite.1.2: getPartitionsByFilter returns all partitions when 
> hive.metastore.try.direct.sql=false  37 sec  Passed
> HiveClientSuite.1.1: getPartitionsByFilter returns all partitions when 
> hive.metastore.try.direct.sql=false  36 sec  Passed



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20845) Support specification of column names in INSERT INTO

2018-10-05 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20845:

Target Version/s: 3.0.0

> Support specification of column names in INSERT INTO
> 
>
> Key: SPARK-20845
> URL: https://issues.apache.org/jira/browse/SPARK-20845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> Some databases allow you to specify column names for the target of an INSERT 
> INTO. For example, in SQLite:
> {code}
> sqlite> CREATE TABLE twocolumn (x INT, y INT); INSERT INTO twocolumn(x, y) 
> VALUES (44,51), (NULL,52), (42,53), (45,45)
>...> ;
> sqlite> select * from twocolumn;
> 44|51
> |52
> 42|53
> 45|45
> {code}
> I have a corpus of existing queries of this form which I would like to run on 
> Spark SQL, so I think we should extend our dialect to support this syntax.
> When implementing this, we should make sure to test the following behaviors 
> and corner-cases:
> - Number of columns specified is greater than or less than the number of 
> columns in the table.
> - Specification of repeated columns.
> - Specification of columns which do not exist in the target table.
> - Permuted column order instead of the default order in the table.
> For each of these, we should check how SQLite behaves and should also compare 
> against another database. It looks like T-SQL supports this; see 
> https://technet.microsoft.com/en-us/library/dd776381(v=sql.105).aspx under 
> the "Inserting data that is not in the same order as the table columns" 
> header.
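> As a sketch of what the extended dialect could accept (hypothetical until 
> implemented; table and column names are illustrative):
> {code:scala}
> // Hypothetical Spark SQL once column lists are supported in INSERT INTO.
> spark.sql("CREATE TABLE twocolumn (x INT, y INT) USING parquet")
> // Columns listed in a permuted order relative to the table definition.
> spark.sql("INSERT INTO twocolumn (y, x) VALUES (51, 44), (52, NULL)")
> spark.sql("SELECT * FROM twocolumn").show()
> {code}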



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25655) Add Pspark-ganglia-lgpl to the scala style check

2018-10-05 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25655:
---

 Summary: Add Pspark-ganglia-lgpl to the scala style check
 Key: SPARK-25655
 URL: https://issues.apache.org/jira/browse/SPARK-25655
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.0.0
Reporter: Xiao Li
Assignee: Xiao Li


Our lint failed due to the following errors:
{code}
[INFO] --- scalastyle-maven-plugin:1.0.0:check (default) @ 
spark-ganglia-lgpl_2.11 ---
error 
file=/home/jenkins/workspace/spark-master-maven-snapshots/spark/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala
 message=
  Are you sure that you want to use toUpperCase or toLowerCase without the 
root locale? In most cases, you
  should use toUpperCase(Locale.ROOT) or toLowerCase(Locale.ROOT) instead.
  If you must use toUpperCase or toLowerCase without the root locale, wrap 
the code block with
  // scalastyle:off caselocale
  .toUpperCase
  .toLowerCase
  // scalastyle:on caselocale
 line=67 column=49
error 
file=/home/jenkins/workspace/spark-master-maven-snapshots/spark/external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala
 message=
  Are you sure that you want to use toUpperCase or toLowerCase without the 
root locale? In most cases, you
  should use toUpperCase(Locale.ROOT) or toLowerCase(Locale.ROOT) instead.
  If you must use toUpperCase or toLowerCase without the root locale, wrap 
the code block with
  // scalastyle:off caselocale
  .toUpperCase
  .toLowerCase
  // scalastyle:on caselocale
 line=71 column=32
Saving to 
outputFile=/home/jenkins/workspace/spark-master-maven-snapshots/spark/external/spark-ganglia-lgpl/target/scalastyle-output.xml
{code}

See 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/8890/
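The fix the rule asks for is a locale-pinned casing call; a minimal sketch of 
the pattern (the value below is illustrative, not the actual GangliaSink code):
{code:scala}
import java.util.Locale

// Locale-safe casing, as required by the "caselocale" scalastyle rule.
val mode = "Average"
val normalized = mode.toUpperCase(Locale.ROOT)
{code}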




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25606) DateExpressionsSuite: Hour 1 min

2018-10-04 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25606.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> DateExpressionsSuite: Hour 1 min
> 
>
> Key: SPARK-25606
> URL: https://issues.apache.org/jira/browse/SPARK-25606
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite.Hour 1 min



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25609) DataFrameSuite: SPARK-22226: splitExpressions should not generate codes beyond 64KB 49 seconds

2018-10-04 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25609.
-
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 3.0.0

> DataFrameSuite: SPARK-22226: splitExpressions should not generate codes 
> beyond 64KB 49 seconds
> --
>
> Key: SPARK-25609
> URL: https://issues.apache.org/jira/browse/SPARK-25609
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> org.apache.spark.sql.DataFrameSuite.SPARK-22226: splitExpressions should not 
> generate codes beyond 64KB 49 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25605) CastSuite: cast string to timestamp 2 mins 31 sec

2018-10-04 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25605.
-
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 3.0.0

> CastSuite: cast string to timestamp 2 mins 31 sec
> -
>
> Key: SPARK-25605
> URL: https://issues.apache.org/jira/browse/SPARK-25605
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> org.apache.spark.sql.catalyst.expressions.CastSuite.cast string to timestamp 
> took 2 min 31 secs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25606) DateExpressionsSuite: Hour 1 min

2018-10-04 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25606:
---

Assignee: Yuming Wang

> DateExpressionsSuite: Hour 1 min
> 
>
> Key: SPARK-25606
> URL: https://issues.apache.org/jira/browse/SPARK-25606
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>Priority: Major
>
> org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite.Hour 1 min



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25632) KafkaRDDSuite: compacted topic 2 min 5 sec.

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25632:
---

 Summary: KafkaRDDSuite: compacted topic 2 min 5 sec.
 Key: SPARK-25632
 URL: https://issues.apache.org/jira/browse/SPARK-25632
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.streaming.kafka010.KafkaRDDSuite.compacted topic

Took 2 min 5 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25631) KafkaRDDSuite: basic usage 2 min 4 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25631:
---

 Summary: KafkaRDDSuite: basic usage 2 min 4 sec
 Key: SPARK-25631
 URL: https://issues.apache.org/jira/browse/SPARK-25631
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li



org.apache.spark.streaming.kafka010.KafkaRDDSuite.basic usage

Took 2 min 4 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25630) HiveOrcHadoopFsRelationSuite: SPARK-8406: Avoids name collision while writing files 21 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25630:
---

 Summary: HiveOrcHadoopFsRelationSuite: SPARK-8406: Avoids name 
collision while writing files 21 sec
 Key: SPARK-25630
 URL: https://issues.apache.org/jira/browse/SPARK-25630
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.hive.orc.HiveOrcHadoopFsRelationSuite.SPARK-8406: Avoids 
name collision while writing files

Took 21 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25629) ParquetFilterSuite: filter pushdown - decimal 16 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25629:
---

 Summary: ParquetFilterSuite: filter pushdown - decimal 16 sec
 Key: SPARK-25629
 URL: https://issues.apache.org/jira/browse/SPARK-25629
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.execution.datasources.parquet.ParquetFilterSuite.filter 
pushdown - decimal

Took 16 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25628) DistributedSuite: recover from repeated node failures during shuffle-reduce 40 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25628:
---

 Summary: DistributedSuite: recover from repeated node failures 
during shuffle-reduce 40 seconds
 Key: SPARK-25628
 URL: https://issues.apache.org/jira/browse/SPARK-25628
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.DistributedSuite.recover from repeated node failures during 
shuffle-reduce 40 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25627) ContinuousStressSuite - 8 mins 13 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25627:
---

 Summary: ContinuousStressSuite - 8 mins 13 sec
 Key: SPARK-25627
 URL: https://issues.apache.org/jira/browse/SPARK-25627
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


ContinuousStressSuite - 8 mins 13 sec



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25626) HiveClientSuites: getPartitionsByFilter returns all partitions when hive.metastore.try.direct.sql=false 46 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25626:
---

 Summary: HiveClientSuites: getPartitionsByFilter returns all 
partitions when hive.metastore.try.direct.sql=false 46 sec
 Key: SPARK-25626
 URL: https://issues.apache.org/jira/browse/SPARK-25626
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


HiveClientSuite.2.3: getPartitionsByFilter returns all partitions when 
hive.metastore.try.direct.sql=false  46 sec  Passed
HiveClientSuite.2.2: getPartitionsByFilter returns all partitions when 
hive.metastore.try.direct.sql=false  45 sec  Passed
HiveClientSuite.2.1: getPartitionsByFilter returns all partitions when 
hive.metastore.try.direct.sql=false  42 sec  Passed
HiveClientSuite.2.0: getPartitionsByFilter returns all partitions when 
hive.metastore.try.direct.sql=false  39 sec  Passed
HiveClientSuite.1.2: getPartitionsByFilter returns all partitions when 
hive.metastore.try.direct.sql=false  37 sec  Passed
HiveClientSuite.1.1: getPartitionsByFilter returns all partitions when 
hive.metastore.try.direct.sql=false  36 sec  Passed



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25625) LogisticRegressionSuite.binary logistic regression with intercept with ElasticNet regularization - 33 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25625:
---

 Summary: LogisticRegressionSuite.binary logistic regression with 
intercept with ElasticNet regularization - 33 sec
 Key: SPARK-25625
 URL: https://issues.apache.org/jira/browse/SPARK-25625
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


LogisticRegressionSuite.binary logistic regression with intercept with 
ElasticNet regularization

Took 33 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25624) LogisticRegressionSuite.multinomial logistic regression with intercept with elasticnet regularization 56 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25624:
---

 Summary: LogisticRegressionSuite.multinomial logistic regression 
with intercept with elasticnet regularization 56 seconds
 Key: SPARK-25624
 URL: https://issues.apache.org/jira/browse/SPARK-25624
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.ml.classification.LogisticRegressionSuite.multinomial logistic 
regression with intercept with elasticnet regularization

Took 56 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25623) LogisticRegressionSuite: multinomial logistic regression with intercept with L1 regularization 1 min 10 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25623:
---

 Summary: LogisticRegressionSuite: multinomial logistic regression 
with intercept with L1 regularization 1 min 10 sec
 Key: SPARK-25623
 URL: https://issues.apache.org/jira/browse/SPARK-25623
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.ml.classification.LogisticRegressionSuite.multinomial logistic 
regression with intercept with L1 regularization

Took 1 min 10 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25622) BucketedReadWithHiveSupportSuite: read partitioning bucketed tables with bucket pruning filters - 42 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25622:
---

 Summary: BucketedReadWithHiveSupportSuite: read partitioning 
bucketed tables with bucket pruning filters - 42 seconds
 Key: SPARK-25622
 URL: https://issues.apache.org/jira/browse/SPARK-25622
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.sources.BucketedReadWithHiveSupportSuite.read partitioning 
bucketed tables with bucket pruning filters

Took 42 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25621) BucketedReadWithHiveSupportSuite: read partitioning bucketed tables having composite filters 45 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25621:
---

 Summary: BucketedReadWithHiveSupportSuite: read partitioning 
bucketed tables having composite filters 45 sec
 Key: SPARK-25621
 URL: https://issues.apache.org/jira/browse/SPARK-25621
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.sources.BucketedReadWithHiveSupportSuite.read partitioning 
bucketed tables having composite filters

Took 45 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25620) WithAggregationKinesisStreamSuite: failure recovery 1 min 36 seconds

2018-10-03 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25620:

Description: 
org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure 
recovery

Took 1 min 36 sec.

org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.failure 
recovery

Took 1 min 24 sec.

  was:
org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure 
recovery

Took 1 min 36 sec.


> WithAggregationKinesisStreamSuite: failure recovery 1 min 36 seconds
> 
>
> Key: SPARK-25620
> URL: https://issues.apache.org/jira/browse/SPARK-25620
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure 
> recovery
> Took 1 min 36 sec.
> org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.failure
>  recovery
> Took 1 min 24 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25619) WithAggregationKinesisStreamSuite: split and merge shards in a stream 2 min 15 sec

2018-10-03 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25619:

Description: 
org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split and 
merge shards in a stream 2 min 15 sec

org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.split 
and merge shards in a stream 1 min 52 sec.


  was:
org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split and 
merge shards in a stream 2 min 15 sec



> WithAggregationKinesisStreamSuite: split and merge shards in a stream 2 min 
> 15 sec
> --
>
> Key: SPARK-25619
> URL: https://issues.apache.org/jira/browse/SPARK-25619
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split 
> and merge shards in a stream 2 min 15 sec
> org.apache.spark.streaming.kinesis.WithoutAggregationKinesisStreamSuite.split 
> and merge shards in a stream 1 min 52 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25620) WithAggregationKinesisStreamSuite: failure recovery 1 min 36 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25620:
---

 Summary: WithAggregationKinesisStreamSuite: failure recovery 1 min 
36 seconds
 Key: SPARK-25620
 URL: https://issues.apache.org/jira/browse/SPARK-25620
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.failure 
recovery

Took 1 min 36 sec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25619) WithAggregationKinesisStreamSuite: split and merge shards in a stream 2 min 15 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25619:
---

 Summary: WithAggregationKinesisStreamSuite: split and merge shards 
in a stream 2 min 15 sec
 Key: SPARK-25619
 URL: https://issues.apache.org/jira/browse/SPARK-25619
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.streaming.kinesis.WithAggregationKinesisStreamSuite.split and 
merge shards in a stream 2 min 15 sec




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25618) KafkaContinuousSourceStressForDontFailOnDataLossSuite: stress test for failOnDataLoss=false 1 min 1 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25618:
---

 Summary: KafkaContinuousSourceStressForDontFailOnDataLossSuite: 
stress test for failOnDataLoss=false 1 min 1 sec
 Key: SPARK-25618
 URL: https://issues.apache.org/jira/browse/SPARK-25618
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.kafka010.KafkaContinuousSourceStressForDontFailOnDataLossSuite.stress
 test for failOnDataLoss=false 1 min 1 sec




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25617) KafkaContinuousSinkSuite: generic - write big data with small producer buffer 56 secs

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25617:
---

 Summary: KafkaContinuousSinkSuite: generic - write big data with 
small producer buffer 56 secs
 Key: SPARK-25617
 URL: https://issues.apache.org/jira/browse/SPARK-25617
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.kafka010.KafkaContinuousSinkSuite.generic - write big data 
with small producer buffer 56 seconds 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25616) KafkaSinkSuite: generic - write big data with small producer buffer 57 secs

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25616:
---

 Summary: KafkaSinkSuite: generic - write big data with small 
producer buffer 57 secs
 Key: SPARK-25616
 URL: https://issues.apache.org/jira/browse/SPARK-25616
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.kafka010.KafkaSinkSuite.generic - write big data with 
small producer buffer 57 secs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25615) KafkaSinkSuite: streaming - write to non-existing topic 1 min

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25615:
---

 Summary: KafkaSinkSuite: streaming - write to non-existing topic 1 
min
 Key: SPARK-25615
 URL: https://issues.apache.org/jira/browse/SPARK-25615
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.kafka010.KafkaSinkSuite.streaming - write to non-existing 
topic 1 min



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25614) HiveSparkSubmitSuite: SPARK-18989: DESC TABLE should not fail with format class not found 38 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25614:
---

 Summary: HiveSparkSubmitSuite: SPARK-18989: DESC TABLE should not 
fail with format class not found 38 seconds
 Key: SPARK-25614
 URL: https://issues.apache.org/jira/browse/SPARK-25614
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-18989: DESC TABLE should 
not fail with format class not found 38 seconds




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25613) HiveSparkSubmitSuite: dir 1 min 3 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25613:
---

 Summary: HiveSparkSubmitSuite: dir 1 min 3 seconds
 Key: SPARK-25613
 URL: https://issues.apache.org/jira/browse/SPARK-25613
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.hive.HiveSparkSubmitSuite.dir 1 min 3 sec



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25612) CompressionCodecSuite: table-level compression is not set but session-level compressions 47 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25612:
---

 Summary: CompressionCodecSuite: table-level compression is not set 
but session-level compressions 47 seconds
 Key: SPARK-25612
 URL: https://issues.apache.org/jira/browse/SPARK-25612
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.hive.CompressionCodecSuite.table-level compression is not 
set but session-level compressions is set 47 seconds




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25611) CompressionCodecSuite: both table-level and session-level compression are set 2 min 20 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25611:
---

 Summary: CompressionCodecSuite: both table-level and session-level 
compression are set 2 min 20 sec
 Key: SPARK-25611
 URL: https://issues.apache.org/jira/browse/SPARK-25611
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.hive.CompressionCodecSuite.both table-level and 
session-level compression are set: 2 min 20 sec



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25610) DatasetCacheSuite: cache UDF result correctly 25 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25610:
---

 Summary: DatasetCacheSuite: cache UDF result correctly 25 seconds
 Key: SPARK-25610
 URL: https://issues.apache.org/jira/browse/SPARK-25610
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.DatasetCacheSuite.cache UDF result correctly 25 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25609) DataFrameSuite: SPARK-22226: splitExpressions should not generate codes beyond 64KB 49 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25609:
---

 Summary: DataFrameSuite: SPARK-22226: splitExpressions should not 
generate codes beyond 64KB 49 seconds
 Key: SPARK-25609
 URL: https://issues.apache.org/jira/browse/SPARK-25609
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.DataFrameSuite.SPARK-22226: splitExpressions should not 
generate codes beyond 64KB 49 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25608) HashAggregationQueryWithControlledFallbackSuite: multiple distinct multiple columns sets 38 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25608:
---

 Summary: HashAggregationQueryWithControlledFallbackSuite: multiple 
distinct multiple columns sets 38 seconds
 Key: SPARK-25608
 URL: https://issues.apache.org/jira/browse/SPARK-25608
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.hive.execution.HashAggregationQueryWithControlledFallbackSuite.multiple
 distinct multiple columns sets 38 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25607) HashAggregationQueryWithControlledFallbackSuite: single distinct column set 42 seconds

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25607:
---

 Summary: HashAggregationQueryWithControlledFallbackSuite: single 
distinct column set 42 seconds
 Key: SPARK-25607
 URL: https://issues.apache.org/jira/browse/SPARK-25607
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.hive.execution.HashAggregationQueryWithControlledFallbackSuite.single
 distinct column set 42 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25606) DateExpressionsSuite: Hour 1 min

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25606:
---

 Summary: DateExpressionsSuite: Hour 1 min
 Key: SPARK-25606
 URL: https://issues.apache.org/jira/browse/SPARK-25606
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.catalyst.expressions.DateExpressionsSuite.Hour 1 min



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25605) CastSuite: cast string to timestamp 2 mins 31 sec

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25605:
---

 Summary: CastSuite: cast string to timestamp 2 mins 31 sec
 Key: SPARK-25605
 URL: https://issues.apache.org/jira/browse/SPARK-25605
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


org.apache.spark.sql.catalyst.expressions.CastSuite.cast string to timestamp 
took 2 min 31 secs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25604) Reduce the overall time costs in Jenkins tests

2018-10-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25604:
---

 Summary: Reduce the overall time costs in Jenkins tests 
 Key: SPARK-25604
 URL: https://issues.apache.org/jira/browse/SPARK-25604
 Project: Spark
  Issue Type: Umbrella
  Components: Tests
Affects Versions: 3.0.0
Reporter: Xiao Li


Currently, our Jenkins tests take almost 5 hours. To reduce the test time, 
below are my suggestions:
* split the tests into multiple individual Jenkins jobs;
* tune the confs in the test framework;
* for the slow test cases, rewrite them or even optimize the source code to 
speed them up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24499) Documentation improvement of Spark core and SQL

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24499:

Target Version/s: 3.0.0

> Documentation improvement of Spark core and SQL
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> The current documentation in Apache Spark lacks sufficient code examples and 
> tips. If needed, we should also split the page 
> https://spark.apache.org/docs/latest/sql-programming-guide.html into multiple 
> separate pages, as we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24499) Documentation improvement of Spark core and SQL

2018-10-02 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635836#comment-16635836
 ] 

Xiao Li commented on SPARK-24499:
-

[~XuanYuan] Yeah. Let us do the split first, and then discuss how to enrich our 
doc. We need to add a lot of material if we compare our doc with other popular 
OSS DBMS projects, e.g., Postgres, MySQL, and so on. 

> Documentation improvement of Spark core and SQL
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> The current documentation in Apache Spark lacks sufficient code examples and 
> tips. If needed, we should also split the page 
> https://spark.apache.org/docs/latest/sql-programming-guide.html into multiple 
> separate pages, as we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25414) make it clear that the numRows metrics should be counted for each scan of the source

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25414:

Fix Version/s: (was: 2.5.0)

> make it clear that the numRows metrics should be counted for each scan of the 
> source
> 
>
> Key: SPARK-25414
> URL: https://issues.apache.org/jira/browse/SPARK-25414
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25426) Remove the duplicate fallback logic in UnsafeProjection

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25426:

Fix Version/s: (was: 2.5.0)

> Remove the duplicate fallback logic in UnsafeProjection
> ---
>
> Key: SPARK-25426
> URL: https://issues.apache.org/jira/browse/SPARK-25426
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25381) Stratified sampling by Column argument

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25381:
---

Assignee: Maxim Gekk

> Stratified sampling by Column argument
> --
>
> Key: SPARK-25381
> URL: https://issues.apache.org/jira/browse/SPARK-25381
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently the sampleBy method accepts a first argument of string type only. 
> We need to provide an overloaded method that accepts the Column type too, so 
> that it allows sampling by multiple columns, for example:
> {code:scala}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.functions.struct
> val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), 
> ("Bob", 17),
>   ("Alice", 10))).toDF("name", "age")
> val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
> df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
>+-+---+
>| name|age|
>+-+---+
>| Nico|  8|
>|Alice| 10|
>+-+---+
> {code} 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25415:

Fix Version/s: (was: 2.5.0)

> Make plan change log in RuleExecutor configurable by SQLConf
> 
>
> Key: SPARK-25415
> URL: https://issues.apache.org/jira/browse/SPARK-25415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Major
> Fix For: 3.0.0
>
>
> In RuleExecutor, after applying a rule, if the plan has changed, the before 
> and after plan will be logged using level "trace". At times, however, such 
> information can be very helpful for debugging, so making the log level 
> configurable in SQLConf would allow users to turn on the plan change log 
> independently and save the trouble of tweaking log4j settings.
> Meanwhile, filtering the plan change log for specific rules can also be very 
> useful.
> So I propose adding two confs:
> 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for 
> logging plan changes after a rule is applied.
> 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only 
> for a set of specified rules, separated by commas.
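> A minimal sketch of setting the two proposed confs (the rule name below is an 
> illustrative optimizer rule, not part of this proposal):
> {code:scala}
> // Hypothetical usage of the proposed confs on an existing SparkSession.
> spark.conf.set("spark.sql.optimizer.planChangeLog.level", "warn")
> spark.conf.set("spark.sql.optimizer.planChangeLog.rules",
>   "org.apache.spark.sql.catalyst.optimizer.ColumnPruning")
> {code}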



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25423:

Fix Version/s: (was: 2.5.0)

> Output "dataFilters" in DataSourceScanExec.metadata
> ---
>
> Key: SPARK-25423
> URL: https://issues.apache.org/jira/browse/SPARK-25423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Maryann Xue
>Assignee: Yuming Wang
>Priority: Trivial
>  Labels: starter
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25449) Don't send zero accumulators in heartbeats

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25449:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Don't send zero accumulators in heartbeats
> --
>
> Key: SPARK-25449
> URL: https://issues.apache.org/jira/browse/SPARK-25449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Major
> Fix For: 3.0.0
>
>
> Heartbeats sent from executors to the driver every 10 seconds contain metrics 
> and are generally on the order of a few KBs. However, for large jobs with 
> lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks 
> to die with heartbeat failures. We can mitigate this by not sending zero 
> metrics to the driver.
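> A sketch of the mitigation under illustrative names (the actual heartbeat 
> structures differ):
> {code:scala}
> // Illustrative only: accumulator updates modeled as (id, value) pairs.
> val accumUpdates: Seq[(Long, Long)] = Seq((1L, 0L), (2L, 42L), (3L, 0L))
> // Drop updates whose value is still zero before building the heartbeat payload.
> val nonZeroUpdates = accumUpdates.filter { case (_, value) => value != 0L }
> // nonZeroUpdates == Seq((2L, 42L))
> {code}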



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25472) Structured Streaming query.stop() doesn't always stop gracefully

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25472:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Structured Streaming query.stop() doesn't always stop gracefully
> 
>
> Key: SPARK-25472
> URL: https://issues.apache.org/jira/browse/SPARK-25472
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> We can hit race conditions where cancelling Spark jobs throws a 
> SparkException when stopping a streaming query. This SparkException specifies 
> that the job was cancelled, so we can use that error message to swallow the 
> error.
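> A sketch of the proposed handling, assuming the cancellation message can be 
> matched on (illustrative, not the actual patch; `query` stands for an assumed 
> StreamingQuery handle):
> {code:scala}
> // Swallow only the cancellation-induced SparkException raised during stop().
> try {
>   query.stop()
> } catch {
>   case e: org.apache.spark.SparkException
>       if e.getMessage != null && e.getMessage.contains("cancelled") =>
>     // Expected when jobs are cancelled as part of stopping; ignore.
> }
> {code}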



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25458) Support FOR ALL COLUMNS in ANALYZE TABLE

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25458:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Support FOR ALL COLUMNS in ANALYZE TABLE 
> -
>
> Key: SPARK-25458
> URL: https://issues.apache.org/jira/browse/SPARK-25458
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, to collect the statistics of all the columns, users need to 
> specify the names of all the columns when calling the command "ANALYZE TABLE 
> ... FOR COLUMNS...". This is not user-friendly. Instead, we can introduce the 
> following SQL command to achieve it without specifying the column names.
> {code:java}
>ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR ALL COLUMNS;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25457) IntegralDivide (div) should not always return long

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25457:

Fix Version/s: (was: 2.5.0)
   3.0.0

> IntegralDivide (div) should not always return long
> --
>
> Key: SPARK-25457
> URL: https://issues.apache.org/jira/browse/SPARK-25457
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> The operation {{div}} always returns long. This came from Hive's behavior, 
> which differs from that of most other DBMSs (e.g., MySQL, Postgres), which 
> return the same data type as the operands.
> This JIRA tracks changing our return type and allowing users to re-enable 
> the old behavior using {{spark.sql.legacy.integralDivide.returnBigint}}.
> I'll submit a PR for this soon.
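> A sketch of the intended behavior change (hypothetical until the PR lands):
> {code:scala}
> // Today `SELECT 7 div 2` yields a bigint; after the change it should yield
> // the operand type (int here).
> spark.sql("SELECT 7 div 2").printSchema()
> // Opt back into the old behavior:
> spark.conf.set("spark.sql.legacy.integralDivide.returnBigint", "true")
> {code}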



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25429:

Fix Version/s: (was: 2.5.0)

> SparkListenerBus inefficient due to 
> 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
> 
>
> Key: SPARK-25429
> URL: https://issues.apache.org/jira/browse/SPARK-25429
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: DENG FEI
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:java}
> private def updateStageMetrics(
>   stageId: Int,
>   attemptId: Int,
>   taskId: Long,
>   accumUpdates: Seq[AccumulableInfo],
>   succeeded: Boolean): Unit = {
> Option(stageMetrics.get(stageId)).foreach { metrics =>
>   if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) {
> return
>   }
>   val oldTaskMetrics = metrics.taskMetrics.get(taskId)
>   if (oldTaskMetrics != null && oldTaskMetrics.succeeded) {
> return
>   }
>   val updates = accumUpdates
> .filter { acc => acc.update.isDefined && 
> metrics.accumulatorIds.contains(acc.id) }
> .sortBy(_.id)
>   if (updates.isEmpty) {
> return
>   }
>   val ids = new Array[Long](updates.size)
>   val values = new Array[Long](updates.size)
>   updates.zipWithIndex.foreach { case (acc, idx) =>
> ids(idx) = acc.id
> // In a live application, accumulators have Long values, but when 
> reading from event
> // logs, they have String values. For now, assume all accumulators 
> are Long and covert
> // accordingly.
> values(idx) = acc.update.get match {
>   case s: String => s.toLong
>   case l: Long => l
>   case o => throw new IllegalArgumentException(s"Unexpected: $o")
> }
>   }
>   // TODO: storing metrics by task ID can cause metrics for the same task 
> index to be
>   // counted multiple times, for example due to speculation or 
> re-attempts.
>   metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, 
> succeeded))
> }
>   }
> {code}
> Regarding 'metrics.accumulatorIds.contains(acc.id)': if a large SQL 
> application generates many accumulators, using Array#contains is inefficient.
> In fact, the application may time out while quitting and be killed by the RM 
> in YARN mode.
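> A sketch of a fix against the code quoted above (illustrative, not the actual 
> patch): materialize the ids into a Set once, so the membership test is 
> constant time:
> {code:scala}
> // Replace the linear Array[Long]#contains scan with a hash-set lookup.
> val accumulatorIdSet: Set[Long] = metrics.accumulatorIds.toSet
> val updates = accumUpdates
>   .filter { acc => acc.update.isDefined && accumulatorIdSet.contains(acc.id) }
>   .sortBy(_.id)
> {code}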



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25444) Refactor GenArrayData.genCodeToCreateArrayData() method

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25444:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Refactor GenArrayData.genCodeToCreateArrayData() method
> ---
>
> Key: SPARK-25444
> URL: https://issues.apache.org/jira/browse/SPARK-25444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 3.0.0
>
>
> {{GenArrayData.genCodeToCreateArrayData()}} generates Java code that creates 
> a temporary Java array in order to build {{ArrayData}}. The temporary array 
> can be eliminated by using {{ArrayData.createArrayData}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25447) Support JSON options by schema_of_json

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25447:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Support JSON options by schema_of_json
> --
>
> Key: SPARK-25447
> URL: https://issues.apache.org/jira/browse/SPARK-25447
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The function schema_of_json doesn't currently accept any options, but options 
> can impact schema inference. We need to support the same options that 
> from_json() can use for schema inference. Here are examples of options that 
> could impact schema inference:
> * primitivesAsString
> * prefersDecimal
> * allowComments
> * allowUnquotedFieldNames
> * allowSingleQuotes
> * allowNumericLeadingZeros
> * allowNonNumericNumbers
> * allowBackslashEscapingAnyCharacter
> * allowUnquotedControlChars
> Below is a possible signature:
> {code:scala}
> def schema_of_json(e: Column, options: java.util.Map[String, String]): Column
> {code}
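> Hypothetical usage of the proposed overload (the option choice is 
> illustrative, and `df` is any existing DataFrame):
> {code:scala}
> import org.apache.spark.sql.functions.{lit, schema_of_json}
> import scala.collection.JavaConverters._
>
> // Infer a schema while treating all primitives as strings.
> val options = Map("primitivesAsString" -> "true").asJava
> df.select(schema_of_json(lit("""{"id": 1}"""), options))
> {code}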



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25465) Refactor Parquet test suites in project Hive

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25465:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Refactor Parquet test suites in project Hive
> 
>
> Key: SPARK-25465
> URL: https://issues.apache.org/jira/browse/SPARK-25465
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently the file 
> parquetSuites.scala (https://github.com/apache/spark/blob/f29c2b5287563c0d6f55f936bd5a75707d7b2b1f/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala)
> is hard to recognize by name.
> When I tried to find the test suites for built-in Parquet conversions for the 
> Hive serde, I could only find 
> HiveParquetSuite (https://github.com/apache/spark/blob/f29c2b5287563c0d6f55f936bd5a75707d7b2b1f/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala)
> in the first few minutes.
> The file name and test suite naming should be revised.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25476) Refactor AggregateBenchmark to use main method

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25476:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Refactor AggregateBenchmark to use main method
> --
>
> Key: SPARK-25476
> URL: https://issues.apache.org/jira/browse/SPARK-25476
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25473) PySpark ForeachWriter test fails on Python 3.6 and macOS High Serria

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25473:

Fix Version/s: (was: 2.5.0)
   3.0.0

> PySpark ForeachWriter test fails on Python 3.6 and macOS High Serria
> 
>
> Key: SPARK-25473
> URL: https://issues.apache.org/jira/browse/SPARK-25473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> {code}
> PYSPARK_PYTHON=python3.6 SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests SQLTests
> {code}
> {code}
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> /usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py:766:
>  ResourceWarning: subprocess 27563 is still running
>   ResourceWarning, source=self)
> [Stage 0:>  (0 + 1) / 
> 1]objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in 
> progress in another thread when fork() was called.
> objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in 
> progress in another thread when fork() was called. We cannot safely call it 
> or ignore it in the fork() child process. Crashing instead. Set a breakpoint 
> on objc_initializeAfterForkError to debug.
> ERROR
> ==
> ERROR: test_streaming_foreach_with_simple_function 
> (pyspark.sql.tests.SQLTests)
> --
> Traceback (most recent call last):
>   File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o54.processAllAvailable.
> : org.apache.spark.sql.streaming.StreamingQueryException: Writing job aborted.
> === Streaming Query ===
> Identifier: [id = f508d634-407c-4232-806b-70e54b055c42, runId = 
> 08d1435b-5358-4fb6-b167-811584a3163e]
> Current Committed Offsets: {}
> Current Available Offsets: 
> {FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s]:
>  {"logOffset":0}}
> Current State: ACTIVE
> Thread State: RUNNABLE
> Logical Plan:
> FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hrgp/T/tmpolebys1s]
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> Caused by: org.apache.spark.SparkException: Writing job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:91)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783)
>   at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2783)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$

[jira] [Updated] (SPARK-25486) Refactor SortBenchmark to use main method

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25486:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Refactor SortBenchmark to use main method
> -
>
> Key: SPARK-25486
> URL: https://issues.apache.org/jira/browse/SPARK-25486
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: yucai
>Assignee: yucai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25499) Refactor BenchmarkBase and Benchmark

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25499:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Refactor BenchmarkBase and Benchmark
> 
>
> Key: SPARK-25499
> URL: https://issues.apache.org/jira/browse/SPARK-25499
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently there are two classes with the same name, BenchmarkBase:
> 1. org.apache.spark.util.BenchmarkBase
> 2. org.apache.spark.sql.execution.benchmark.BenchmarkBase
> Here I propose:
> 1. org.apache.spark.util.BenchmarkBase should be in a test package; move it to 
> org.apache.spark.sql.execution.benchmark.
> 2. Rename org.apache.spark.sql.execution.benchmark.BenchmarkBase to 
> BenchmarkWithCodegen.
> 3. Move org.apache.spark.util.Benchmark to the test package of 
> org.apache.spark.sql.execution.benchmark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25489) Refactor UDTSerializationBenchmark

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25489:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Refactor UDTSerializationBenchmark
> --
>
> Key: SPARK-25489
> URL: https://issues.apache.org/jira/browse/SPARK-25489
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 3.0.0
>
>
> Refactor UDTSerializationBenchmark to use main method and print the output as 
> a separate file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25481) Refactor ColumnarBatchBenchmark to use main method

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25481:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Refactor ColumnarBatchBenchmark to use main method
> --
>
> Key: SPARK-25481
> URL: https://issues.apache.org/jira/browse/SPARK-25481
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: yucai
>Assignee: yucai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25485) Refactor UnsafeProjectionBenchmark to use main method

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25485:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Refactor UnsafeProjectionBenchmark to use main method
> -
>
> Key: SPARK-25485
> URL: https://issues.apache.org/jira/browse/SPARK-25485
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: yucai
>Assignee: yucai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25478) Refactor CompressionSchemeBenchmark to use main method

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25478:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Refactor CompressionSchemeBenchmark to use main method
> --
>
> Key: SPARK-25478
> URL: https://issues.apache.org/jira/browse/SPARK-25478
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25494) Upgrade Spark's use of Janino to 3.0.10

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25494:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Upgrade Spark's use of Janino to 3.0.10
> ---
>
> Key: SPARK-25494
> URL: https://issues.apache.org/jira/browse/SPARK-25494
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 3.0.0
>
>
> This ticket proposes to upgrade Spark's use of Janino from 3.0.9 to 3.0.10.
> Note that 3.0.10 is an out-of-band release specifically for fixing an integer 
> overflow issue in Janino's {{ClassFile}} reader. It is otherwise exactly the 
> same as 3.0.9, so it's a low-risk and compatible upgrade.
> The integer overflow issue affects Spark SQL's codegen stats collection: when 
> a generated Class file is huge, especially when the constant pool size is 
> above {{Short.MAX_VALUE}}, Janino's {{ClassFile}} reader will throw an 
> exception when Spark wants to parse the generated Class file to collect 
> stats. So we'll miss the stats of some huge Class files.
> The Janino fix is tracked by this issue: 
> https://github.com/janino-compiler/janino/issues/58



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25487) Refactor PrimitiveArrayBenchmark

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25487:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Refactor PrimitiveArrayBenchmark
> 
>
> Key: SPARK-25487
> URL: https://issues.apache.org/jira/browse/SPARK-25487
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 3.0.0
>
>
> Refactor PrimitiveArrayBenchmark to use main method and print the output as a 
> separate file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25508) Refactor OrcReadBenchmark to use main method

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25508:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Refactor OrcReadBenchmark to use main method
> 
>
> Key: SPARK-25508
> URL: https://issues.apache.org/jira/browse/SPARK-25508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: yucai
>Assignee: yucai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25510) Create a new trait SqlBasedBenchmark

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25510:

Fix Version/s: (was: 2.5.0)
   3.0.0

>  Create a new trait SqlBasedBenchmark
> -
>
> Key: SPARK-25510
> URL: https://issues.apache.org/jira/browse/SPARK-25510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25534) Make `SQLHelper` trait

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25534:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Make `SQLHelper` trait
> --
>
> Key: SPARK-25534
> URL: https://issues.apache.org/jira/browse/SPARK-25534
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark has 7 `withTempPath` and 6 `withSQLConf` functions. This PR 
> aims to remove duplicated and inconsistent code and reduce them to the 
> following meaningful implementations.
> *withTempPath*
> - `SQLHelper.withTempPath`: The one which was used in `SQLTestUtils`.
> *withSQLConf*
> - `SQLHelper.withSQLConf`: The one which was used in `PlanTest`.
> - `ExecutorSideSQLConfSuite.withSQLConf`: The one which doesn't throw 
> `AnalysisException` on StaticConf changes.
> - `SQLTestUtils.withSQLConf`: The one which intentionally overrides to change 
> the active session.
> {code}
> protected override def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
>   SparkSession.setActiveSession(spark)
>   super.withSQLConf(pairs: _*)(f)
> }
> {code}
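> For reference, a typical call site of the consolidated helper would look like 
> this (a minimal sketch; the config key and the body are illustrative only):
> {code:scala}
> withSQLConf("spark.sql.shuffle.partitions" -> "1") {
>   // Code here runs with the temporary setting; the previous value (or the
>   // unset state) is restored when the block exits.
>   assert(spark.conf.get("spark.sql.shuffle.partitions") == "1")
> }
> {code}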



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25541) CaseInsensitiveMap should be serializable after '-' operator

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25541:

Fix Version/s: (was: 2.5.0)
   3.0.0

> CaseInsensitiveMap should be serializable after '-' operator
> 
>
> Key: SPARK-25541
> URL: https://issues.apache.org/jira/browse/SPARK-25541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25540) Make HiveContext in PySpark behave the same as in Scala.

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25540:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Make HiveContext in PySpark behave the same as in Scala.
> 
>
> Key: SPARK-25540
> URL: https://issues.apache.org/jira/browse/SPARK-25540
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
>
> In Scala, {{HiveContext}} sets the config {{spark.sql.catalogImplementation}} 
> on the given {{SparkContext}} and then passes it to {{SparkSession.builder}}.
> The {{HiveContext}} in PySpark should behave the same as it does in Scala.
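> Roughly, the Scala behavior described above corresponds to this sketch 
> (simplified; not the actual HiveContext source):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> // HiveContext effectively guarantees a Hive-enabled session, equivalent to:
> val spark = SparkSession.builder()
>   .config("spark.sql.catalogImplementation", "hive")
>   .getOrCreate()
> {code}
> The PySpark HiveContext should set the same config before building its 
> underlying SparkSession.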



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25525) Do not update conf for existing SparkContext in SparkSession.getOrCreate.

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25525:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Do not update conf for existing SparkContext in SparkSession.getOrCreate.
> -
>
> Key: SPARK-25525
> URL: https://issues.apache.org/jira/browse/SPARK-25525
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
>
> In SPARK-20946, we modified {{SparkSession.getOrCreate}} to not update the 
> conf of an existing {{SparkContext}}, because the {{SparkContext}} is shared by 
> all sessions.
> We should not update it on the PySpark side either.
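> A sketch of the intended semantics (the config key is an illustrative 
> assumption):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val s1 = SparkSession.builder().config("spark.some.option", "v1").getOrCreate()
> val s2 = SparkSession.builder().config("spark.some.option", "v2").getOrCreate()
> assert(s1 eq s2)  // the existing session is reused
> // The second call updates the session-level conf only; the conf of the
> // shared SparkContext is no longer modified.
> {code}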



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25551) Remove unused InSubquery expression

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25551:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Remove unused InSubquery expression
> ---
>
> Key: SPARK-25551
> URL: https://issues.apache.org/jira/browse/SPARK-25551
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Trivial
> Fix For: 3.0.0
>
>
> SPARK-16958 introduced an {{InSubquery}} expression. Its only usage was 
> removed in SPARK-18874. Hence it is no longer used and can be 
> removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25514) Generating pretty JSON by to_json

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25514:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Generating pretty JSON by to_json
> -
>
> Key: SPARK-25514
> URL: https://issues.apache.org/jira/browse/SPARK-25514
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> It would be nice to have an option, for example *"pretty"*, which enables a 
> special output mode for the to_json function. In this mode, the produced JSON 
> string has an easily readable representation. For example:
> {code:scala}
> val json = 
> """[{"book":{"publisher":[{"country":"NL","year":[1981,1986,1999]}]}}]"""
> to_json(from_json('col, ...), Map("pretty" -> "true"))
> [ {
>   "book" : {
>     "publisher" : [ {
>       "country" : "NL",
>       "year" : [ 1981, 1986, 1999 ]
>     } ]
>   }
> } ]
> {code}
> There are at least two use cases:
> # Exploring the content of nested columns. For example, a result of your query 
> is a few rows, some columns have a deeply nested structure, and you want to 
> analyze and find the value of one of the nested fields.
> # You already have JSON in one of the columns and want to explore the JSON 
> records. The new option allows doing that easily, without copy-pasting JSON 
> content into an editor, by combining the from_json and to_json functions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25559) Just remove the unsupported predicates in Parquet

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25559:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Just remove the unsupported predicates in Parquet
> -
>
> Key: SPARK-25559
> URL: https://issues.apache.org/jira/browse/SPARK-25559
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, in *ParquetFilters*, if one of the child predicates is not 
> supported by Parquet, the entire predicate will be thrown away. In fact, if 
> the unsupported predicate is in the top-level *And* condition, or in a child 
> reached before hitting a *Not* or *Or* condition, it's safe to just remove the 
> unsupported one and report it as an unhandled filter.
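> A minimal sketch of that pruning rule over a toy predicate ADT (this 
> illustrates the idea only; it is not Spark's actual ParquetFilters code):
> {code:scala}
> sealed trait Pred
> case class And(l: Pred, r: Pred) extends Pred
> case class Or(l: Pred, r: Pred) extends Pred
> case class Not(p: Pred) extends Pred
> case class Leaf(name: String) extends Pred
>
> def fullySupported(p: Pred, ok: Leaf => Boolean): Boolean = p match {
>   case And(l, r)  => fullySupported(l, ok) && fullySupported(r, ok)
>   case Or(l, r)   => fullySupported(l, ok) && fullySupported(r, ok)
>   case Not(c)     => fullySupported(c, ok)
>   case leaf: Leaf => ok(leaf)
> }
>
> // Returns the pushable part of the predicate, or None if nothing is pushable.
> def prune(p: Pred, ok: Leaf => Boolean): Option[Pred] = p match {
>   // In a top-level And it is safe to drop an unsupported side: the surviving
>   // conjunct only over-selects rows, and Spark re-applies the full filter.
>   case And(l, r) =>
>     (prune(l, ok), prune(r, ok)) match {
>       case (Some(a), Some(b)) => Some(And(a, b))
>       case (some, None)       => some
>       case (None, some)       => some
>     }
>   // Under Or or Not, dropping a child would change the semantics, so such a
>   // subtree is kept only when it is fully supported.
>   case other => if (fullySupported(other, ok)) Some(other) else None
> }
> {code}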



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25565) Add a Scala style checker to check that Locale.ROOT is added to .toLowerCase and .toUpperCase for internal calls

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25565:

Fix Version/s: (was: 2.5.0)
   3.0.0

> Add a Scala style checker to check that Locale.ROOT is added to .toLowerCase 
> and .toUpperCase for internal calls
> 
>
> Key: SPARK-25565
> URL: https://issues.apache.org/jira/browse/SPARK-25565
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25575) The SQL tab in the Spark UI doesn't have the option of hiding tables, even though other UI tabs have it.

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25575:

Fix Version/s: (was: 2.5.0)
   3.0.0

> The SQL tab in the Spark UI doesn't have the option of hiding tables, even 
> though other UI tabs have it. 
> -
>
> Key: SPARK-25575
> URL: https://issues.apache.org/jira/browse/SPARK-25575
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: Screenshot from 2018-09-29 23-26-45.png, Screenshot from 
> 2018-09-29 23-26-57.png
>
>
> Test steps:
>  1) bin/spark-shell
> {code:java}
> sql("create table a (id int)")
> for(i <- 1 to 100) sql(s"insert into a values ($i)")
> {code}
> Open SQL tab in the web UI,
>  !Screenshot from 2018-09-29 23-26-45.png! 
> Open Jobs tab,
>  !Screenshot from 2018-09-29 23-26-57.png! 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25592) Bump master branch version to 3.0.0-SNAPSHOT

2018-10-02 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25592.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Bump master branch version to 3.0.0-SNAPSHOT
> 
>
> Key: SPARK-25592
> URL: https://issues.apache.org/jira/browse/SPARK-25592
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 3.0.0
>
>
> This patch bumps the master branch version to `3.0.0-SNAPSHOT`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25592) Bump master branch version to 3.0.0-SNAPSHOT

2018-10-01 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25592:
---

 Summary: Bump master branch version to 3.0.0-SNAPSHOT
 Key: SPARK-25592
 URL: https://issues.apache.org/jira/browse/SPARK-25592
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Xiao Li
Assignee: Xiao Li


This patch bumps the master branch version to `3.0.0-SNAPSHOT`.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23210) Introduce the concept of default value to schema

2018-09-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23210:

Target Version/s: 3.0.0

> Introduce the concept of default value to schema
> 
>
> Key: SPARK-23210
> URL: https://issues.apache.org/jira/browse/SPARK-23210
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: LvDongrong
>Priority: Major
>
> There is no concept of a DEFAULT VALUE for a schema in Spark now.
> Our team wants to support inserting into a subset of a table's columns, like 
> "insert into (a, c) values ("value1", "value2")" for our use case, but the 
> default value of a column is not defined. In Hive, the default value of a 
> column is NULL if we don't specify one.
> So I think it is necessary to introduce the concept of a default value to 
> schemas in Spark.
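> A hypothetical illustration of the proposed behavior (this INSERT syntax is 
> not supported today):
> {code:scala}
> spark.sql("CREATE TABLE t (a STRING, b STRING, c STRING) USING parquet")
> // Insert into a subset of the columns; the unspecified column b would be
> // filled with its default value -- NULL when none is declared, as in Hive.
> spark.sql("INSERT INTO t (a, c) VALUES ('value1', 'value2')")
> spark.sql("SELECT * FROM t").show()  // expected: value1, null, value2
> {code}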



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25453) OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]

2018-09-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25453.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> OracleIntegrationSuite IllegalArgumentException: Timestamp format must be 
> yyyy-mm-dd hh:mm:ss[.fffffffff]
> -
>
> Key: SPARK-25453
> URL: https://issues.apache.org/jira/browse/SPARK-25453
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 2.4.0
>
>
> {noformat}
> - SPARK-22814 support date/timestamp types in partitionColumn *** FAILED ***
>   java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd 
> hh:mm:ss[.fffffffff]
>   at java.sql.Timestamp.valueOf(Timestamp.java:204)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.toInternalBoundValue(JDBCRelation.scala:183)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:88)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
>   at 
> org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:445)
>   at 
> org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:427)
>   ...{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25453) OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]

2018-09-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25453:
---

Assignee: Chenxiao Mao

> OracleIntegrationSuite IllegalArgumentException: Timestamp format must be 
> yyyy-mm-dd hh:mm:ss[.fffffffff]
> -
>
> Key: SPARK-25453
> URL: https://issues.apache.org/jira/browse/SPARK-25453
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Chenxiao Mao
>Priority: Major
>
> {noformat}
> - SPARK-22814 support date/timestamp types in partitionColumn *** FAILED ***
>   java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd 
> hh:mm:ss[.fffffffff]
>   at java.sql.Timestamp.valueOf(Timestamp.java:204)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.toInternalBoundValue(JDBCRelation.scala:183)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:88)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
>   at 
> org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:445)
>   at 
> org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:427)
>   ...{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25576) Fix lint failure in 2.2

2018-09-29 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25576:
---

 Summary: Fix lint failure in 2.2
 Key: SPARK-25576
 URL: https://issues.apache.org/jira/browse/SPARK-25576
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.2.2
Reporter: Xiao Li


See the errors:

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.2-lint/913/console



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25568) Continue to update the remaining accumulators when failing to update one accumulator

2018-09-29 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25568.
-
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.3
   2.2.3

> Continue to update the remaining accumulators when failing to update one 
> accumulator
> 
>
> Key: SPARK-25568
> URL: https://issues.apache.org/jira/browse/SPARK-25568
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.2.3, 2.3.3, 2.4.0
>
>
> Currently when failing to update an accumulator, 
> DAGScheduler.updateAccumulators will skip the remaining accumulators. We 
> should try to update the remaining accumulators if possible so that they can 
> still report correct values.
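> A minimal sketch of the intended behavior (hypothetical types and helper, not 
> the actual DAGScheduler code):
> {code:scala}
> import scala.util.control.NonFatal
>
> case class AccumUpdate(id: Long, value: Long)
>
> def applyAll(updates: Seq[AccumUpdate], applyOne: AccumUpdate => Unit): Unit =
>   updates.foreach { u =>
>     try applyOne(u)
>     catch {
>       case NonFatal(e) =>
>         // Log and continue with the remaining accumulators instead of
>         // aborting the whole loop.
>         Console.err.println(s"Failed to update accumulator ${u.id}: $e")
>     }
>   }
> {code}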



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25573) Combine resolveExpression and resolve in the rule ResolveReferences

2018-09-28 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25573:
---

 Summary: Combine resolveExpression and resolve in the rule 
ResolveReferences
 Key: SPARK-25573
 URL: https://issues.apache.org/jira/browse/SPARK-25573
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Xiao Li


In the rule ResolveReferences, two private functions `resolve` and 
`resolveExpression` should be combined. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure

2018-09-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25429.
-
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 2.5.0

> SparkListenerBus inefficient due to 
> 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
> 
>
> Key: SPARK-25429
> URL: https://issues.apache.org/jira/browse/SPARK-25429
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: DENG FEI
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.5.0
>
>
> {code:java}
> private def updateStageMetrics(
>     stageId: Int,
>     attemptId: Int,
>     taskId: Long,
>     accumUpdates: Seq[AccumulableInfo],
>     succeeded: Boolean): Unit = {
>   Option(stageMetrics.get(stageId)).foreach { metrics =>
>     if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) {
>       return
>     }
>     val oldTaskMetrics = metrics.taskMetrics.get(taskId)
>     if (oldTaskMetrics != null && oldTaskMetrics.succeeded) {
>       return
>     }
>     val updates = accumUpdates
>       .filter { acc => acc.update.isDefined && metrics.accumulatorIds.contains(acc.id) }
>       .sortBy(_.id)
>     if (updates.isEmpty) {
>       return
>     }
>     val ids = new Array[Long](updates.size)
>     val values = new Array[Long](updates.size)
>     updates.zipWithIndex.foreach { case (acc, idx) =>
>       ids(idx) = acc.id
>       // In a live application, accumulators have Long values, but when reading
>       // from event logs, they have String values. For now, assume all
>       // accumulators are Long and convert accordingly.
>       values(idx) = acc.update.get match {
>         case s: String => s.toLong
>         case l: Long => l
>         case o => throw new IllegalArgumentException(s"Unexpected: $o")
>       }
>     }
>     // TODO: storing metrics by task ID can cause metrics for the same task
>     // index to be counted multiple times, for example due to speculation or
>     // re-attempts.
>     metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, succeeded))
>   }
> }
> {code}
> In 'metrics.accumulatorIds.contains(acc.id)', if a large SQL application 
> generates many accumulators, using Array#contains is inefficient.
> In practice, the application may time out while quitting and then be killed by 
> the RM in YARN mode.
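> A quick sketch of the cheaper lookup (illustrative sizes):
> {code:scala}
> val accumulatorIds: Array[Long] = Array.tabulate(100000)(_.toLong)
> val accumulatorIdSet: Set[Long] = accumulatorIds.toSet
>
> val accId = 99999L
> accumulatorIds.contains(accId)   // O(n) linear scan on every task update
> accumulatorIdSet.contains(accId) // effectively O(1) hash lookup
> {code}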



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25458) Support FOR ALL COLUMNS in ANALYZE TABLE

2018-09-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25458.
-
   Resolution: Fixed
 Assignee: Dilip Biswal
Fix Version/s: 2.5.0

> Support FOR ALL COLUMNS in ANALYZE TABLE 
> -
>
> Key: SPARK-25458
> URL: https://issues.apache.org/jira/browse/SPARK-25458
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.5.0
>
>
> Currently, to collect the statistics of all the columns, users need to 
> specify the names of all the columns when calling the command "ANALYZE TABLE 
> ... FOR COLUMNS...". This is not user-friendly. Instead, we can introduce the 
> following SQL command to achieve it without specifying the column names.
> {code:java}
>ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR ALL COLUMNS;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order

2018-09-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25505:
---

Assignee: Maryann Xue

> The output order of grouping columns in Pivot is different from the input 
> order
> ---
>
> Key: SPARK-25505
> URL: https://issues.apache.org/jira/browse/SPARK-25505
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Minor
> Fix For: 2.4.0
>
>
> For example,
> {code}
> SELECT * FROM (
>   SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, 
> "x" as x, "d" as d, "w" as w
>   FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN ('dotNET', 'Java')
> )
> {code}
> The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, 
> b, c, d, w, x, y, z, ..."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order

2018-09-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25505:

Fix Version/s: 2.4.0

> The output order of grouping columns in Pivot is different from the input 
> order
> ---
>
> Key: SPARK-25505
> URL: https://issues.apache.org/jira/browse/SPARK-25505
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Minor
> Fix For: 2.4.0
>
>
> For example,
> {code}
> SELECT * FROM (
>   SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, 
> "x" as x, "d" as d, "w" as w
>   FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN ('dotNET', 'Java')
> )
> {code}
> The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, 
> b, c, d, w, x, y, z, ..."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25505) The output order of grouping columns in Pivot is different from the input order

2018-09-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25505.
-
Resolution: Fixed

> The output order of grouping columns in Pivot is different from the input 
> order
> ---
>
> Key: SPARK-25505
> URL: https://issues.apache.org/jira/browse/SPARK-25505
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Minor
> Fix For: 2.4.0
>
>
> For example,
> {code}
> SELECT * FROM (
>   SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, 
> "x" as x, "d" as d, "w" as w
>   FROM courseSales
> )
> PIVOT (
>   sum(earnings)
>   FOR course IN ('dotNET', 'Java')
> )
> {code}
> The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, 
> b, c, d, w, x, y, z, ..."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25454) Division between operands with negative scale can cause precision loss

2018-09-26 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25454.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.4.0
   2.3.3

> Division between operands with negative scale can cause precision loss
> --
>
> Key: SPARK-25454
> URL: https://issues.apache.org/jira/browse/SPARK-25454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Marco Gaido
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.3, 2.4.0
>
>
> The issue was originally reported by [~bersprockets] here: 
> https://issues.apache.org/jira/browse/SPARK-22036?focusedCommentId=16618104&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16618104.
> The problem consists of a precision loss when the second operand of the 
> division is a decimal with a negative scale. It was also present before 2.3, 
> but it was harder to reproduce: you had to do something like 
> {{lit(BigDecimal(100e6))}}, while now this can happen more frequently with 
> SQL constants.
> The problem is that our logic is taken from Hive and SQLServer, where decimals 
> with negative scales are not allowed. We might consider enforcing this 
> in 3.0 eventually. Meanwhile we can fix the logic for computing the 
> result type for a division.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23839) consider bucket join in cost-based JoinReorder rule

2018-09-25 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628080#comment-16628080
 ] 

Xiao Li commented on SPARK-23839:
-

To implement CBO in the planner, we need a major change in the planner itself. 
The stats-based JoinReorder rule is just the current workaround before we do the 
actual cost-based optimizer. 

> consider bucket join in cost-based JoinReorder rule
> ---
>
> Key: SPARK-23839
> URL: https://issues.apache.org/jira/browse/SPARK-23839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiaoju Wu
>Priority: Minor
>
> Since Spark 2.2, the cost-based JoinReorder rule has been implemented, and in 
> the Spark 2.3 release it was improved with histograms. However, it doesn't take 
> into account the cost of the different join implementations. For example:
> TableA JOIN TableB JOIN TableC
> TableA  will output 10,000 rows after filter and projection. 
> TableB  will output 10,000 rows after filter and projection. 
> TableC  will output 8,000 rows after filter and projection. 
> The current JoinReorder rule will possibly optimize the plan to join TableC 
> with TableA first and then TableB. But if TableA and TableB are bucketed 
> tables to which a bucket join can be applied, it could be a different story. 
>  
> Also, to support bucket joins of more than 2 tables when one table's bucket 
> number is a multiple of another's (SPARK-17570), whether the bucket join can 
> take effect depends on the result of JoinReorder. For example, for "A join B 
> join C" with bucket numbers like 8, 4, 12, the JoinReorder rule should optimize 
> the order to "A join B join C" to make the bucket join take effect, instead of 
> "C join A join B". 
>  
> Based on the current CBO JoinReorder, there are possibly 2 parts to be changed:
>  # The CostBasedJoinReorder rule is applied in the optimizer phase, while we do 
> join selection in the planner phase and bucket join optimization in 
> EnsureRequirements, which is in the preparation phase. Both are after the 
> optimizer. 
>  # The current statistics and join cost formula are based on data selectivity 
> and cardinality; we need to add statistics to represent the join method cost, 
> such as shuffle, sort, hash, etc. We also need to add these statistics into the 
> formula to estimate the join cost. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25465) Refactor Parquet test suites in project Hive

2018-09-22 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25465.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.5.0

> Refactor Parquet test suites in project Hive
> 
>
> Key: SPARK-25465
> URL: https://issues.apache.org/jira/browse/SPARK-25465
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.5.0
>
>
> Currently the file 
> parquetSuites.scala(https://github.com/apache/spark/blob/f29c2b5287563c0d6f55f936bd5a75707d7b2b1f/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala)
>  is not easily recognizable. 
> When I tried to find test suites for built-in Parquet conversions for Hive 
> serde, I could only find 
> HiveParquetSuite(https://github.com/apache/spark/blob/f29c2b5287563c0d6f55f936bd5a75707d7b2b1f/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala)
>  in the first few minutes.
> The file name and test suite naming can be revised.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24499) Documentation improvement of Spark core and SQL

2018-09-21 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16623799#comment-16623799
 ] 

Xiao Li commented on SPARK-24499:
-

ping [~XuanYuan] Any update?

> Documentation improvement of Spark core and SQL
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> The current documentation in Apache Spark lacks sufficient code examples and 
> tips. If needed, we should also split the page 
> https://spark.apache.org/docs/latest/sql-programming-guide.html into multiple 
> separate pages, as we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25179) Document the features that require Pyarrow 0.10

2018-09-21 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25179:

Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-25507

> Document the features that require Pyarrow 0.10
> ---
>
> Key: SPARK-25179
> URL: https://issues.apache.org/jira/browse/SPARK-25179
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Document the features that require Pyarrow 0.10 . For 
> example, https://github.com/apache/spark/pull/20725
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Binary type support requires pyarrow 0.10.0. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25507) Update documents for the new features in 2.4 release

2018-09-21 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25507:
---

 Summary: Update documents for the new features in 2.4 release
 Key: SPARK-25507
 URL: https://issues.apache.org/jira/browse/SPARK-25507
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.4.0
Reporter: Xiao Li






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25494) Upgrade Spark's use of Janino to 3.0.10

2018-09-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25494.
-
   Resolution: Fixed
 Assignee: Kris Mok
Fix Version/s: 2.5.0

> Upgrade Spark's use of Janino to 3.0.10
> ---
>
> Key: SPARK-25494
> URL: https://issues.apache.org/jira/browse/SPARK-25494
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 2.5.0
>
>
> This ticket proposes to upgrade Spark's use of Janino from 3.0.9 to 3.0.10.
> Note that 3.0.10 is an out-of-band release specifically for fixing an integer 
> overflow issue in Janino's {{ClassFile}} reader. It is otherwise exactly the 
> same as 3.0.9, so it's a low-risk and compatible upgrade.
> The integer overflow issue affects Spark SQL's codegen stats collection: when 
> a generated Class file is huge, especially when the constant pool size is 
> above {{Short.MAX_VALUE}}, Janino's {{ClassFile}} reader will throw an 
> exception when Spark wants to parse the generated Class file to collect 
> stats. So we'll miss the stats of some huge Class files.
> The Janino fix is tracked by this issue: 
> https://github.com/janino-compiler/janino/issues/58



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24777) Add write benchmark for AVRO

2018-09-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24777.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.4.0

> Add write benchmark for AVRO
> 
>
> Key: SPARK-24777
> URL: https://issues.apache.org/jira/browse/SPARK-24777
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25450) PushProjectThroughUnion rule uses the same exprId for project expressions in each Union child, causing mistakes in constant propagation

2018-09-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25450.
-
   Resolution: Fixed
 Assignee: Maryann Xue
Fix Version/s: 2.4.0
   2.3.3

> PushProjectThroughUnion rule uses the same exprId for project expressions in 
> each Union child, causing mistakes in constant propagation
> ---
>
> Key: SPARK-25450
> URL: https://issues.apache.org/jira/browse/SPARK-25450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Major
> Fix For: 2.3.3, 2.4.0
>
>
> The problem was caused by the PushProjectThroughUnion rule, which, when 
> creating a new Project for each child of the Union, uses the same exprId for 
> expressions at the same position. This is wrong because, for each child of the 
> Union, the expressions are all independent, and it can lead to a wrong result 
> if another rule like FoldablePropagation kicks in, treating two different 
> expressions as the same.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25419) Parquet predicate pushdown improvement

2018-09-18 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25419:
---

Assignee: Yuming Wang

> Parquet predicate pushdown improvement
> --
>
> Key: SPARK-25419
> URL: https://issues.apache.org/jira/browse/SPARK-25419
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Parquet predicate pushdown support: ByteType, ShortType, DecimalType, 
> DateType, TimestampType. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25458) Support FOR ALL COLUMNS in ANALYZE TABLE

2018-09-18 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25458:
---

 Summary: Support FOR ALL COLUMNS in ANALYZE TABLE 
 Key: SPARK-25458
 URL: https://issues.apache.org/jira/browse/SPARK-25458
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.5.0
Reporter: Xiao Li


Currently, to collect the statistics of all the columns, users need to specify 
the names of all the columns when calling the command "ANALYZE TABLE ... FOR 
COLUMNS...". This is not user-friendly. Instead, we can introduce the following 
SQL command to achieve it without specifying the column names.
{code:java}
   ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR ALL COLUMNS;
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24151) CURRENT_DATE, CURRENT_TIMESTAMP incorrectly resolved as column names when caseSensitive is enabled

2018-09-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24151.
-
   Resolution: Fixed
 Assignee: James Thompson
Fix Version/s: 2.4.0

> CURRENT_DATE, CURRENT_TIMESTAMP incorrectly resolved as column names when 
> caseSensitive is enabled
> --
>
> Key: SPARK-24151
> URL: https://issues.apache.org/jira/browse/SPARK-24151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: James Thompson
>Assignee: James Thompson
>Priority: Major
> Fix For: 2.4.0
>
>
> After this change: https://issues.apache.org/jira/browse/SPARK-22333
> Running SQL such as "CURRENT_TIMESTAMP" can fail when spark.sql.caseSensitive 
> has been enabled:
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`CURRENT_TIMESTAMP`' 
> given input columns: [col1]{code}
> This is due to the fact that the analyzer incorrectly uses a case-sensitive 
> resolver to resolve the function. I will submit a PR with a fix + test for 
> this.
>  
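> A repro sketch based on the description above (assumed session setup):
> {code:scala}
> spark.conf.set("spark.sql.caseSensitive", "true")
> // Before the fix, this failed with:
> //   AnalysisException: cannot resolve '`CURRENT_TIMESTAMP`' ...
> spark.sql("SELECT CURRENT_TIMESTAMP").show()
> {code}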



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


