[jira] [Updated] (SPARK-31156) DataFrameStatFunctions API is not consistent with respect to Column type

2020-04-05 Thread Oleksii Kachaiev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleksii Kachaiev updated SPARK-31156:
-
Description: 
Some functions from the {{org.apache.spark.sql.DataFrameStatFunctions}} class 
accept {{org.apache.spark.sql.Column}} as an argument:
 * {{bloomFilter}}
 * {{countMinSketch}}
 * sampleBy

while the rest of the functions accept only {{String}} (or collections of 
{{String}}s, respectively):
 * {{approxQuantile}}
 * {{corr}}
 * {{cov}}
 * {{crosstab}}
 * {{freqItems}}
 * {{sampleBy}}

  was:
Some functions from {{org.apache.spark.sql.DataFrameStatFunctions}} class 
accepts {{org.apache.spark.sql.Column}} as an argument:
 * {{bloomFilter}}
 * {{countMinSketch}}

When the rest of the functions accept only {{String}} (or collections of 
{{String}}'s respectively):
 * {{approxQuantile}}
 * {{corr}}
 * {{cov}}
 * {{crosstab}}
 * {{freqItems}}
 * {{sampleBy}}


> DataFrameStatFunctions API is not consistent with respect to Column type
> 
>
> Key: SPARK-31156
> URL: https://issues.apache.org/jira/browse/SPARK-31156
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.4.4
>Reporter: Oleksii Kachaiev
>Priority: Minor
>
> Some functions from the {{org.apache.spark.sql.DataFrameStatFunctions}} class 
> accept {{org.apache.spark.sql.Column}} as an argument:
>  * {{bloomFilter}}
>  * {{countMinSketch}}
>  * sampleBy
> while the rest of the functions accept only {{String}} (or collections of 
> {{String}}s, respectively):
>  * {{approxQuantile}}
>  * {{corr}}
>  * {{cov}}
>  * {{crosstab}}
>  * {{freqItems}}
>  * {{sampleBy}}






[jira] [Updated] (SPARK-31156) DataFrameStatFunctions API is not consistent with respect to Column type

2020-04-05 Thread Oleksii Kachaiev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleksii Kachaiev updated SPARK-31156:
-
Description: 
Some functions from the {{org.apache.spark.sql.DataFrameStatFunctions}} class 
accept {{org.apache.spark.sql.Column}} as an argument:
 * {{bloomFilter}}
 * {{countMinSketch}}
 * {{sampleBy}}

while the rest of the functions accept only {{String}} (or collections of 
{{String}}s, respectively):
 * {{approxQuantile}}
 * {{corr}}
 * {{cov}}
 * {{crosstab}}
 * {{freqItems}}

  was:
Some functions from {{org.apache.spark.sql.DataFrameStatFunctions}} class 
accepts {{org.apache.spark.sql.Column}} as an argument:
 * {{bloomFilter}}
 * {{countMinSketch}}
 * sampleBy

When the rest of the functions accept only {{String}} (or collections of 
{{String}}'s respectively):
 * {{approxQuantile}}
 * {{corr}}
 * {{cov}}
 * {{crosstab}}
 * {{freqItems}}
 * {{sampleBy}}


> DataFrameStatFunctions API is not consistent with respect to Column type
> 
>
> Key: SPARK-31156
> URL: https://issues.apache.org/jira/browse/SPARK-31156
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.4.4
>Reporter: Oleksii Kachaiev
>Priority: Minor
>
> Some functions from the {{org.apache.spark.sql.DataFrameStatFunctions}} class 
> accept {{org.apache.spark.sql.Column}} as an argument:
>  * {{bloomFilter}}
>  * {{countMinSketch}}
>  * {{sampleBy}}
> while the rest of the functions accept only {{String}} (or collections of 
> {{String}}s, respectively), as illustrated in the sketch after this list:
>  * {{approxQuantile}}
>  * {{corr}}
>  * {{cov}}
>  * {{crosstab}}
>  * {{freqItems}}
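
A minimal sketch of the inconsistency (assuming a {{spark}} session and the 2.4.x signatures; not code from the ticket):

{code:scala}
import org.apache.spark.sql.functions.col

val df = spark.range(100).withColumn("x", col("id") % 10)

// A Column is accepted here, so an arbitrary expression can be passed directly:
val bf = df.stat.bloomFilter(col("x"), 100L, 0.03)

// ...but the rest of the API only takes column names as strings, so an
// expression first has to be materialized as a named column:
val c = df.stat.corr("id", "x")
val q = df.stat.approxQuantile("x", Array(0.5), 0.01)
{code}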






[jira] [Created] (SPARK-31352) Add .asf.yaml to control Github settings

2020-04-05 Thread Kaxil Naik (Jira)
Kaxil Naik created SPARK-31352:
--

 Summary: Add .asf.yaml to control Github settings
 Key: SPARK-31352
 URL: https://issues.apache.org/jira/browse/SPARK-31352
 Project: Spark
  Issue Type: Task
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Kaxil Naik


We added an .asf.yaml file to the Apache Airflow project 
([PR|https://github.com/apache/airflow/pull/6689]), and I think it would be good 
to have Spark's website shown at the top of the GitHub repo.

Also, this would allow Spark's PMC members and committers to control common 
GitHub project settings themselves, without having to ask the Apache INFRA team.

More info: 
https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories

We (Airflow PMC and committers) used this file to enable GitHub issues, disable 
the PR merge button, etc.
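
For reference, a minimal sketch of what such a file could look like for Spark (key names follow the INFRA page linked above; the concrete values are illustrative assumptions, not a proposed configuration):

{code:yaml}
# .asf.yaml (illustrative values only)
github:
  description: "Apache Spark - A unified analytics engine for large-scale data processing"
  homepage: https://spark.apache.org/
  labels:
    - spark
    - big-data
  features:
    wiki: false
    issues: false
    projects: false
  enabled_merge_buttons:
    merge: false
    squash: true
    rebase: true
{code}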






[jira] [Created] (SPARK-31353) Set time zone in DateTimeBenchmark and DateTimeRebaseBenchmark

2020-04-05 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31353:
--

 Summary: Set time zone in DateTimeBenchmark and 
DateTimeRebaseBenchmark
 Key: SPARK-31353
 URL: https://issues.apache.org/jira/browse/SPARK-31353
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Performance of date-time functions can depend on the system JVM time zone or the 
SQL config spark.sql.session.timeZone. To avoid any fluctuations in benchmark 
results, this ticket aims to set a time zone explicitly in the date-time benchmarks 
DateTimeBenchmark and DateTimeRebaseBenchmark.
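
For context, a minimal sketch of the kind of pinning this implies (assuming a {{spark}} session; this is assumed usage, not the actual benchmark change, and the chosen zone is arbitrary):

{code:scala}
import java.util.TimeZone

// Pin both sources of time-zone dependence before measuring, so results do not
// vary with the host's default zone or session configuration:
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
{code}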






[jira] [Created] (SPARK-31354) SparkSession Lifecycle methods to fix memory leak

2020-04-05 Thread Vinoo Ganesh (Jira)
Vinoo Ganesh created SPARK-31354:


 Summary: SparkSession Lifecycle methods to fix memory leak
 Key: SPARK-31354
 URL: https://issues.apache.org/jira/browse/SPARK-31354
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Vinoo Ganesh


Follow-up to https://issues.apache.org/jira/browse/SPARK-27958 after the discussion 
on [https://github.com/apache/spark/pull/24807].

Let's instead expose methods that allow the user to manually clean up 
(terminate) a SparkSession and that also remove the listenerState from the 
context.
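
For context, a rough sketch of the behavior in question (assumed usage in a local session; the cleanup method name below is hypothetical, not the API from the PR):

{code:scala}
import org.apache.spark.sql.SparkSession

val first  = SparkSession.builder().master("local[*]").getOrCreate()
val second = first.newSession()   // shares the same SparkContext

// Today the only teardown is stop(), which stops the shared SparkContext for
// every session, while per-session listener state can stay registered:
second.stop()

// The idea is a session-scoped lifecycle method (hypothetical name) that
// removes the session's listener state without stopping the shared context:
// second.cleanup()
{code}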






[jira] [Commented] (SPARK-27958) Stopping a SparkSession should not always stop Spark Context

2020-04-05 Thread Vinoo Ganesh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17075850#comment-17075850
 ] 

Vinoo Ganesh commented on SPARK-27958:
--

Given the discussion on the PR, I filed 
https://issues.apache.org/jira/browse/SPARK-31354. I'll put up a new PR to 
address the addition of these non-behavior changing lifecycle methods. 

> Stopping a SparkSession should not always stop Spark Context
> 
>
> Key: SPARK-27958
> URL: https://issues.apache.org/jira/browse/SPARK-27958
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Vinoo Ganesh
>Priority: Major
>
> Creating a ticket to track the discussion here: 
> [http://mail-archives.apache.org/mod_mbox/spark-dev/201904.mbox/%3CCAO4re1=Nk1E1VwGzSZwQ5x0SY=_heupmed8n5yydccml_t5...@mail.gmail.com%3E]
> Right now, stopping a SparkSession stops the underlying SparkContext. This 
> behavior is not ideal and doesn't really make sense. Stopping a SparkSession 
> should only stop the SparkContext in the event that it is the only session.






[jira] [Created] (SPARK-31355) Document TABLESAMPLE in SQL Reference

2020-04-05 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31355:
--

 Summary: Document TABLESAMPLE in SQL Reference
 Key: SPARK-31355
 URL: https://issues.apache.org/jira/browse/SPARK-31355
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Huaxin Gao


Document TABLESAMPLE in SQL Reference
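
For reference, the kinds of statements such a page would cover (a sketch assuming the TABLESAMPLE forms currently accepted by the parser; {{t}} is a placeholder table):

{code:scala}
// Spark SQL supports sampling by percentage, by row count, and by bucket:
spark.sql("SELECT * FROM t TABLESAMPLE (10 PERCENT)")
spark.sql("SELECT * FROM t TABLESAMPLE (5 ROWS)")
spark.sql("SELECT * FROM t TABLESAMPLE (BUCKET 4 OUT OF 10)")
{code}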






[jira] [Created] (SPARK-31356) KeyValueGroupedDataset method to reduce and take values only

2020-04-05 Thread Martin Loncaric (Jira)
Martin Loncaric created SPARK-31356:
---

 Summary: KeyValueGroupedDataset method to reduce and take values 
only
 Key: SPARK-31356
 URL: https://issues.apache.org/jira/browse/SPARK-31356
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Martin Loncaric


Problem: in Datasets API, it is a very common pattern to do something like this 
whenever a complex reduce function is needed:

{code:scala}
ds
  .groupByKey(_.y)
  .reduceGroups((a, b) => {...})
  .map(_._2)
{code}

However, the {{.map(_._2)}} step (taking values and throwing keys away) 
unfortunately often ends up as an unnecessary serialization during the aggregation 
step, followed by {{DeserializeToObject + MapElements (from (K, V) => V) + 
SerializeFromObject}} in the optimized logical plan. In this example, it would 
be more ideal to either skip the deserialization/serialization or use something 
like {{Project (from (K, V) => V)}}. Even manually 
doing a {{.select(...).as[T]}} to replace the {{.map}} is quite tricky, because
* the columns are complicated, like {{[value, 
ReduceAggregator(my.data.type)]}}, and seem to be impossible to {{.select}}
* it breaks the nice type checking of Datasets

Proposal:
Change the {{KeyValueGroupedDataset.aggUntyped}} method to append (like 
{{KeyValueGroupedDataset.cogroup}}) both an {{Aggregate}} node and a 
{{SerializeFromObject}} node so that the Optimizer can eliminate the 
serialization when it is redundant. Change aggregations to emit deserialized 
results.

I had 2 ideas for what we could change: either add a new feature to 
{{.reduceGroupValues}} that projects to only the necessary columns, or do this 
improvement. I thought this would be a better solution because
* it will improve the performance of existing Spark applications with no 
modifications
* feature growth is undesirable

Uncertainties:
Affects Version: I'm not sure - if I submit a PR soon, can we get this into 
3.0? Or only 3.1? And I assume we're not adding new features to 2.4?
Complications: Are there any hazards in splitting Aggregation into Aggregation 
+ SerializeFromObject that I'm not aware of?






[jira] [Updated] (SPARK-31356) KeyValueGroupedDataset method to reduce and take values only

2020-04-05 Thread Martin Loncaric (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Loncaric updated SPARK-31356:

Description: 
Problem: in Datasets API, it is a very common pattern to do something like this 
whenever a complex reduce function is needed:

{code:scala}
ds
  .groupByKey(_.y)
  .reduceGroups((a, b) => {...})
  .map(_._2)
{code}

However, the .map(_._2) step (taking values and throwing keys away) 
unfortunately often ends up as an unnecessary serialization during aggregation 
step, followed by {{DeserializeToObject + MapElements (from (K, V) => V) + 
SerializeFromObject}} in the optimized logical plan. In this example, it would 
be more ideal to either skip the deserialization/serialization or {{Project 
(from (K, V) => V)}}. Even manually doing a {{.select(...).as[T]}} to replace 
the `.map` is quite tricky, because
* the columns are complicated, like {{[value, 
ReduceAggregator(my.data.type)]}}, and seem to be impossible to {{.select}}
* it breaks the nice type checking of Datasets

Proposal:
Change the {{KeyValueGroupedDataset.aggUntyped}} method to append (like 
{{KeyValueGroupedDataset.cogroup}}) both an {{Aggregate}} node and a 
{{SerializeFromObject}} node so that the Optimizer can eliminate the 
serialization when it is redundant. Change aggregations to emit deserialized 
results.

I had 2 ideas for what we could change: either add a new feature to 
{{.reduceGroupValues}} that projects to only the necessary columns, or do this 
improvement. I thought this would be a better solution because
* it will improve the performance of existing Spark applications with no 
modifications
* feature growth is undesirable

Uncertainties:
Affects Version: I'm not sure - if I submit a PR soon, can we get this into 
3.0? Or only 3.1? And I assume we're not adding new features to 2.4?
Complications: Are there any hazards in splitting Aggregation into Aggregation 
+ SerializeFromObject that I'm not aware of?

  was:
Problem: in Datasets API, it is a very common pattern to do something like this 
whenever a complex reduce function is needed:

{code:scala}
ds
  .groupByKey(_.y)
  .reduceGroups((a, b) => {...})
  .map(_._2)
{code}

However, the .map(_._2) step (taking values and throwing keys away) 
unfortunately often ends up as an unnecessary serialization during aggregation 
step, followed by {{DeserializeToObject + MapElements (from (K, V) => V) + 
SerializeFromObject}} in the optimized logical plan. In this example, it would 
be more ideal something like {{Project (from (K, V) => V)}} or . Even manually 
doing a `.select(...).as[T]` to replace the `.map` is quite tricky, because
* the columns are complicated, like {{[value, 
ReduceAggregator(my.data.type)]}}, and seem to be impossible to {{.select}}
* it breaks the nice type checking of Datasets

Proposal:
Change the {{KeyValueGroupedDataset.aggUntyped}} method to (like 
{{KeyValueGroupedDataset.cogroup}} append add both an {{Aggregate node}} and a 
{{SerializeFromObject}} node so that the Optimizer can eliminate the 
serialization when it is redundant. Change aggregations to emit deserialized 
results.

I had 2 ideas for what we could change: either add a new feature to 
{{.reduceGroupValues}} that projects to only the necessary columns, or do this 
improvement. I thought this would be a better solution because
* it will improve the performance of existing Spark applications with no 
modifications
* feature growth is undesirable

Uncertainties:
Affects Version: I'm not sure - if I submit a PR soon, can we get this into 
3.0? Or only 3.1? And I assume we're not adding new features to 2.4?
Complications: Are there any hazards in splitting Aggregation into Aggregation 
+ SerializeFromObject that I'm not aware of?


> KeyValueGroupedDataset method to reduce and take values only
> 
>
> Key: SPARK-31356
> URL: https://issues.apache.org/jira/browse/SPARK-31356
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Martin Loncaric
>Priority: Major
>
> Problem: in Datasets API, it is a very common pattern to do something like 
> this whenever a complex reduce function is needed:
> {code:scala}
> ds
>   .groupByKey(_.y)
>   .reduceGroups((a, b) => {...})
>   .map(_._2)
> {code}
> However, the .map(_._2) step (taking values and throwing keys away) 
> unfortunately often ends up as an unnecessary serialization during 
> aggregation step, followed by {{DeserializeToObject + MapElements (from (K, 
> V) => V) + SerializeFromObject}} in the optimized logical plan. In this 
> example, it would be more ideal to either skip the 
> deserialization/serialization or {{Project (from (K, V) => V)}}. Even 
> manually doing a {{.select(...).as[T]}} to replace the `.map` is quite 
> tricky, beca

[jira] [Updated] (SPARK-31356) KeyValueGroupedDataset method to reduce and take values only

2020-04-05 Thread Martin Loncaric (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Loncaric updated SPARK-31356:

Description: 
Problem: in Datasets API, it is a very common pattern to do something like this 
whenever a complex reduce function is needed:

{code:scala}
ds
  .groupByKey(_.y)
  .reduceGroups((a, b) => {...})
  .map(_._2)
{code}

However, the .map(_._2) step (taking values and throwing keys away) 
unfortunately often ends up as an unnecessary serialization during aggregation 
step, followed by {{DeserializeToObject + MapElements (from (K, V) => V) + 
SerializeFromObject}} in the optimized logical plan. In this example, it would 
be more ideal to either skip the deserialization/serialization or {{Project 
(from (K, V) => V)}}. Even manually doing a {{.select(...).as[T]}} to replace 
the `.map` is quite tricky, because
* the columns are complicated, like {{[value, 
ReduceAggregator(my.data.type)]}}, and seem to be impossible to {{.select}}
* it breaks the nice type checking of Datasets

Proposal:
Change the {{KeyValueGroupedDataset.aggUntyped}} method to append (like 
{{KeyValueGroupedDataset.cogroup}}) both an {{Aggregate}} node and a 
{{SerializeFromObject}} node so that the Optimizer can eliminate the 
serialization when it is redundant. Change aggregations to emit deserialized 
results.

I had 2 ideas for what we could change: either add a new feature to 
{{.reduceGroupValues}} that projects to only the necessary columns, or do this 
improvement. I thought this would be a better solution because
* it will improve the performance of existing Spark applications with no 
modifications
* feature growth is undesirable

Uncertainties:
Affects Version: I'm not sure - if I submit a PR soon, can we get this into 
3.0? Or only 3.1? And I assume we're not adding new features to 2.4?
Complications: Are there any hazards in splitting Aggregation into Aggregation 
+ SerializeFromObject that I'm not aware of?

  was:
Problem: in Datasets API, it is a very common pattern to do something like this 
whenever a complex reduce function is needed:

{code:scala}
ds
  .groupByKey(_.y)
  .reduceGroups((a, b) => {...})
  .map(_._2)
{code}

However, the .map(_._2) step (taking values and throwing keys away) 
unfortunately often ends up as an unnecessary serialization during aggregation 
step, followed by {{DeserializeToObject + MapElements (from (K, V) => V) + 
SerializeFromObject}} in the optimized logical plan. In this example, it would 
be more ideal to either skip the deserialization/serialization or {{Project 
(from (K, V) => V)}}. Even manually doing a {{.select(...).as[T]}} to replace 
the `.map` is quite tricky, because
* the columns are complicated, like {{[value, 
ReduceAggregator(my.data.type)]}}, and seem to be impossible to {{.select}}
* it breaks the nice type checking of Datasets

Proposal:
Change the {{KeyValueGroupedDataset.aggUntyped}} method to (like 
{{KeyValueGroupedDataset.cogroup}} append add both an {{Aggregate node}} and a 
{{SerializeFromObject}} node so that the Optimizer can eliminate the 
serialization when it is redundant. Change aggregations to emit deserialized 
results.

I had 2 ideas for what we could change: either add a new feature to 
{{.reduceGroupValues}} that projects to only the necessary columns, or do this 
improvement. I thought this would be a better solution because
* it will improve the performance of existing Spark applications with no 
modifications
* feature growth is undesirable

Uncertainties:
Affects Version: I'm not sure - if I submit a PR soon, can we get this into 
3.0? Or only 3.1? And I assume we're not adding new features to 2.4?
Complications: Are there any hazards in splitting Aggregation into Aggregation 
+ SerializeFromObject that I'm not aware of?


> KeyValueGroupedDataset method to reduce and take values only
> 
>
> Key: SPARK-31356
> URL: https://issues.apache.org/jira/browse/SPARK-31356
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Martin Loncaric
>Priority: Major
>
> Problem: in Datasets API, it is a very common pattern to do something like 
> this whenever a complex reduce function is needed:
> {code:scala}
> ds
>   .groupByKey(_.y)
>   .reduceGroups((a, b) => {...})
>   .map(_._2)
> {code}
> However, the .map(_._2) step (taking values and throwing keys away) 
> unfortunately often ends up as an unnecessary serialization during 
> aggregation step, followed by {{DeserializeToObject + MapElements (from (K, 
> V) => V) + SerializeFromObject}} in the optimized logical plan. In this 
> example, it would be more ideal to either skip the 
> deserialization/serialization or {{Project (from (K, V) => V)}}. Even 
> manually doing a {{.select(...).as[T]}} to replac

[jira] [Created] (SPARK-31357) Catalog API for View Metadata

2020-04-05 Thread John Zhuge (Jira)
John Zhuge created SPARK-31357:
--

 Summary: Catalog API for View Metadata
 Key: SPARK-31357
 URL: https://issues.apache.org/jira/browse/SPARK-31357
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: John Zhuge


SPARK-24252 added a catalog plugin system and a `TableCatalog` API that provides 
table metadata to Spark. This JIRA adds a `ViewCatalog` API for view metadata.

Details in 
[SPIP|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing].
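
By analogy with {{TableCatalog}}, a rough sketch of the shape such an API could take (method and type names here are assumptions for illustration; the authoritative definition is in the linked SPIP):

{code:scala}
// Hypothetical sketch only -- see the SPIP for the actual proposed interface.
// A minimal stand-in for the metadata a view catalog would return:
case class ViewMetadata(name: String, sql: String, properties: Map[String, String])

// Rough view-level analogue of TableCatalog (method names assumed):
trait ViewCatalog {
  def listViews(namespace: Array[String]): Array[String]
  def loadView(name: String): ViewMetadata
  def createView(name: String, sql: String, properties: Map[String, String]): ViewMetadata
  def dropView(name: String): Boolean
}
{code}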






[jira] [Resolved] (SPARK-30921) Error using VectorAssembler after Pandas GROUPED_AGG UDF

2020-04-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30921.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/28089

> Error using VectorAssembler after Pandas GROUPED_AGG UDF
> 
>
> Key: SPARK-30921
> URL: https://issues.apache.org/jira/browse/SPARK-30921
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.4
> Environment: numpy==1.16.4
> pandas==0.23.4
> py4j==0.10.7
> pyarrow==0.8.0
> pyspark==2.4.4
> scikit-learn==0.19.1
> scipy==1.1.0
>Reporter: Tim Kellogg
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: test_dyn_pandas_function.py
>
>
> Using VectorAssembler after a Pandas GROUPED_AGG UDF and a join causes an 
> opaque error:
> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: apply_impl(input[1, struct, true].val)
> However, inserting a .cache() between the VectorAssembler and join seems to 
> prevent VectorAssembler & Pandas UDF from interacting to cause this error.
>  
> {{E py4j.protocol.Py4JJavaError: An error occurred while calling 
> o259.collectToPython.}}
> {{E : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> execute, tree:}}
> {{E Exchange hashpartitioning(foo_id_SummaryAggregator_AOG2FHR#34L, 4)}}
> {{E +- *(4) Filter AtLeastNNulls(n, 
> apply_impl(foo_explode_SummaryAggregator_AOG2FHR#20.val),apply_impl(foo_explode_SummaryAggregator_AOG2FHR#20.val))}}
> {{E +- Generate explode(foo#11), [foo_id_SummaryAggregator_AOG2FHR#34L], 
> true, [foo_explode_SummaryAggregator_AOG2FHR#20]}}
> {{E +- *(3) Project [foo#11, monotonically_increasing_id() AS 
> foo_id_SummaryAggregator_AOG2FHR#34L]}}
> {{E +- Scan ExistingRDD[foo#11,id#12L]}}
> {{E }}
> {{E at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)}}
> {{E at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)}}
> {{E at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)}}
> {{E at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)}}
> {{E at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)}}
> {{E at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:121)}}
> {{E at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)}}
> {{E at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)}}
> {{E at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)}}
> {{E at 
> org.apache.spark.sql.execution.python.AggregateInPandasExec.doExecute(AggregateInPandasExec.scala:80)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)}}
> {{E at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)}}
> {{E at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)}}
> {{E at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)}}
> {{E at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:121)}}
> {{E at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)}}
> {{E at 
> org.apache.spark.rdd.RDDOperationScope$.w

[jira] [Assigned] (SPARK-30921) Error using VectorAssembler after Pandas GROUPED_AGG UDF

2020-04-05 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-30921:
---

Assignee: L. C. Hsieh

> Error using VectorAssembler after Pandas GROUPED_AGG UDF
> 
>
> Key: SPARK-30921
> URL: https://issues.apache.org/jira/browse/SPARK-30921
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.4
> Environment: numpy==1.16.4
> pandas==0.23.4
> py4j==0.10.7
> pyarrow==0.8.0
> pyspark==2.4.4
> scikit-learn==0.19.1
> scipy==1.1.0
>Reporter: Tim Kellogg
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: test_dyn_pandas_function.py
>
>
> Using VectorAssembler after a Pandas GROUPED_AGG UDF and a join causes an 
> opaque error:
> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: apply_impl(input[1, struct, true].val)
> However, inserting a .cache() between the VectorAssembler and join seems to 
> prevent VectorAssembler & Pandas UDF from interacting to cause this error.
>  
> {{E py4j.protocol.Py4JJavaError: An error occurred while calling 
> o259.collectToPython.}}
> {{E : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> execute, tree:}}
> {{E Exchange hashpartitioning(foo_id_SummaryAggregator_AOG2FHR#34L, 4)}}
> {{E +- *(4) Filter AtLeastNNulls(n, 
> apply_impl(foo_explode_SummaryAggregator_AOG2FHR#20.val),apply_impl(foo_explode_SummaryAggregator_AOG2FHR#20.val))}}
> {{E +- Generate explode(foo#11), [foo_id_SummaryAggregator_AOG2FHR#34L], 
> true, [foo_explode_SummaryAggregator_AOG2FHR#20]}}
> {{E +- *(3) Project [foo#11, monotonically_increasing_id() AS 
> foo_id_SummaryAggregator_AOG2FHR#34L]}}
> {{E +- Scan ExistingRDD[foo#11,id#12L]}}
> {{E }}
> {{E at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)}}
> {{E at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)}}
> {{E at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)}}
> {{E at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)}}
> {{E at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)}}
> {{E at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:121)}}
> {{E at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)}}
> {{E at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)}}
> {{E at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)}}
> {{E at 
> org.apache.spark.sql.execution.python.AggregateInPandasExec.doExecute(AggregateInPandasExec.scala:80)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)}}
> {{E at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)}}
> {{E at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)}}
> {{E at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)}}
> {{E at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:121)}}
> {{E at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)}}
> {{E at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)}}
> {{E at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}}

[jira] [Updated] (SPARK-31356) Splitting Aggregate node into separate Aggregate and Serialize for Optimizer

2020-04-05 Thread Martin Loncaric (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Loncaric updated SPARK-31356:

Summary: Splitting Aggregate node into separate Aggregate and Serialize for 
Optimizer  (was: KeyValueGroupedDataset method to reduce and take values only)

> Splitting Aggregate node into separate Aggregate and Serialize for Optimizer
> 
>
> Key: SPARK-31356
> URL: https://issues.apache.org/jira/browse/SPARK-31356
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Martin Loncaric
>Priority: Major
>
> Problem: in Datasets API, it is a very common pattern to do something like 
> this whenever a complex reduce function is needed:
> {code:scala}
> ds
>   .groupByKey(_.y)
>   .reduceGroups((a, b) => {...})
>   .map(_._2)
> {code}
> However, the .map(_._2) step (taking values and throwing keys away) 
> unfortunately often ends up as an unnecessary serialization during 
> aggregation step, followed by {{DeserializeToObject + MapElements (from (K, 
> V) => V) + SerializeFromObject}} in the optimized logical plan. In this 
> example, it would be more ideal to either skip the 
> deserialization/serialization or {{Project (from (K, V) => V)}}. Even 
> manually doing a {{.select(...).as[T]}} to replace the `.map` is quite 
> tricky, because
> * the columns are complicated, like {{[value, 
> ReduceAggregator(my.data.type)]}}, and seem to be impossible to {{.select}}
> * it breaks the nice type checking of Datasets
> Proposal:
> Change the {{KeyValueGroupedDataset.aggUntyped}} method to append (like 
> {{KeyValueGroupedDataset.cogroup}}) both an {{Aggregate}} node and 
> a {{SerializeFromObject}} node so that the Optimizer can eliminate the 
> serialization when it is redundant. Change aggregations to emit deserialized 
> results.
> I had 2 ideas for what we could change: either add a new feature to 
> {{.reduceGroupValues}} that projects to only the necessary columns, or do 
> this improvement. I thought this would be a better solution because
> * it will improve the performance of existing Spark applications with no 
> modifications
> * feature growth is undesirable
> Uncertainties:
> Affects Version: I'm not sure - if I submit a PR soon, can we get this into 
> 3.0? Or only 3.1? And I assume we're not adding new features to 2.4?
> Complications: Are there any hazards in splitting Aggregation into 
> Aggregation + SerializeFromObject that I'm not aware of?
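
For anyone reproducing the plan shape described above, a minimal sketch (assuming a spark-shell session; the case class and data are illustrative):

{code:scala}
import spark.implicits._

case class Rec(y: String, v: Long)
val ds = Seq(Rec("a", 1L), Rec("a", 2L), Rec("b", 3L)).toDS()

// The optimized plan printed here ends in the DeserializeToObject +
// MapElements + SerializeFromObject tail discussed above; the proposal is to
// arrange the Aggregate output so the optimizer can elide that round-trip.
ds.groupByKey(_.y)
  .reduceGroups((a, b) => if (a.v >= b.v) a else b)
  .map(_._2)
  .explain(true)
{code}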






[jira] [Assigned] (SPARK-31353) Set time zone in DateTimeBenchmark and DateTimeRebaseBenchmark

2020-04-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31353:
---

Assignee: Maxim Gekk

> Set time zone in DateTimeBenchmark and DateTimeRebaseBenchmark
> --
>
> Key: SPARK-31353
> URL: https://issues.apache.org/jira/browse/SPARK-31353
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Performance of date-time functions can depend on the system JVM time zone or 
> the SQL config spark.sql.session.timeZone. To avoid any fluctuations in 
> benchmark results, this ticket aims to set a time zone explicitly in the 
> date-time benchmarks DateTimeBenchmark and DateTimeRebaseBenchmark.






[jira] [Resolved] (SPARK-31353) Set time zone in DateTimeBenchmark and DateTimeRebaseBenchmark

2020-04-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31353.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28127
[https://github.com/apache/spark/pull/28127]

> Set time zone in DateTimeBenchmark and DateTimeRebaseBenchmark
> --
>
> Key: SPARK-31353
> URL: https://issues.apache.org/jira/browse/SPARK-31353
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Performance of date-time functions can depend on the system JVM time zone or 
> the SQL config spark.sql.session.timeZone. To avoid any fluctuations in 
> benchmark results, this ticket aims to set a time zone explicitly in the 
> date-time benchmarks DateTimeBenchmark and DateTimeRebaseBenchmark.






[jira] [Commented] (SPARK-25102) Write Spark version to ORC/Parquet file metadata

2020-04-05 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076044#comment-17076044
 ] 

Wenchen Fan commented on SPARK-25102:
-

I don't plan to have more releases, but 2.4.6 is not released yet, right? Maybe 
"we will maintain the 2.4 line for a long time" is not accurate; it should be "the 
2.4 line will still be used by many people for a long time".

> Write Spark version to ORC/Parquet file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark writes the Spark version number into Hive table properties with 
> `spark.sql.create.version`.
> {code}
> parameters:{
>   spark.sql.sources.schema.part.0={
> "type":"struct",
> "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
>   },
>   transient_lastDdlTime=1541142761, 
>   spark.sql.sources.schema.numParts=1,
>   spark.sql.create.version=2.4.0
> }
> {code}
> This issue aims to write the Spark version to ORC/Parquet file metadata with 
> `org.apache.spark.sql.create.version`. It's different from the Hive table 
> property key `spark.sql.create.version`. It seems that we cannot change that 
> for backward compatibility (even in Apache Spark 3.0).
> *ORC*
> {code}
> User Metadata:
>   org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
> {code}
> *PARQUET*
> {code}
> file:
> file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
> creator: parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
> {code}






[jira] [Resolved] (SPARK-31224) Support views in both SHOW CREATE TABLE and SHOW CREATE TABLE AS SERDE

2020-04-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31224.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27984
[https://github.com/apache/spark/pull/27984]

> Support views in both SHOW CREATE TABLE and SHOW CREATE TABLE AS SERDE
> --
>
> Key: SPARK-31224
> URL: https://issues.apache.org/jira/browse/SPARK-31224
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> For now, the {{SHOW CREATE TABLE}} command doesn't support views, but {{SHOW 
> CREATE TABLE AS SERDE}} does. Since the view syntax is the same 
> between Hive DDL and Spark DDL, we should be able to support views in both 
> commands.
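
For illustration, a sketch of the two statements involved (assuming a Hive-enabled session and an existing view {{v}}):

{code:scala}
spark.sql("CREATE VIEW v AS SELECT 1 AS id")

// The Hive-compatible variant already handles views:
spark.sql("SHOW CREATE TABLE v AS SERDE").show(truncate = false)

// This ticket makes the Spark-DDL variant work for views as well:
spark.sql("SHOW CREATE TABLE v").show(truncate = false)
{code}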






[jira] [Assigned] (SPARK-31343) Check codegen does not fail on expressions with special characters in string parameters

2020-04-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31343:
---

Assignee: Maxim Gekk

> Check codegen does not fail on expressions with special characters in string 
> parameters
> ---
>
> Key: SPARK-31343
> URL: https://issues.apache.org/jira/browse/SPARK-31343
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Add tests similar to those added by the PR 
> https://github.com/apache/spark/pull/20182, but for from_utc_timestamp / 
> to_utc_timestamp.






[jira] [Resolved] (SPARK-31343) Check codegen does not fail on expressions with special characters in string parameters

2020-04-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31343.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28115
[https://github.com/apache/spark/pull/28115]

> Check codegen does not fail on expressions with special characters in string 
> parameters
> ---
>
> Key: SPARK-31343
> URL: https://issues.apache.org/jira/browse/SPARK-31343
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add tests similar to those added by the PR 
> https://github.com/apache/spark/pull/20182, but for from_utc_timestamp / 
> to_utc_timestamp.






[jira] [Resolved] (SPARK-31316) SQLQueryTestSuite: Display the total generate time for generated java code.

2020-04-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31316.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28105
[https://github.com/apache/spark/pull/28105]

> SQLQueryTestSuite: Display the total generate time for generated java code.
> ---
>
> Key: SPARK-31316
> URL: https://issues.apache.org/jira/browse/SPARK-31316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.1.0
>
>
> SQLQueryTestSuite spends a lot of time generating Java code when whole-stage 
> codegen is used.
> We should display the total generation time for the generated Java code.






[jira] [Assigned] (SPARK-31316) SQLQueryTestSuite: Display the total generate time for generated java code.

2020-04-05 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31316:
---

Assignee: jiaan.geng

> SQLQueryTestSuite: Display the total generate time for generated java code.
> ---
>
> Key: SPARK-31316
> URL: https://issues.apache.org/jira/browse/SPARK-31316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> SQLQueryTestSuite spends a lot of time generating Java code when whole-stage 
> codegen is used.
> We should display the total generation time for the generated Java code.






[jira] [Created] (SPARK-31358) Document aggregate filters in SQL document

2020-04-05 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-31358:


 Summary: Document aggregate filters in SQL document
 Key: SPARK-31358
 URL: https://issues.apache.org/jira/browse/SPARK-31358
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro
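
For reference, the syntax to be documented (a sketch; {{t}} is a placeholder table):

{code:scala}
// The ANSI aggregate FILTER clause this page would describe:
spark.sql("""
  SELECT count(*)                              AS all_rows,
         count(*) FILTER (WHERE x > 0)         AS positive_rows,
         sum(x)   FILTER (WHERE x IS NOT NULL) AS total
  FROM t
  GROUP BY y
""")
{code}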









[jira] [Updated] (SPARK-31358) Document aggregate filters in SQL references

2020-04-05 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-31358:
-
Summary: Document aggregate filters in SQL references  (was: Document 
aggregate filters in SQL document)

> Document aggregate filters in SQL references
> 
>
> Key: SPARK-31358
> URL: https://issues.apache.org/jira/browse/SPARK-31358
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>



