[jira] [Commented] (SPARK-28845) Enable spark.sql.execution.sortBeforeRepartition only for retried stages

2020-02-11 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035107#comment-17035107
 ] 

Wenchen Fan commented on SPARK-28845:
-

I'm a little hesitant to abandon the sort approach completely. If a stage has 
many tasks, always retrying the entire stage may mean we never finish it and 
just keep retrying.

Performance-wise, I think it's better to combine the sort and retry approaches. 
But as [~XuanYuan] said, this turned out to be too difficult and we didn't 
manage it.
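
For context, the flag in question can be toggled per session (a minimal sketch, 
assuming it is settable at runtime like other SQL confs):
{code:java}
// Sort-based approach on for the current session (sketch).
spark.conf.set("spark.sql.execution.sortBeforeRepartition", "true")
// ... or rely purely on stage-retry handling instead:
spark.conf.set("spark.sql.execution.sortBeforeRepartition", "false")
{code}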

> Enable spark.sql.execution.sortBeforeRepartition only for retried stages
> 
>
> Key: SPARK-28845
> URL: https://issues.apache.org/jira/browse/SPARK-28845
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> To fix the correctness bug of SPARK-28699, we disable radix sort for the 
> repartition scenario in Spark SQL. This causes a performance regression.
> To limit the overhead, we will optimize by enabling the sort for the 
> repartition operation only when a stage retry happens. This work depends on 
> SPARK-25341.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30795) Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()

2020-02-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30795.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/27544

> Spark SQL codegen's code() interpolator should treat escapes like Scala's 
> StringContext.s()
> ---
>
> Key: SPARK-30795
> URL: https://issues.apache.org/jira/browse/SPARK-30795
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 3.0.0
>Reporter: Kris Mok
>Priority: Major
> Fix For: 3.1.0
>
>
> The {{code()}} string interpolator in Spark SQL's code generator should treat 
> escapes like Scala's built-in {{StringContext.s()}} interpolator, i.e. it 
> should process escapes in the code parts and leave escapes in the input 
> arguments untouched.
> For example,
> {code}
> val arg = "This is an argument."
> val str = s"This is string part 1. $arg This is string part 2."
> val code = code"This is string part 1. $arg This is string part 2."
> assert(code.toString == str)
> {code}
> We should expect the {{code()}} interpolator to produce the same thing as the 
> {{StringContext.s()}} interpolator: only escapes in the string parts should be 
> processed, while the args should be kept verbatim.
> But in the current implementation, due to the eager folding of code parts and 
> literal input args, escape processing is incorrectly applied to both the code 
> parts and the literal args.
> That causes a problem when an arg contains escape sequences that need to be 
> preserved in the final generated code string. For example, in {{Like}} 
> expression's codegen, there's an ugly workaround for this bug:
> {code}
>   // We need double escape to avoid org.codehaus.commons.compiler.CompileException.
>   // '\\' will cause exception 'Single quote must be backslash-escaped in character literal'.
>   // '\"' will cause exception 'Line break in literal not allowed'.
>   val newEscapeChar = if (escapeChar == '\"' || escapeChar == '\\') {
>     s"""\\$escapeChar"""
>   } else {
>     escapeChar
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25929) Support metrics with tags

2020-02-11 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035074#comment-17035074
 ] 

John Zhuge commented on SPARK-25929:


Yeah, I can feel the pain.

When I ingest into InfluxDB, I have to use its [Graphite 
templates|https://github.com/influxdata/influxdb/tree/v1.7.10/services/graphite#templates],
 e.g.,

{noformat}
"*.*.*.DAGScheduler.*.* application.app_id.executor_id.measurement.type.qty name=DAGScheduler",
"*.*.*.ExecutorAllocationManager.*.* application.app_id.executor_id.measurement.type.qty name=ExecutorAllocationManager",
"*.*.*.ExternalShuffle.*.* application.app_id.executor_id.measurement.type.qty name=ExternalShuffle",
{noformat}

They're hard to get right, they easily become obsolete, and they don't support 
multiple versions.

> Support metrics with tags
> -
>
> Key: SPARK-25929
> URL: https://issues.apache.org/jira/browse/SPARK-25929
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: John Zhuge
>Priority: Major
>
> For better integration with DBs that support tags/labels, e.g., InfluxDB, 
> Prometheus, Atlas, etc.
> We should continue to support the current Graphite-style metrics.
> Dropwizard Metrics v5 supports tags. It has been in RC status since Feb. 
> Currently 
> `[5.0.0-rc2|https://github.com/dropwizard/metrics/releases/tag/v5.0.0-rc2]` 
> is in Maven.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30796) Add parameter position for REGEXP_REPLACE

2020-02-11 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30796:
---
Parent: SPARK-27764
Issue Type: Sub-task  (was: New Feature)

> Add parameter position for REGEXP_REPLACE
> -
>
> Key: SPARK-30796
> URL: https://issues.apache.org/jira/browse/SPARK-30796
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
>  
> postgresql
> format: {{regexp_replace}}(_source_, _pattern_, _replacement_ [, _flags_])
> reference: [https://www.postgresql.org/docs/11/functions-matching.html]
> vertica
> REGEXP_REPLACE( _string_, _target_ [, _replacement_ [, _position_ [, 
> _occurrence_ ... [, _regexp_modifiers_ ] ] ] ] )
> reference: 
> [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace]
> oracle
> [https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490]
> redshift
> https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30796) Add parameter position for REGEXP_REPLACE

2020-02-11 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035066#comment-17035066
 ] 

jiaan.geng commented on SPARK-30796:


I'm working on it.

> Add parameter position for REGEXP_REPLACE
> -
>
> Key: SPARK-30796
> URL: https://issues.apache.org/jira/browse/SPARK-30796
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
>  
> postgresql
> format: {{regexp_replace}}(_source_, _pattern_, _replacement_ [, _flags_])
> reference: [https://www.postgresql.org/docs/11/functions-matching.html]
> vertica
> REGEXP_REPLACE( _string_, _target_ [, _replacement_ [, _position_ [, 
> _occurrence_ ... [, _regexp_modifiers_ ] ] ] ] )
> reference: 
> [https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace]
> oracle
> [https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490]
> redshift
> https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30796) Add parameter position for REGEXP_REPLACE

2020-02-11 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30796:
--

 Summary: Add parameter position for REGEXP_REPLACE
 Key: SPARK-30796
 URL: https://issues.apache.org/jira/browse/SPARK-30796
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: jiaan.geng


 

postgresql

format: {{regexp_replace}}(_source_, _pattern_, _replacement_ [, _flags_])

reference: [https://www.postgresql.org/docs/11/functions-matching.html]

vertica

REGEXP_REPLACE( _string_, _target_ [, _replacement_ [, _position_ [, 
_occurrence_ ... [, _regexp_modifiers_ ] ] ] ] )

reference: 
[https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace]

oracle

[https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490]

redshift

https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html
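
For comparison, Spark's built-in regexp_replace currently takes only (source, 
pattern, replacement); the position argument is the new piece. A quick sketch of 
today's behaviour, with the proposed call shown commented out as a hypothetical:
{code:java}
// Current Spark SQL behaviour: three arguments only.
spark.sql("SELECT regexp_replace('abc123abc', '[0-9]+', '#')").show()   // abc#abc

// Proposed (hypothetical, mirroring Vertica/Oracle): start matching at a given position.
// spark.sql("SELECT regexp_replace('abc123abc', '[0-9]+', '#', 4)")
{code}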



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30722) Document type hints in pandas UDF

2020-02-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30722:


Assignee: Hyukjin Kwon

> Document type hints in pandas UDF
> -
>
> Key: SPARK-30722
> URL: https://issues.apache.org/jira/browse/SPARK-30722
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> We should document the new type hints for pandas UDF introduced at 
> SPARK-28264.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30722) Document type hints in pandas UDF

2020-02-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30722.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27466
[https://github.com/apache/spark/pull/27466]

> Document type hints in pandas UDF
> -
>
> Key: SPARK-30722
> URL: https://issues.apache.org/jira/browse/SPARK-30722
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> We should document the new type hints for pandas UDF introduced at 
> SPARK-28264.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30780) LocalRelation should use emptyRDD if it is empty

2020-02-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30780.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27530
[https://github.com/apache/spark/pull/27530]

> LocalRelation should use emptyRDD if it is empty
> 
>
> Key: SPARK-30780
> URL: https://issues.apache.org/jira/browse/SPARK-30780
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.0.0
>
>
> LocalRelation creates an RDD of a single partition when it is empty. This is 
> somewhat unexpected, and can lead to unnecessary work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30795) Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()

2020-02-11 Thread Kris Mok (Jira)
Kris Mok created SPARK-30795:


 Summary: Spark SQL codegen's code() interpolator should treat 
escapes like Scala's StringContext.s()
 Key: SPARK-30795
 URL: https://issues.apache.org/jira/browse/SPARK-30795
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 3.0.0
Reporter: Kris Mok


The {{code()}} string interpolator in Spark SQL's code generator should treat 
escapes like Scala's built-in {{StringContext.s()}} interpolator, i.e. it should 
process escapes in the code parts and leave escapes in the input arguments 
untouched.

For example,
{code}
val arg = "This is an argument."
val str = s"This is string part 1. $arg This is string part 2."
val code = code"This is string part 1. $arg This is string part 2."
assert(code.toString == str)
{code}
We should expect the {{code()}} interpolator to produce the same thing as the 
{{StringContext.s()}} interpolator: only escapes in the string parts should be 
processed, while the args should be kept verbatim.

But in the current implementation, due to the eager folding of code parts and 
literal input args, escape processing is incorrectly applied to both the code 
parts and the literal args.
That causes a problem when an arg contains escape sequences that need to be 
preserved in the final generated code string. For example, in {{Like}} 
expression's codegen, there's an ugly workaround for this bug:
{code}
  // We need double escape to avoid org.codehaus.commons.compiler.CompileException.
  // '\\' will cause exception 'Single quote must be backslash-escaped in character literal'.
  // '\"' will cause exception 'Line break in literal not allowed'.
  val newEscapeChar = if (escapeChar == '\"' || escapeChar == '\\') {
    s"""\\$escapeChar"""
  } else {
    escapeChar
  }
{code}
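
For reference, a minimal illustration of the intended semantics using plain 
Scala's {{s()}} interpolator (which is what {{code()}} should match): escapes in 
the literal parts are processed, while interpolated arguments are inserted 
verbatim.
{code:java}
val arg = "a\\nb"        // four characters: 'a', '\', 'n', 'b' -- no real newline
val s1  = s"x\n$arg"     // the literal part's \n becomes a newline; arg is kept as-is
assert(s1.length == 6)   // 'x' + newline + the four characters of arg
{code}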



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30794) Stage Level scheduling: Add ability to set off heap memory

2020-02-11 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-30794:
-

 Summary: Stage Level scheduling: Add ability to set off heap memory
 Key: SPARK-30794
 URL: https://issues.apache.org/jira/browse/SPARK-30794
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Thomas Graves


For stage-level scheduling, ExecutorResourceRequests supports setting heap 
memory, pyspark memory, and memory overhead. We have not split out off-heap 
memory as its own configuration, so we should add it as an option.
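
A rough sketch of where this would sit next to the existing requests (the 
existing builder calls are as I recall them from the stage-level scheduling API 
and may differ slightly; the offHeapMemory call is the proposed addition and 
does not exist yet):
{code:java}
import org.apache.spark.resource.ExecutorResourceRequests

val execReqs = new ExecutorResourceRequests()
  .memory("4g")           // executor heap
  .memoryOverhead("1g")
  .pysparkMemory("2g")
// .offHeapMemory("2g")   // proposed by this ticket; hypothetical today
{code}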



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2020-02-11 Thread Giri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034875#comment-17034875
 ] 

Giri commented on SPARK-27913:
--

This issue doesn't exist in *spark-3.0.0-preview2 or in Spark 2.3*. Will this 
fix be ported to the 2.4.x branch?

It appears the issue is related to Spark taking the schema from the ORC files 
rather than from the metastore; the schema mismatch then causes an 
out-of-bounds exception when OrcDeserializer accesses a field that doesn't 
exist in the file.

 

I see logs like this:

 

20/02/11 14:30:38 INFO RecordReaderImpl: Reader schema not provided -- using 
file schema struct>
20/02/11 14:30:38 INFO RecordReaderImpl: Reader schema not provided -- using 
file schema struct>
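
If anyone needs a stop-gap, switching back to the Hive ORC reader may avoid the 
mismatch, since the original report ties the regression to moving 
spark.sql.orc.impl from 'hive' to 'native'. A sketch, with a hypothetical table 
name:
{code:java}
// Workaround sketch: fall back to the Hive ORC implementation for reads.
spark.conf.set("spark.sql.orc.impl", "hive")
spark.table("my_evolved_orc_table").show()   // hypothetical table name
{code}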

 

> Spark SQL's native ORC reader implements its own schema evolution
> -
>
> Key: SPARK-27913
> URL: https://issues.apache.org/jira/browse/SPARK-27913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Owen O'Malley
>Priority: Major
>
> ORC's reader handles a wide range of schema evolution, but the Spark SQL 
> native ORC bindings do not provide the desired schema to the ORC reader. This 
> causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28845) Enable spark.sql.execution.sortBeforeRepartition only for retried stages

2020-02-11 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034755#comment-17034755
 ] 

Thomas Graves commented on SPARK-28845:
---

[~cloud_fan] [~XuanYuan] I wanted to follow up on this with regard to 
[https://github.com/apache/spark/pull/25491]

It looks like this got closed because it's too difficult, but with SPARK-25341, 
do we need the sort at all? I didn't think we did, and if we do I would like to 
understand why. So I assume it comes down to performance.

> Enable spark.sql.execution.sortBeforeRepartition only for retried stages
> 
>
> Key: SPARK-28845
> URL: https://issues.apache.org/jira/browse/SPARK-28845
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> To fix the correctness bug of SPARK-28699, we disable radix sort for the 
> repartition scenario in Spark SQL. This causes a performance regression.
> To limit the overhead, we will optimize by enabling the sort for the 
> repartition operation only when a stage retry happens. This work depends on 
> SPARK-25341.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30793) Wrong truncations of timestamps before the epoch to minutes and seconds

2020-02-11 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30793:
--

 Summary: Wrong truncations of timestamps before the epoch to 
minutes and seconds
 Key: SPARK-30793
 URL: https://issues.apache.org/jira/browse/SPARK-30793
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Truncations to seconds and minutes of timestamps after the epoch are correct:
{code:sql}
spark-sql> select date_trunc('SECOND', '2020-02-11 00:01:02.123'), 
date_trunc('SECOND', '2020-02-11 00:01:02.789');
2020-02-11 00:01:02 2020-02-11 00:01:02
{code}
but truncations of timestamps before the epoch are incorrect:
{code:sql}
spark-sql> select date_trunc('SECOND', '1960-02-11 00:01:02.123'), 
date_trunc('SECOND', '1960-02-11 00:01:02.789');
1960-02-11 00:01:03 1960-02-11 00:01:03
{code}
The result must be *1960-02-11 00:01:02 1960-02-11 00:01:02*

The same for the MINUTE level:
{code:sql}
spark-sql> select date_trunc('MINUTE', '1960-02-11 00:01:01'), 
date_trunc('MINUTE', '1960-02-11 00:01:50');
1960-02-11 00:02:00 1960-02-11 00:02:00
{code}
The result must be 1960-02-11 00:01:00  1960-02-11 00:01:00
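
The symptom is what you get when truncation is computed with division that 
rounds toward zero instead of toward negative infinity for pre-epoch (negative) 
microsecond values. A minimal illustration of the difference (my reading of the 
likely cause, not a statement about the actual implementation):
{code:java}
// 1.5 seconds before the epoch, in microseconds.
val micros = -1500000L
val towardZero = (micros / 1000000L) * 1000000L                    // -1000000: rounds toward zero
val floored    = Math.floorDiv(micros, 1000000L) * 1000000L        // -2000000: correct SECOND truncation
{code}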



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30792) Dataframe .limit() performance improvements

2020-02-11 Thread Nathan Grand (Jira)
Nathan Grand created SPARK-30792:


 Summary: Dataframe .limit() performance improvements
 Key: SPARK-30792
 URL: https://issues.apache.org/jira/browse/SPARK-30792
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Nathan Grand


It seems that
{code:java}
.limit(){code}
is much less efficient than one would expect when reading a large dataset from 
Parquet:
{code:java}
val sample = spark.read.parquet("/Some/Large/Data.parquet").limit(1000)
// Do something with sample ...{code}
This might take hours, depending on the size of the data.

By comparison,
{code:java}
spark.read.parquet("/Some/Large/Data.parquet").show(1000){code}
is essentially instant.
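
As a practical workaround, collecting a bounded sample to the driver with take 
(or head) tends to behave much more like show here, since it fetches rows 
incrementally rather than running a full scan (a sketch, not a fix for limit 
itself):
{code:java}
// Workaround sketch: take() retrieves at most 1000 rows, scanning partitions incrementally.
val sample = spark.read.parquet("/Some/Large/Data.parquet").take(1000)   // Array[Row]
{code}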

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30783) Hive 2.3 profile should exclude hive-service-rpc

2020-02-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30783.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27533
[https://github.com/apache/spark/pull/27533]

> Hive 2.3 profile should exclude hive-service-rpc
> 
>
> Key: SPARK-30783
> URL: https://issues.apache.org/jira/browse/SPARK-30783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: hive-service-rpc-2.3.6-classes, 
> spark-hive-thriftserver_2.12-3.0.0-20200207.021914-364-classes
>
>
> hive-service-rpc 2.3.6 and Spark SQL's thrift server module have duplicate 
> classes. Leaving hive-service-rpc 2.3.6 on the class path means that Spark 
> can pick up classes defined in Hive instead of in its own thrift server 
> module, which can cause hard-to-debug runtime errors (due to class loading 
> order) and compilation errors for applications that depend on Spark.
>  
> If you compare hive-service-rpc 2.3.6's jar 
> ([https://search.maven.org/remotecontent?filepath=org/apache/hive/hive-service-rpc/2.3.6/hive-service-rpc-2.3.6.jar])
>  and spark thrift server's jar (e.g. 
> [https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-hive-thriftserver_2.12/3.0.0-SNAPSHOT/spark-hive-thriftserver_2.12-3.0.0-20200207.021914-364.jar),]
>  you will see that all of the classes provided by hive-service-rpc-2.3.6.jar 
> are covered by the Spark thrift server's jar. I am attaching the lists of jar 
> contents for your reference.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27545) Update the Documentation for CACHE TABLE and UNCACHE TABLE

2020-02-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27545:
---

Assignee: Rakesh Raushan  (was: hantiantian)

> Update the Documentation for CACHE TABLE and UNCACHE TABLE
> --
>
> Key: SPARK-27545
> URL: https://issues.apache.org/jira/browse/SPARK-27545
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: hantiantian
>Assignee: Rakesh Raushan
>Priority: Major
> Fix For: 3.0.0
>
>
> spark-sql> cache table v1 as select * from a;
> spark-sql> uncache table v1;
> spark-sql> cache table v1 as select * from a;
> 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: 
> 0: get_table : db=apachespark tbl=a
> 2019-04-23 14:50:09,038 INFO 
> org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root 
> ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a
> Error in query: Temporary view 'v1' already exists;
> we should document it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30754) Reuse results of floorDiv in calculations of floorMod in DateTimeUtils

2020-02-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30754.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27491
[https://github.com/apache/spark/pull/27491]

> Reuse results of floorDiv in calculations of floorMod in DateTimeUtils
> --
>
> Key: SPARK-30754
> URL: https://issues.apache.org/jira/browse/SPARK-30754
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.1.0
>
>
> A couple of methods in DateTimeUtils call Math.floorDiv and Math.floorMod with 
> the same arguments, so the result of Math.floorDiv can be reused to compute 
> Math.floorMod. For example, this optimization can be applied to 
> microsToInstant and truncDate.
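
For reference, the identity that makes the reuse possible (from the 
java.lang.Math contract): floorMod(x, y) == x - floorDiv(x, y) * y, so once 
floorDiv is computed the mod is just a subtraction and a multiply:
{code:java}
val x = -7L
val y = 3L
val div = Math.floorDiv(x, y)   // -3
val mod = x - div * y           // 2
assert(mod == Math.floorMod(x, y))
{code}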



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30754) Reuse results of floorDiv in calculations of floorMod in DateTimeUtils

2020-02-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30754:


Assignee: Maxim Gekk

> Reuse results of floorDiv in calculations of floorMod in DateTimeUtils
> --
>
> Key: SPARK-30754
> URL: https://issues.apache.org/jira/browse/SPARK-30754
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> A couple of methods in DateTimeUtils call Math.floorDiv and Math.floorMod with 
> the same arguments, so the result of Math.floorDiv can be reused to compute 
> Math.floorMod. For example, this optimization can be applied to 
> microsToInstant and truncDate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27710) ClassNotFoundException: $line196400984558.$read$ in OuterScopes

2020-02-11 Thread Jelmer Kuperus (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034532#comment-17034532
 ] 

Jelmer Kuperus edited comment on SPARK-27710 at 2/11/20 3:07 PM:
-

This also happens in Apache Toree

 
{code:java}
case class AttributeRow(categoryId: String, key: String, count: Long, label: 
String)

val mySpark = spark
import mySpark.implicits._
spark.read.parquet("/user/jkuperus/foo").as[AttributeRow]
  .limit(1)
  .map(r => r)
  .show()
{code}
 

Gives

 
{noformat}
StackTrace: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
 at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
 at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485){noformat}
 


was (Author: jelmer):
This also happens in Apache Toree

 
{code:java}
val mySpark = spark
import mySpark.implicits._
spark.read.parquet("/user/jkuperus/foo").as[AttributeRow]
  .limit(1)
  .map(r => r)
  .show()
{code}
 

Gives

 
{noformat}
StackTrace: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
 at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
 at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485){noformat}
 

> ClassNotFoundException: $line196400984558.$read$ in OuterScopes
> ---
>
> Key: SPARK-27710
> URL: https://issues.apache.org/jira/browse/SPARK-27710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> My colleague hit the following exception when using Spark in a Zeppelin 
> notebook:
> {code:java}
> java.lang.ClassNotFoundException: $line196400984558.$read$
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
>   at 
> org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.sca

[jira] [Commented] (SPARK-27710) ClassNotFoundException: $line196400984558.$read$ in OuterScopes

2020-02-11 Thread Jelmer Kuperus (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034532#comment-17034532
 ] 

Jelmer Kuperus commented on SPARK-27710:


This also happens in Apache Toree

 
{code:java}
val mySpark = spark
import mySpark.implicits._
spark.read.parquet("/user/jkuperus/foo").as[AttributeRow]
  .limit(1)
  .map(r => r)
  .show()
{code}
 

Gives

 
{noformat}
StackTrace: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
 at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
 at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485){noformat}
 

> ClassNotFoundException: $line196400984558.$read$ in OuterScopes
> ---
>
> Key: SPARK-27710
> URL: https://issues.apache.org/jira/browse/SPARK-27710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> My colleague hit the following exception when using Spark in a Zeppelin 
> notebook:
> {code:java}
> java.lang.ClassNotFoundException: $line196400984558.$read$
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
>   at 
> org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$class.prepareArguments(objects.scala:98)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.prepareArguments(objects.scala:431)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:483)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
>   at 
> org.apache.spark.sql.execution.DeserializeToObjectExec.doConsume(objects.scala:84)
>   at 
> org.apache.spark.sql.execu

[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark

2020-02-11 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034492#comment-17034492
 ] 

Jorge Machado commented on SPARK-24615:
---

Yeah, that was my question. Thanks for the response. I will look at rapid.ai 
and try to use it inside a partition or so... 

> SPIP: Accelerator-aware task scheduling for Spark
> -
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Thomas Graves
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, 
> SPIP_ Accelerator-aware scheduling.pdf
>
>
> (The JIRA received a major update on 2019/02/28. Some comments were based on 
> an earlier version. Please ignore them. New comments start at 
> [#comment-16778026].)
> h2. Background and Motivation
> GPUs and other accelerators have been widely used for accelerating special 
> workloads, e.g., deep learning and signal processing. While users from the AI 
> community use GPUs heavily, they often need Apache Spark to load and process 
> large datasets and to handle complex data scenarios like streaming. YARN and 
> Kubernetes already support GPUs in their recent releases. Although Spark 
> supports those two cluster managers, Spark itself is not aware of GPUs 
> exposed by them and hence Spark cannot properly request GPUs and schedule 
> them for users. This leaves a critical gap to unify big data and AI workloads 
> and make life simpler for end users.
> To make Spark be aware of GPUs, we shall make two major changes at high level:
> * At cluster manager level, we update or upgrade cluster managers to include 
> GPU support. Then we expose user interfaces for Spark to request GPUs from 
> them.
> * Within Spark, we update its scheduler to understand available GPUs 
> allocated to executors, user task requests, and assign GPUs to tasks properly.
> Based on the work done in YARN and Kubernetes to support GPUs and some 
> offline prototypes, we could have necessary features implemented in the next 
> major release of Spark. You can find a detailed scoping doc here, where we 
> listed user stories and their priorities.
> h2. Goals
> * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes.
> * No regression on scheduler performance for normal jobs.
> h2. Non-goals
> * Fine-grained scheduling within one GPU card.
> ** We treat one GPU card and its memory together as a non-divisible unit.
> * Support TPU.
> * Support Mesos.
> * Support Windows.
> h2. Target Personas
> * Admins who need to configure clusters to run Spark with GPU nodes.
> * Data scientists who need to build DL applications on Spark.
> * Developers who need to integrate DL features on Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark

2020-02-11 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034490#comment-17034490
 ] 

Thomas Graves commented on SPARK-24615:
---

This is purely a scheduling feature: Spark will assign GPUs to particular tasks. 
From there it's the user's responsibility to look at those assignments and do 
whatever they want with the GPUs, for instance passing them into TensorFlow on 
Spark or some other ML/AI framework.

Do you mean the actual Dataset operations using the GPU, such as 
df.join.groupby.filter?

That isn't supported inside Spark itself, nor is it part of this feature. In 
another Jira (SPARK-27396) we added support for a columnar plugin API in Spark, 
which would allow someone to write a plugin that runs such operations on the 
GPU. Nvidia is working on such a plugin, but it is not publicly available yet.
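
For completeness, a minimal sketch of how a task can read its GPU assignment 
once the scheduler has made it (assumes GPU resources were configured, e.g. via 
spark.task.resource.gpu.amount, and a DataFrame named df to iterate over):
{code:java}
import org.apache.spark.TaskContext

df.rdd.foreachPartition { _ =>
  // Addresses of the GPUs assigned to this task, e.g. Array("0").
  val gpuAddresses = TaskContext.get().resources()("gpu").addresses
  // Hand gpuAddresses to TensorFlow, RAPIDS, or any other framework.
}
{code}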

> SPIP: Accelerator-aware task scheduling for Spark
> -
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Thomas Graves
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, 
> SPIP_ Accelerator-aware scheduling.pdf
>
>
> (The JIRA received a major update on 2019/02/28. Some comments were based on 
> an earlier version. Please ignore them. New comments start at 
> [#comment-16778026].)
> h2. Background and Motivation
> GPUs and other accelerators have been widely used for accelerating special 
> workloads, e.g., deep learning and signal processing. While users from the AI 
> community use GPUs heavily, they often need Apache Spark to load and process 
> large datasets and to handle complex data scenarios like streaming. YARN and 
> Kubernetes already support GPUs in their recent releases. Although Spark 
> supports those two cluster managers, Spark itself is not aware of GPUs 
> exposed by them and hence Spark cannot properly request GPUs and schedule 
> them for users. This leaves a critical gap to unify big data and AI workloads 
> and make life simpler for end users.
> To make Spark be aware of GPUs, we shall make two major changes at high level:
> * At cluster manager level, we update or upgrade cluster managers to include 
> GPU support. Then we expose user interfaces for Spark to request GPUs from 
> them.
> * Within Spark, we update its scheduler to understand available GPUs 
> allocated to executors, user task requests, and assign GPUs to tasks properly.
> Based on the work done in YARN and Kubernetes to support GPUs and some 
> offline prototypes, we could have necessary features implemented in the next 
> major release of Spark. You can find a detailed scoping doc here, where we 
> listed user stories and their priorities.
> h2. Goals
> * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes.
> * No regression on scheduler performance for normal jobs.
> h2. Non-goals
> * Fine-grained scheduling within one GPU card.
> ** We treat one GPU card and its memory together as a non-divisible unit.
> * Support TPU.
> * Support Mesos.
> * Support Windows.
> h2. Target Personas
> * Admins who need to configure clusters to run Spark with GPU nodes.
> * Data scientists who need to build DL applications on Spark.
> * Developers who need to integrate DL features on Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27545) Update the Documentation for CACHE TABLE and UNCACHE TABLE

2020-02-11 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034479#comment-17034479
 ] 

Rakesh Raushan commented on SPARK-27545:


Please assign this to me. Thanks

> Update the Documentation for CACHE TABLE and UNCACHE TABLE
> --
>
> Key: SPARK-27545
> URL: https://issues.apache.org/jira/browse/SPARK-27545
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: hantiantian
>Assignee: hantiantian
>Priority: Major
> Fix For: 3.0.0
>
>
> spark-sql> cache table v1 as select * from a;
> spark-sql> uncache table v1;
> spark-sql> cache table v1 as select * from a;
> 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: 
> 0: get_table : db=apachespark tbl=a
> 2019-04-23 14:50:09,038 INFO 
> org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root 
> ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a
> Error in query: Temporary view 'v1' already exists;
> we should document it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30791) Dataframe add sameResult and sementicHash method

2020-02-11 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-30791:
---
Description: 
Sometimes, we want to check whether two dataframes are the same.

There is already an internal API like:
{code:java}
df1.queryExecution.logical.sameResult(...) {code}
We can make a public API for this:

Like:
{code:java}
df1.sameResult(df2) // return true if dataframe will return the same result
df1.semanticHash // return a semantic hashcode, if the two dataframes will 
return the same results, their semantic hashcodes should be the same.{code}
CC [~cloud_fan] [~mengxr] [~liangz]

 

  was:
Sometimes, we want to check whether two dataframe is the same.

There is already an internal API like:
{code:java}
df1.queryExecution.logical.sameResult(...) {code}
We can make a public API for this:

Like:
{code:java}
df1.sameResult(df2) // return true if dataframe will return the same result
df1.semanticHash // return a semantic hashcode, if the two dataframe will 
return the same result, their semantic hashcode should be the same.{code}
CC [~cloud_fan] [~mengxr] [~liangz]

 


> Dataframe add sameResult and sementicHash method
> 
>
> Key: SPARK-30791
> URL: https://issues.apache.org/jira/browse/SPARK-30791
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 3.0.0
>Reporter: Weichen Xu
>Assignee: Liang Zhang
>Priority: Major
>
> Sometimes, we want to check whether two dataframes are the same.
> There is already an internal API like:
> {code:java}
> df1.queryExecution.logical.sameResult(...) {code}
> We can make a public API for this:
> Like:
> {code:java}
> df1.sameResult(df2) // return true if dataframe will return the same result
> df1.semanticHash // return a semantic hashcode, if the two dataframes will 
> return the same results, their semantic hashcodes should be the same.{code}
> CC [~cloud_fan] [~mengxr] [~liangz]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30791) Dataframe add sameResult and sementicHash method

2020-02-11 Thread Weichen Xu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034475#comment-17034475
 ] 

Weichen Xu commented on SPARK-30791:


[~liangz] will work on this. :)

> Dataframe add sameResult and sementicHash method
> 
>
> Key: SPARK-30791
> URL: https://issues.apache.org/jira/browse/SPARK-30791
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 3.0.0
>Reporter: Weichen Xu
>Assignee: Liang Zhang
>Priority: Major
>
> Sometimes, we want to check whether two dataframe is the same.
> There is already an internal API like:
> {code:java}
> df1.queryExecution.logical.sameResult(...) {code}
> We can make a public API for this:
> Like:
> {code:java}
> df1.sameResult(df2) // return true if dataframe will return the same result
> df1.semanticHash // return a semantic hashcode, if the two dataframe will 
> return the same result, their semantic hashcode should be the same.{code}
> CC [~cloud_fan] [~mengxr] [~liangz]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30791) Dataframe add sameResult and sementicHash method

2020-02-11 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-30791:
--

Assignee: Liang Zhang

> Dataframe add sameResult and sementicHash method
> 
>
> Key: SPARK-30791
> URL: https://issues.apache.org/jira/browse/SPARK-30791
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 3.0.0
>Reporter: Weichen Xu
>Assignee: Liang Zhang
>Priority: Major
>
> Sometimes, we want to check whether two dataframe is the same.
> There is already an internal API like:
> {code:java}
> df1.queryExecution.logical.sameResult(...) {code}
> We can make a public API for this:
> Like:
> {code:java}
> df1.sameResult(df2) // return true if dataframe will return the same result
> df1.semanticHash // return a semantic hashcode, if the two dataframe will 
> return the same result, their semantic hashcode should be the same.{code}
> CC [~cloud_fan] [~mengxr] [~liangz]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30791) Dataframe add sameResult and sementicHash method

2020-02-11 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-30791:
--

 Summary: Dataframe add sameResult and sementicHash method
 Key: SPARK-30791
 URL: https://issues.apache.org/jira/browse/SPARK-30791
 Project: Spark
  Issue Type: New Feature
  Components: ML, SQL
Affects Versions: 3.0.0
Reporter: Weichen Xu


Sometimes, we want to check whether two dataframes are the same.

There is already an internal API like:
{code:java}
df1.queryExecution.logical.sameResult(...) {code}
We can make a public API for this:

Like:
{code:java}
df1.sameResult(df2) // returns true if the two dataframes will return the same result
df1.semanticHash // returns a semantic hashcode; if the two dataframes will 
return the same results, their semantic hashcodes should be the same.{code}
CC [~cloud_fan] [~mengxr] [~liangz]
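
Until a public API exists, the check can be approximated through the internal 
plan API mentioned above (not a stable interface; I'm assuming the QueryPlan 
sameResult/semanticHash methods, which may change between releases):
{code:java}
// Approximation using internal APIs; subject to change.
val same = df1.queryExecution.logical.sameResult(df2.queryExecution.logical)
val hash = df1.queryExecution.logical.semanticHash()
{code}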

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30790) The datatype of map() should be map

2020-02-11 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034461#comment-17034461
 ] 

Rakesh Raushan commented on SPARK-30790:


Should I expose a legacy configuration for the map type as well?

[~hyukjin.kwon]

> The datatype of map() should be map
> --
>
> Key: SPARK-30790
> URL: https://issues.apache.org/jira/browse/SPARK-30790
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Minor
>
> Currently ,
> spark.sql("select map()") gives {}.
> To be consistent with the changes made in SPARK-29462, it should return 
> map.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30790) The datatype of map() should be map

2020-02-11 Thread Rakesh Raushan (Jira)
Rakesh Raushan created SPARK-30790:
--

 Summary: The datatype of map() should be map
 Key: SPARK-30790
 URL: https://issues.apache.org/jira/browse/SPARK-30790
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Rakesh Raushan


Currently,

spark.sql("select map()") gives {}.

To be consistent with the changes made in SPARK-29462, it should return 
map.
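
For reference, the resulting type can be checked directly from the shell; a 
quick way to see what map() currently produces:
{code:java}
spark.sql("SELECT map()").printSchema()   // shows the key/value types currently inferred for map()
{code}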



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27545) Update the Documentation for CACHE TABLE and UNCACHE TABLE

2020-02-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-27545:

Summary: Update the Documentation for CACHE TABLE and UNCACHE TABLE  (was: 
Uncache table needs to delete the temporary view created when the cache table 
is executed.)

> Update the Documentation for CACHE TABLE and UNCACHE TABLE
> --
>
> Key: SPARK-27545
> URL: https://issues.apache.org/jira/browse/SPARK-27545
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: hantiantian
>Assignee: hantiantian
>Priority: Major
> Fix For: 3.0.0
>
>
> spark-sql> cache table v1 as select * from a;
> spark-sql> uncache table v1;
> spark-sql> cache table v1 as select * from a;
> 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: 
> 0: get_table : db=apachespark tbl=a
> 2019-04-23 14:50:09,038 INFO 
> org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root 
> ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a
> Error in query: Temporary view 'v1' already exists;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27545) Update the Documentation for CACHE TABLE and UNCACHE TABLE

2020-02-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-27545:

Issue Type: Documentation  (was: Bug)

> Update the Documentation for CACHE TABLE and UNCACHE TABLE
> --
>
> Key: SPARK-27545
> URL: https://issues.apache.org/jira/browse/SPARK-27545
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: hantiantian
>Assignee: hantiantian
>Priority: Major
> Fix For: 3.0.0
>
>
> spark-sql> cache table v1 as select * from a;
> spark-sql> uncache table v1;
> spark-sql> cache table v1 as select * from a;
> 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: 
> 0: get_table : db=apachespark tbl=a
> 2019-04-23 14:50:09,038 INFO 
> org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root 
> ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a
> Error in query: Temporary view 'v1' already exists;
> we should document it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27545) Update the Documentation for CACHE TABLE and UNCACHE TABLE

2020-02-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-27545:

Description: 
spark-sql> cache table v1 as select * from a;

spark-sql> uncache table v1;

spark-sql> cache table v1 as select * from a;

2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: 0: 
get_table : db=apachespark tbl=a
2019-04-23 14:50:09,038 INFO 
org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root 
ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a
Error in query: Temporary view 'v1' already exists;

we should document it.

  was:
spark-sql> cache table v1 as select * from a;

spark-sql> uncache table v1;

spark-sql> cache table v1 as select * from a;

2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: 0: 
get_table : db=apachespark tbl=a
2019-04-23 14:50:09,038 INFO 
org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root 
ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a
Error in query: Temporary view 'v1' already exists;


> Update the Documentation for CACHE TABLE and UNCACHE TABLE
> --
>
> Key: SPARK-27545
> URL: https://issues.apache.org/jira/browse/SPARK-27545
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: hantiantian
>Assignee: hantiantian
>Priority: Major
> Fix For: 3.0.0
>
>
> spark-sql> cache table v1 as select * from a;
> spark-sql> uncache table v1;
> spark-sql> cache table v1 as select * from a;
> 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: 
> 0: get_table : db=apachespark tbl=a
> 2019-04-23 14:50:09,038 INFO 
> org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root 
> ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a
> Error in query: Temporary view 'v1' already exists;
> we should document it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27545) Uncache table needs to delete the temporary view created when the cache table is executed.

2020-02-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27545.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27090
[https://github.com/apache/spark/pull/27090]

> Uncache table needs to delete the temporary view created when the cache table 
> is executed.
> --
>
> Key: SPARK-27545
> URL: https://issues.apache.org/jira/browse/SPARK-27545
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: hantiantian
>Assignee: hantiantian
>Priority: Major
> Fix For: 3.0.0
>
>
> spark-sql> cache table v1 as select * from a;
> spark-sql> uncache table v1;
> spark-sql> cache table v1 as select * from a;
> 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: 
> 0: get_table : db=apachespark tbl=a
> 2019-04-23 14:50:09,038 INFO 
> org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root 
> ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a
> Error in query: Temporary view 'v1' already exists;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27545) Uncache table needs to delete the temporary view created when the cache table is executed.

2020-02-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27545:
---

Assignee: hantiantian

> Uncache table needs to delete the temporary view created when the cache table 
> is executed.
> --
>
> Key: SPARK-27545
> URL: https://issues.apache.org/jira/browse/SPARK-27545
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: hantiantian
>Assignee: hantiantian
>Priority: Major
>
> spark-sql> cache table v1 as select * from a;
> spark-sql> uncache table v1;
> spark-sql> cache table v1 as select * from a;
> 2019-04-23 14:50:09,038 INFO org.apache.hadoop.hive.metastore.HiveMetaStore: 
> 0: get_table : db=apachespark tbl=a
> 2019-04-23 14:50:09,038 INFO 
> org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=root 
> ip=unknown-ip-addr cmd=get_table : db=apachespark tbl=a
> Error in query: Temporary view 'v1' already exists;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30326) Raise exception if analyzer exceeds max iterations

2020-02-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30326:
---

Assignee: Xin Wu

> Raise exception if analyzer exceeds max iterations
> -
>
> Key: SPARK-30326
> URL: https://issues.apache.org/jira/browse/SPARK-30326
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xin Wu
>Assignee: Xin Wu
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, both the analyzer and the optimizer just log a warning message if 
> rule execution exceeds the maximum number of iterations. They should behave 
> differently: the analyzer should raise an exception to indicate that the plan 
> did not reach a fixed point within the maximum iterations, while the optimizer 
> should just log a warning and keep the current plan. This is more feasible 
> after SPARK-30138 was introduced.
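
The distinction can be illustrated with a small, self-contained sketch. This is not Spark's actual RuleExecutor code, only the shape of the proposed behaviour: fail fast for the analyzer, warn and keep going for the optimizer.

{code:scala}
// Conceptual sketch: run a set of rules to a fixed point, bounded by maxIterations.
// failOnExceed = true models the analyzer (raise an exception),
// failOnExceed = false models the optimizer (log a warning, keep the current plan).
def runToFixedPoint[P](plan: P, rules: Seq[P => P], maxIterations: Int, failOnExceed: Boolean): P = {
  var current = plan
  var iterations = 0
  var changed = true
  while (changed && iterations < maxIterations) {
    val next = rules.foldLeft(current)((p, rule) => rule(p))
    changed = next != current
    current = next
    iterations += 1
  }
  if (changed) {
    val msg = s"Plan did not converge after $maxIterations iterations"
    if (failOnExceed) throw new RuntimeException(msg)        // analyzer behaviour
    else println(s"WARN: $msg; keeping the current plan")    // optimizer behaviour
  }
  current
}
{code}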



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30789) Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE

2020-02-11 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30789:
--

 Summary: Support (IGNORE | RESPECT) NULLS for 
LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE
 Key: SPARK-30789
 URL: https://issues.apache.org/jira/browse/SPARK-30789
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: jiaan.geng


All of LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE support IGNORE NULLS | RESPECT 
NULLS. For example:
{code:java}
LEAD (value_expr [, offset ])
[ IGNORE NULLS | RESPECT NULLS ]
OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ){code}
 
{code:java}
LAG (value_expr [, offset ])
[ IGNORE NULLS | RESPECT NULLS ]
OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ){code}
 
{code:java}
NTH_VALUE (expr, offset)
[ IGNORE NULLS | RESPECT NULLS ]
OVER
( [ PARTITION BY window_partition ]
[ ORDER BY window_ordering 
 frame_clause ] ){code}
 

*Oracle:*
[https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0]

*Redshift*
[https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html]

*Presto*
[https://prestodb.io/docs/current/functions/window.html]
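
Once implemented, a query could look like the hypothetical Scala sketch below. The {{quotes}} table and column names are made up, and the IGNORE NULLS modifier is not accepted by Spark's parser at the time of this issue; the sketch only illustrates the intended semantics (the comment notes what I believe is the closest existing alternative, the two-argument {{last}} function).

{code:scala}
// Hypothetical example: carry the last non-null price forward within each symbol.
spark.sql("""
  SELECT symbol, ts, price,
         LAST_VALUE(price) IGNORE NULLS OVER (
           PARTITION BY symbol ORDER BY ts
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
         ) AS last_known_price
  FROM quotes
""").show()
// Closest existing alternative today (no dedicated syntax):
//   LAST(price, true) OVER (PARTITION BY symbol ORDER BY ts ...)
{code}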

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30789) Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE

2020-02-11 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034347#comment-17034347
 ] 

jiaan.geng commented on SPARK-30789:


I will work on this.

> Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE
> -
>
> Key: SPARK-30789
> URL: https://issues.apache.org/jira/browse/SPARK-30789
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> All of LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE support IGNORE NULLS | 
> RESPECT NULLS. For example:
> {code:java}
> LEAD (value_expr [, offset ])
> [ IGNORE NULLS | RESPECT NULLS ]
> OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ){code}
>  
> {code:java}
> LAG (value_expr [, offset ])
> [ IGNORE NULLS | RESPECT NULLS ]
> OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering ){code}
>  
> {code:java}
> NTH_VALUE (expr, offset)
> [ IGNORE NULLS | RESPECT NULLS ]
> OVER
> ( [ PARTITION BY window_partition ]
> [ ORDER BY window_ordering 
>  frame_clause ] ){code}
>  
> *Oracle:*
> [https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0]
> *Redshift*
> [https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html]
> *Presto*
> [https://prestodb.io/docs/current/functions/window.html]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30786) Block replication is not retried on other BlockManagers when it fails on 1 of the peers

2020-02-11 Thread Prakhar Jain (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakhar Jain updated SPARK-30786:
-
Component/s: Spark Core

> Block replication is not retried on other BlockManagers when it fails on 1 of 
> the peers
> ---
>
> Key: SPARK-30786
> URL: https://issues.apache.org/jira/browse/SPARK-30786
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.4, 2.4.5, 3.0.0
>Reporter: Prakhar Jain
>Priority: Major
>
> When we cache an RDD with replication > 1, the RDD block is first cached 
> locally on one of the BlockManagers and then replicated to (replication - 1) 
> other BlockManagers. While replicating a block, if replication fails on one 
> of the peers, it is supposed to be retried on some other peer (based on the 
> "spark.storage.maxReplicationFailures" config). But currently this retry does 
> not happen.
> Logs of 1 of the executor which is trying to replicate:
> {noformat}
> 20/02/10 09:01:47 INFO Executor: Starting executor ID 1 on host 
> wn11-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net
> .
> .
> .
> 20/02/10 09:06:45 INFO Executor: Running task 244.0 in stage 3.0 (TID 550)
> 20/02/10 09:06:45 DEBUG BlockManager: Getting local block rdd_13_244
> 20/02/10 09:06:45 DEBUG BlockManager: Block rdd_13_244 was not found
> 20/02/10 09:06:45 DEBUG BlockManager: Getting remote block rdd_13_244
> 20/02/10 09:06:45 DEBUG BlockManager: Block rdd_13_244 not found
> 20/02/10 09:06:46 INFO MemoryStore: Block rdd_13_244 stored as values in 
> memory (estimated size 33.3 MB, free 44.2 MB)
> 20/02/10 09:06:46 DEBUG BlockManager: Told master about block rdd_13_244
> 20/02/10 09:06:46 DEBUG BlockManager: Put block rdd_13_244 locally took  947 
> ms
> 20/02/10 09:06:46 DEBUG BlockManager: Level for block rdd_13_244 is 
> StorageLevel(memory, deserialized, 3 replicas)
> 20/02/10 09:06:46 TRACE BlockManager: Trying to replicate rdd_13_244 of 
> 34908552 bytes to BlockManagerId(2, 
> wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None)
> 20/02/10 09:06:47 TRACE BlockManager: Replicated rdd_13_244 of 34908552 bytes 
> to BlockManagerId(2, 
> wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None) 
> in 205.849858 ms
> 20/02/10 09:06:47 TRACE BlockManager: Trying to replicate rdd_13_244 of 
> 34908552 bytes to BlockManagerId(5, 
> wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None)
> 20/02/10 09:06:47 TRACE BlockManager: Replicated rdd_13_244 of 34908552 bytes 
> to BlockManagerId(5, 
> wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None) 
> in 180.501504 ms
> 20/02/10 09:06:47 DEBUG BlockManager: Replicating rdd_13_244 of 34908552 
> bytes to 2 peer(s) took 387.381168 ms
> 20/02/10 09:06:47 DEBUG BlockManager: block rdd_13_244 replicated to 
> BlockManagerId(5, 
> wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None), 
> BlockManagerId(2, 
> wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None)
> 20/02/10 09:06:47 DEBUG BlockManager: Put block rdd_13_244 remotely took  423 
> ms
> 20/02/10 09:06:47 DEBUG BlockManager: Putting block rdd_13_244 with 
> replication took  1371 ms
> 20/02/10 09:06:47 DEBUG BlockManager: Getting local block rdd_13_244
> 20/02/10 09:06:47 DEBUG BlockManager: Level for block rdd_13_244 is 
> StorageLevel(memory, deserialized, 3 replicas)
> 20/02/10 09:06:47 INFO Executor: Finished task 244.0 in stage 3.0 (TID 550). 
> 2253 bytes result sent to driver
> {noformat}
> Logs of other executor where the block is being replicated to:
> {noformat}
> 20/02/10 09:01:47 INFO Executor: Starting executor ID 5 on host 
> wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net
> .
> .
> .
> 20/02/10 09:06:47 INFO MemoryStore: Will not store rdd_13_244
> 20/02/10 09:06:47 WARN MemoryStore: Not enough space to cache rdd_13_244 in 
> memory! (computed 4.2 MB so far)
> 20/02/10 09:06:47 INFO MemoryStore: Memory use = 4.9 GB (blocks) + 7.3 MB 
> (scratch space shared across 2 tasks(s)) = 4.9 GB. Storage limit = 4.9 GB.
> 20/02/10 09:06:47 DEBUG BlockManager: Put block rdd_13_244 locally took  12 ms
> 20/02/10 09:06:47 WARN BlockManager: Block rdd_13_244 could not be removed as 
> it was not found on disk or in memory
> 20/02/10 09:06:47 WARN BlockManager: Putting block rdd_13_244 failed
> 20/02/10 09:06:47 DEBUG BlockManager: Putting block rdd_13_244 without 
> replication took  13 ms
> {noformat}
> Note here that the block replication failed on Executor-5 with the log line 
> "Not enough space to cache rdd_13_244 in memory!", but Executor-1 shows that 
> the block was successfully replicated to executor-5 - "Repli
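
For reference, a minimal Scala sketch of the setup under discussion. The storage level and the retry config are the ones referenced in the description; the application name and data are arbitrary placeholders:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Cache an RDD with replication > 1 and allow replication to be retried on
// other peers when a peer rejects the block.
val spark = SparkSession.builder()
  .appName("replication-retry-sketch")
  .config("spark.storage.maxReplicationFailures", "3") // retries per block, as referenced above
  .getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 1000000)
// MEMORY_ONLY_2 keeps two replicas; the logs above use a 3-replica StorageLevel.
rdd.persist(StorageLevel.MEMORY_ONLY_2)
rdd.count() // materialize the RDD so its blocks are cached and replicated
{code}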

[jira] [Updated] (SPARK-30787) Add Genetic Algorithm optimizer feature to spark-ml

2020-02-11 Thread louischoi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

louischoi updated SPARK-30787:
--
Target Version/s:   (was: 2.4.5)

> Add Genetic Algorithm optimizer feature to spark-ml
> ---
>
> Key: SPARK-30787
> URL: https://issues.apache.org/jira/browse/SPARK-30787
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.4.5
>Reporter: louischoi
>Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> Hi. 
> It seems that Spark does not have a genetic algorithm optimizer.
> I think that this algorithm fits well in a distributed system like Spark.
> It is aimed at solving problems like the Traveling Salesman Problem, graph 
> partitioning, optimizing network topology, etc.
>  
> Is there some reason that Spark does not include this feature?
>  
> Can I work on this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30788) Support `SimpleDateFormat` and `FastDateFormat` as legacy date/timestamp formatters

2020-02-11 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30788:
--

 Summary: Support `SimpleDateFormat` and `FastDateFormat` as legacy 
date/timestamp formatters
 Key: SPARK-30788
 URL: https://issues.apache.org/jira/browse/SPARK-30788
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


To be absolutely sure that Spark 3.0 is compatible with 2.4 when 
spark.sql.legacy.timeParser.enabled is set to true, we need to support 
SimpleDateFormat and FastDateFormat as legacy parsers/formatters in 
TimestampFormatter. 

Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp 
strings:

# DateTimeFormat in the CSV/JSON datasources
# SimpleDateFormat - used in the JDBC datasource and in partition-value parsing.
# SimpleDateFormat in strict mode (lenient = false) - used by the 
date_format, from_unixtime, unix_timestamp and to_unix_timestamp functions.

Spark 3.0 should use the same parsers in those cases.
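
For reference, a minimal standalone Scala sketch of the two legacy JVM formatters named above. It only illustrates the formatters themselves (strict SimpleDateFormat vs. the thread-safe FastDateFormat from commons-lang3), not Spark's TimestampFormatter wiring:

{code:scala}
import java.text.SimpleDateFormat
import java.util.Locale
import org.apache.commons.lang3.time.FastDateFormat

// SimpleDateFormat with lenient = false rejects out-of-range fields.
val strict = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US)
strict.setLenient(false)
println(strict.parse("2019-04-23 14:50:09"))            // parses fine
// strict.parse("2019-04-32 14:50:09") would throw a ParseException in strict mode

// FastDateFormat is a thread-safe formatter with the same pattern syntax.
val fast = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss", Locale.US)
println(fast.format(strict.parse("2019-04-23 14:50:09")))
{code}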



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30787) Add Genetic Algorithm optimizer feature to spark-ml

2020-02-11 Thread louischoi (Jira)
louischoi created SPARK-30787:
-

 Summary: Add Genetic Algorithm optimizer feature to spark-ml
 Key: SPARK-30787
 URL: https://issues.apache.org/jira/browse/SPARK-30787
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 2.4.5
Reporter: louischoi


Hi. 

It seems that Spark does not have a genetic algorithm optimizer.

I think that this algorithm fits well in a distributed system like Spark.

It is aimed at solving problems like the Traveling Salesman Problem, graph 
partitioning, optimizing network topology, etc.

 

Is there some reason that Spark does not include this feature?

 

Can I work on this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark

2020-02-11 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034277#comment-17034277
 ] 

Jorge Machado commented on SPARK-24615:
---

[~tgraves] thanks for the input. It would be great to have one or two examples 
of how to use the GPUs within a dataset. 

I tried to figure out the API but I did not find any useful docs. Any tips?
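
For what it's worth, a minimal Scala sketch of reading GPU assignments from within a task using the Spark 3.0 resource API is shown below. The submit-time configs in the comments are assumptions about how the job is launched, not part of this JIRA; data and partition counts are arbitrary:

{code:scala}
import org.apache.spark.TaskContext

// Assumes the job was submitted with resource configs along these lines:
//   --conf spark.executor.resource.gpu.amount=1
//   --conf spark.task.resource.gpu.amount=1
//   --conf spark.executor.resource.gpu.discoveryScript=/opt/getGpus.sh
val df = spark.range(0, 100, 1, 4)
df.rdd.mapPartitions { _ =>
  // resources() exposes the accelerator addresses assigned to this task.
  val gpus = TaskContext.get().resources().get("gpu")
    .map(_.addresses).getOrElse(Array.empty[String])
  // Hand the assigned GPU ids to the DL framework of your choice here.
  Iterator.single(s"partition ${TaskContext.getPartitionId()} -> GPUs ${gpus.mkString(",")}")
}.collect().foreach(println)
{code}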

> SPIP: Accelerator-aware task scheduling for Spark
> -
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Thomas Graves
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, 
> SPIP_ Accelerator-aware scheduling.pdf
>
>
> (The JIRA received a major update on 2019/02/28. Some comments were based on 
> an earlier version. Please ignore them. New comments start at 
> [#comment-16778026].)
> h2. Background and Motivation
> GPUs and other accelerators have been widely used for accelerating special 
> workloads, e.g., deep learning and signal processing. While users from the AI 
> community use GPUs heavily, they often need Apache Spark to load and process 
> large datasets and to handle complex data scenarios like streaming. YARN and 
> Kubernetes already support GPUs in their recent releases. Although Spark 
> supports those two cluster managers, Spark itself is not aware of GPUs 
> exposed by them and hence Spark cannot properly request GPUs and schedule 
> them for users. This leaves a critical gap to unify big data and AI workloads 
> and make life simpler for end users.
> To make Spark be aware of GPUs, we shall make two major changes at high level:
> * At cluster manager level, we update or upgrade cluster managers to include 
> GPU support. Then we expose user interfaces for Spark to request GPUs from 
> them.
> * Within Spark, we update its scheduler to understand available GPUs 
> allocated to executors, user task requests, and assign GPUs to tasks properly.
> Based on the work done in YARN and Kubernetes to support GPUs and some 
> offline prototypes, we could have necessary features implemented in the next 
> major release of Spark. You can find a detailed scoping doc here, where we 
> listed user stories and their priorities.
> h2. Goals
> * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes.
> * No regression on scheduler performance for normal jobs.
> h2. Non-goals
> * Fine-grained scheduling within one GPU card.
> ** We treat one GPU card and its memory together as a non-divisible unit.
> * Support TPU.
> * Support Mesos.
> * Support Windows.
> h2. Target Personas
> * Admins who need to configure clusters to run Spark with GPU nodes.
> * Data scientists who need to build DL applications on Spark.
> * Developers who need to integrate DL features on Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29474) CLI support for Spark-on-Docker-on-Yarn

2020-02-11 Thread Abhijeet Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034244#comment-17034244
 ] 

Abhijeet Singh commented on SPARK-29474:


Thanks for this feature suggestion, [~adam.antal]. Is the docker image flag 
({{--docker-image}}) intended to support local/offline docker images (tar 
files)?

> CLI support for Spark-on-Docker-on-Yarn
> ---
>
> Key: SPARK-29474
> URL: https://issues.apache.org/jira/browse/SPARK-29474
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, YARN
>Affects Versions: 3.0.0
>Reporter: Adam Antal
>Priority: Major
>
> The Docker-on-Yarn feature is stable for a while now in Hadoop.
> One can run Spark on Docker using the Docker-on-Yarn feature by providing 
> runtime environments to the Spark AM and Executor containers similar to this:
> {noformat}
> --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker
> --conf 
> spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=repo/image:tag
> --conf 
> spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro"
> --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker
> --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=repo/image:tag
> --conf 
> spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro"
> {noformat}
> This is not very user friendly. I suggest adding CLI options to specify:
> - whether a docker image should be used ({{--docker}})
> - which docker image should be used ({{--docker-image}})
> - which docker mounts should be used ({{--docker-mounts}})
> for the AM and the executor containers separately (a sketch of the proposed 
> usage follows below).
> Let's discuss!
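
For illustration only, a hypothetical invocation with the proposed flags (none of these flags exist yet); it would replace the YARN_CONTAINER_RUNTIME_* confs shown above:

{noformat}
# Hypothetical: proposed flags from this suggestion, application jar is a placeholder.
spark-submit \
  --master yarn \
  --docker \
  --docker-image repo/image:tag \
  --docker-mounts "/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro" \
  --class com.example.MyApp myapp.jar
{noformat}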



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29462) The data type of "array()" should be array

2020-02-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29462.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27521
[https://github.com/apache/spark/pull/27521]

> The data type of "array()" should be array
> 
>
> Key: SPARK-29462
> URL: https://issues.apache.org/jira/browse/SPARK-29462
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> In the current implementation:
> > spark.sql("select array()")
> res0: org.apache.spark.sql.DataFrame = [array(): array]
> The output type should be array
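
A quick way to inspect the element type in question from a Spark shell (minimal sketch; the column alias is arbitrary):

{code:scala}
// Check what element type "array()" currently resolves to.
val df = spark.sql("SELECT array() AS a")
df.printSchema()                   // shows the element type of column `a`
println(df.schema("a").dataType)   // ArrayType(<element type>, containsNull)
{code}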



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29462) The data type of "array()" should be array

2020-02-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-29462:


Assignee: Hyukjin Kwon

> The data type of "array()" should be array
> 
>
> Key: SPARK-29462
> URL: https://issues.apache.org/jira/browse/SPARK-29462
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> In the current implementation:
> > spark.sql("select array()")
> res0: org.apache.spark.sql.DataFrame = [array(): array]
> The output type should be array



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org