[jira] [Commented] (SPARK-31517) SparkR::orderBy with multiple columns descending produces error

2020-04-27 Thread Michael Chirico (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094172#comment-17094172
 ] 

Michael Chirico commented on SPARK-31517:
-

> Error in ns[[i]] : subscript out of bounds

This error is coming from mutate

https://github.com/apache/spark/blob/2d3e9601b58fbe33aeedb106be7e2a1fafa2e1fd/R/pkg/R/DataFrame.R#L2294

> SparkR::orderBy with multiple columns descending produces error
> ---
>
> Key: SPARK-31517
> URL: https://issues.apache.org/jira/browse/SPARK-31517
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.5
> Environment: Databricks Runtime 6.5
>Reporter: Ross Bowen
>Priority: Major
>
> When specifying two columns within an `orderBy()` call, in an attempt to get 
> an ordering by two columns in descending order, an error is returned.
> {code:java}
> library(magrittr) 
> library(SparkR) 
> cars <- cbind(model = rownames(mtcars), mtcars) 
> carsDF <- createDataFrame(cars) 
> carsDF %>% 
>   mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
> desc(column("mpg")), desc(column("disp"))))) %>% 
>   head() {code}
> This returns an error:
> {code:java}
>  Error in ns[[i]] : subscript out of bounds{code}
> This seems to be related to the more general issue that the following code, 
> which excludes the use of the `desc()` function, also fails:
> {code:java}
> carsDF %>% 
>   mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), 
> column("mpg"), column("disp")))) %>% 
>   head(){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31583) grouping_id calculation should be improved

2020-04-27 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094153#comment-17094153
 ] 

Wenchen Fan commented on SPARK-31583:
-

cc [~maropu]

> grouping_id calculation should be improved
> --
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: Costas Piliotis
>Priority: Minor
>
> Unrelated to SPARK-21858, which identifies that grouping_id is determined by 
> exclusion from a grouping set rather than inclusion: when performing complex 
> grouping sets that are not in the order of the base select statement, 
> flipping the bit in the grouping_id seems to happen when the grouping set 
> is identified rather than when the columns are selected in the SQL. I will 
> of course use the exclusion strategy identified in SPARK-21858 as the 
> baseline for this.
>  
> {code:scala}
> import spark.implicits._
> val df= Seq(
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d"),
>  ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> expected to have these references in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I would have expected the excluded values one way, but I 
> received them excluded in the order they were first seen in the specified 
> grouping sets.
>  a,b,d included = excludes c = 2; expected gid=2, received gid=1
>  a,d included = excludes b=4, c=2; expected gid=6, received gid=5
> The column order that actually matches the returned grouping_id is (a,b,d,c):
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, 
> bin(grouping_id(a,b,d,c)) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
>  The columns forming the grouping_id seem to be created as the grouping sets 
> are identified, rather than by ordinal position in the parent query.
> I'd like to at least point out that grouping_id is documented in many other 
> RDBMSs, and I believe the Spark project should use a policy of flipping the 
> bits so that 1=inclusion; 0=exclusion in the grouping set.
> However, many RDBMSs that do have a grouping_id implement it by the ordinal 
> position of the fields in the select clause, rather than allocating the bits 
> as the columns are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31589) Use `r-lib/actions/setup-r` in GitHub Action

2020-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31589.
--
Fix Version/s: 3.0.0
 Assignee: Dongjoon Hyun
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/28382

> Use `r-lib/actions/setup-r` in GitHub Action
> 
>
> Key: SPARK-31589
> URL: https://issues.apache.org/jira/browse/SPARK-31589
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> `r-lib/actions/setup-r` is a more stable and maintained third-party action.
> I filed this issue as a `Bug` since the branch is currently broken.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31589) Use `r-lib/actions/setup-r` in GitHub Action

2020-04-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31589:
--
Description: 
`r-lib/actions/setup-r` is a more stable and maintained third-party action.

I filed this issue as a `Bug` since the branch is currently broken.

  was:`r-lib/actions/setup-r` is a more stable and maintained third-party action.


> Use `r-lib/actions/setup-r` in GitHub Action
> 
>
> Key: SPARK-31589
> URL: https://issues.apache.org/jira/browse/SPARK-31589
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `r-lib/actions/setup-r` is a more stable and maintained third-party action.
> I filed this issue as a `Bug` since the branch is currently broken.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31589) Use `r-lib/actions/setup-r` in GitHub Action

2020-04-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31589:
--
Issue Type: Bug  (was: Improvement)

> Use `r-lib/actions/setup-r` in GitHub Action
> 
>
> Key: SPARK-31589
> URL: https://issues.apache.org/jira/browse/SPARK-31589
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `r-lib/actions/setup-r` is a more stable and maintained third-party action.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31587) R installation in Github Actions is being failed

2020-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31587.
--
Resolution: Duplicate

> R installation in Github Actions is being failed
> 
>
> Key: SPARK-31587
> URL: https://issues.apache.org/jira/browse/SPARK-31587
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, R
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, R installation seems to be failing as below:
> {code}
> Get:61 https://dl.bintray.com/sbt/debian  Packages [4174 B]
> Get:62 http://ppa.launchpad.net/apt-fast/stable/ubuntu bionic/main amd64 
> Packages [532 B]
> Get:63 http://ppa.launchpad.net/git-core/ppa/ubuntu bionic/main amd64 
> Packages [3036 B]
> Get:64 http://ppa.launchpad.net/ondrej/php/ubuntu bionic/main amd64 Packages 
> [52.0 kB]
> Get:65 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main 
> amd64 Packages [33.9 kB]
> Get:66 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main 
> Translation-en [10.1 kB]
> Reading package lists...
> E: The repository 'https://cloud.r-project.org/bin/linux/ubuntu 
> bionic-cran35/ Release' does not have a Release file.
> ##[error]Process completed with exit code 100.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31590) The filter used by Metadata-only queries should not have Unevaluable

2020-04-27 Thread dzcxzl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-31590:
---
Description: 
When using SPARK-23877, some SQL executions fail with errors.

code:
{code:scala}
sql("set spark.sql.optimizer.metadataOnly=true")
sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET 
PARTITIONED BY (d ,h)")
sql("""
|INSERT OVERWRITE TABLE test_tbl PARTITION(d,h)
|SELECT 1,'2020-01-01','23'
|UNION ALL
|SELECT 2,'2020-01-02','01'
|UNION ALL
|SELECT 3,'2020-01-02','02'
""".stripMargin)
sql(
  s"""
 |SELECT d, MAX(h) AS h
 |FROM test_tbl
 |WHERE d= (
 |  SELECT MAX(d) AS d
 |  FROM test_tbl
 |)
 |GROUP BY d
""".stripMargin).collect()
{code}
Exception:
{code:java}
java.lang.UnsupportedOperationException: Cannot evaluate expression: 
scalar-subquery#48 []

...
at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.prunePartitions(PartitioningAwareFileIndex.scala:180)
{code}
optimizedPlan:
{code:java}
Aggregate [d#245], [d#245, max(h#246) AS h#243]
+- Project [d#245, h#246]
   +- Filter (isnotnull(d#245) AND (d#245 = scalar-subquery#242 []))
  :  +- Aggregate [max(d#245) AS d#241]
   : +- LocalRelation <empty>, [d#245]
  +- Relation[a#244,d#245,h#246] parquet
{code}

  was:
code:
{code:scala}
sql("set spark.sql.optimizer.metadataOnly=true")
sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET 
PARTITIONED BY (d ,h)")
sql("""
|INSERT OVERWRITE TABLE test_tbl PARTITION(d,h)
|SELECT 1,'2020-01-01','23'
|UNION ALL
|SELECT 2,'2020-01-02','01'
|UNION ALL
|SELECT 3,'2020-01-02','02'
""".stripMargin)
sql(
  s"""
 |SELECT d, MAX(h) AS h
 |FROM test_tbl
 |WHERE d= (
 |  SELECT MAX(d) AS d
 |  FROM test_tbl
 |)
 |GROUP BY d
""".stripMargin).collect()
{code}

Exception:
{code:java}
java.lang.UnsupportedOperationException: Cannot evaluate expression: 
scalar-subquery#48 []

...
at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.prunePartitions(PartitioningAwareFileIndex.scala:180)
{code}

optimizedPlan:
{code:java}
Aggregate [d#245], [d#245, max(h#246) AS h#243]
+- Project [d#245, h#246]
   +- Filter (isnotnull(d#245) AND (d#245 = scalar-subquery#242 []))
  :  +- Aggregate [max(d#245) AS d#241]
  : +- LocalRelation <empty>, [d#245]
  +- Relation[a#244,d#245,h#246] parquet
{code}





> The filter used by Metadata-only queries should not have Unevaluable
> 
>
> Key: SPARK-31590
> URL: https://issues.apache.org/jira/browse/SPARK-31590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dzcxzl
>Priority: Trivial
>
> When using SPARK-23877, some SQL executions fail with errors.
> code:
> {code:scala}
> sql("set spark.sql.optimizer.metadataOnly=true")
> sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET 
> PARTITIONED BY (d ,h)")
> sql("""
> |INSERT OVERWRITE TABLE test_tbl PARTITION(d,h)
> |SELECT 1,'2020-01-01','23'
> |UNION ALL
> |SELECT 2,'2020-01-02','01'
> |UNION ALL
> |SELECT 3,'2020-01-02','02'
> """.stripMargin)
> sql(
>   s"""
>  |SELECT d, MAX(h) AS h
>  |FROM test_tbl
>  |WHERE d= (
>  |  SELECT MAX(d) AS d
>  |  FROM test_tbl
>  |)
>  |GROUP BY d
> """.stripMargin).collect()
> {code}
> Exception:
> {code:java}
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#48 []
> ...
> at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.prunePartitions(PartitioningAwareFileIndex.scala:180)
> {code}
> optimizedPlan:
> {code:java}
> Aggregate [d#245], [d#245, max(h#246) AS h#243]
> +- Project [d#245, h#246]
>+- Filter (isnotnull(d#245) AND (d#245 = scalar-subquery#242 []))
>   :  +- Aggregate [max(d#245) AS d#241]
>   : +- LocalRelation <empty>, [d#245]
>   +- Relation[a#244,d#245,h#246] parquet
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31590) The filter used by Metadata-only queries should not have Unevaluable

2020-04-27 Thread dzcxzl (Jira)
dzcxzl created SPARK-31590:
--

 Summary: The filter used by Metadata-only queries should not have 
Unevaluable
 Key: SPARK-31590
 URL: https://issues.apache.org/jira/browse/SPARK-31590
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: dzcxzl


code:
{code:scala}
sql("set spark.sql.optimizer.metadataOnly=true")
sql("CREATE TABLE test_tbl (a INT,d STRING,h STRING) USING PARQUET 
PARTITIONED BY (d ,h)")
sql("""
|INSERT OVERWRITE TABLE test_tbl PARTITION(d,h)
|SELECT 1,'2020-01-01','23'
|UNION ALL
|SELECT 2,'2020-01-02','01'
|UNION ALL
|SELECT 3,'2020-01-02','02'
""".stripMargin)
sql(
  s"""
 |SELECT d, MAX(h) AS h
 |FROM test_tbl
 |WHERE d= (
 |  SELECT MAX(d) AS d
 |  FROM test_tbl
 |)
 |GROUP BY d
""".stripMargin).collect()
{code}

Exception:
{code:java}
java.lang.UnsupportedOperationException: Cannot evaluate expression: 
scalar-subquery#48 []

...
at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.prunePartitions(PartitioningAwareFileIndex.scala:180)
{code}

optimizedPlan:
{code:java}
Aggregate [d#245], [d#245, max(h#246) AS h#243]
+- Project [d#245, h#246]
   +- Filter (isnotnull(d#245) AND (d#245 = scalar-subquery#242 []))
  :  +- Aggregate [max(d#245) AS d#241]
  : +- LocalRelation <empty>, [d#245]
  +- Relation[a#244,d#245,h#246] parquet
{code}
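
A minimal sketch of the kind of guard the summary calls for, assuming the
Catalyst Expression.find API and the Unevaluable trait; this is illustrative
only, not the actual patch:
{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Expression, Unevaluable}

// Hedged sketch: only attempt metadata-only partition pruning when no partition
// filter still contains an Unevaluable expression (for example the uncollapsed
// scalar subquery visible in the optimized plan above).
def safeForMetadataOnly(partitionFilters: Seq[Expression]): Boolean =
  partitionFilters.forall(f => f.find(_.isInstanceOf[Unevaluable]).isEmpty)
{code}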






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31589) Use `r-lib/actions/setup-r` in GitHub Action

2020-04-27 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31589:
-

 Summary: Use `r-lib/actions/setup-r` in GitHub Action
 Key: SPARK-31589
 URL: https://issues.apache.org/jira/browse/SPARK-31589
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 2.4.5, 3.0.0, 3.1.0
Reporter: Dongjoon Hyun


`r-lib/actions/setup-r` is a more stable and maintained third-party action.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31588) merge small files may need more common setting

2020-04-27 Thread philipse (Jira)
philipse created SPARK-31588:


 Summary: merge small files may need more common setting
 Key: SPARK-31588
 URL: https://issues.apache.org/jira/browse/SPARK-31588
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.5
 Environment: spark:2.4.5

hdp:2.7
Reporter: philipse


Hi,

Spark SQL now allows us to use repartition or coalesce hints to manually control 
small files, like the following:

/*+ REPARTITION(1) */

/*+ COALESCE(1) */

But it can only be tuned case by case: we need to decide whether to use COALESCE 
or REPARTITION. Can we try a more general way that removes this decision by 
setting a target output size, as Hive did?
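
For reference, a usage sketch of the two existing hints mentioned above (both
hints exist in Spark SQL 2.4+); the table names are placeholders:
{code:scala}
// Rewrite the output into a single file by hinting the final SELECT.
// REPARTITION triggers a shuffle; COALESCE only merges existing partitions.
spark.sql("INSERT OVERWRITE TABLE target_tbl SELECT /*+ REPARTITION(1) */ * FROM source_tbl")
spark.sql("INSERT OVERWRITE TABLE target_tbl SELECT /*+ COALESCE(1) */ * FROM source_tbl")
{code}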

*Good points:*

1) we will also get the new partition number automatically

2) with an on/off parameter provided, users can disable it if needed

3) the parameter can be set at the cluster level instead of on the user side, 
which makes it easier to control small files

4) it greatly reduces the pressure on the NameNode

 

*Not good points:*

1) it will add a new task to calculate the target partition numbers from 
statistics on the output files

 

I don't know whether this is already planned for the future.

 

Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31587) R installation in Github Actions is being failed

2020-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31587:
-
Summary: R installation in Github Actions is being failed  (was: Fixes the 
repository for downloading R in Github Actions)

> R installation in Github Actions is being failed
> 
>
> Key: SPARK-31587
> URL: https://issues.apache.org/jira/browse/SPARK-31587
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, R
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, R installation seems to be failing as below:
> {code}
> Get:61 https://dl.bintray.com/sbt/debian  Packages [4174 B]
> Get:62 http://ppa.launchpad.net/apt-fast/stable/ubuntu bionic/main amd64 
> Packages [532 B]
> Get:63 http://ppa.launchpad.net/git-core/ppa/ubuntu bionic/main amd64 
> Packages [3036 B]
> Get:64 http://ppa.launchpad.net/ondrej/php/ubuntu bionic/main amd64 Packages 
> [52.0 kB]
> Get:65 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main 
> amd64 Packages [33.9 kB]
> Get:66 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main 
> Translation-en [10.1 kB]
> Reading package lists...
> E: The repository 'https://cloud.r-project.org/bin/linux/ubuntu 
> bionic-cran35/ Release' does not have a Release file.
> ##[error]Process completed with exit code 100.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31587) Fixes the repository for downloading R in Github Actions

2020-04-27 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-31587:


 Summary: Fixes the repository for downloading R in Github Actions
 Key: SPARK-31587
 URL: https://issues.apache.org/jira/browse/SPARK-31587
 Project: Spark
  Issue Type: Bug
  Components: Project Infra, R
Affects Versions: 2.4.5, 3.0.0
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon


Currently, R installation seems to be failing as below:

{code}
Get:61 https://dl.bintray.com/sbt/debian  Packages [4174 B]
Get:62 http://ppa.launchpad.net/apt-fast/stable/ubuntu bionic/main amd64 
Packages [532 B]
Get:63 http://ppa.launchpad.net/git-core/ppa/ubuntu bionic/main amd64 Packages 
[3036 B]
Get:64 http://ppa.launchpad.net/ondrej/php/ubuntu bionic/main amd64 Packages 
[52.0 kB]
Get:65 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main 
amd64 Packages [33.9 kB]
Get:66 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main 
Translation-en [10.1 kB]
Reading package lists...
E: The repository 'https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ 
Release' does not have a Release file.
##[error]Process completed with exit code 100.
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31585) Support Z-order curve

2020-04-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31585:

Description: 
Z-ordering is a technique that allows you to map multidimensional data to a 
single dimension.  We can use this feature to improve query performance. 

More details:
https://en.wikipedia.org/wiki/Z-order_curve
https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/

Benchmark result:
These 2 tables are ordered and z-ordered by AUCT_END_DT, AUCT_START_DT.
Filter on the AUCT_START_DT column:
 !zorder.png! 
 !lexicalorder.png! 
Filter on the auct_end_dt column:
 !zorder-AUCT_END_DT.png! 
 !lexicalorder-AUCT_END_DT.png! 

  was:
Z-ordering is a technique that allows you to map multidimensional data to a 
single dimension.  We can use this feature to improve query performance. 

More details:
https://en.wikipedia.org/wiki/Z-order_curve
https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/



> Support Z-order curve
> -
>
> Key: SPARK-31585
> URL: https://issues.apache.org/jira/browse/SPARK-31585
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: lexicalorder-AUCT_END_DT.png, lexicalorder.png, 
> zorder-AUCT_END_DT.png, zorder.png
>
>
> Z-ordering is a technique that allows you to map multidimensional data to a 
> single dimension.  We can use this feature to improve query performance. 
> More details:
> https://en.wikipedia.org/wiki/Z-order_curve
> https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/
> Benchmark result:
> These 2 tables are ordered and z-ordered by AUCT_END_DT, AUCT_START_DT.
> Filter on the AUCT_START_DT column:
>  !zorder.png! 
>  !lexicalorder.png! 
> Filter on the auct_end_dt column:
>  !zorder-AUCT_END_DT.png! 
>  !lexicalorder-AUCT_END_DT.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31585) Support Z-order curve

2020-04-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31585:

Attachment: lexicalorder-AUCT_END_DT.png

> Support Z-order curve
> -
>
> Key: SPARK-31585
> URL: https://issues.apache.org/jira/browse/SPARK-31585
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: lexicalorder-AUCT_END_DT.png, lexicalorder.png, 
> zorder-AUCT_END_DT.png, zorder.png
>
>
> Z-ordering is a technique that allows you to map multidimensional data to a 
> single dimension.  We can use this feature to improve query performance. 
> More details:
> https://en.wikipedia.org/wiki/Z-order_curve
> https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31586) Replace expression TimeSub(l, r) with TimeAdd(l, -r)

2020-04-27 Thread Kent Yao (Jira)
Kent Yao created SPARK-31586:


 Summary: Replace expression TimeSub(l, r) with TimeAdd(l, -r)
 Key: SPARK-31586
 URL: https://issues.apache.org/jira/browse/SPARK-31586
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kent Yao


The implementation of TimeSub for the operation of subtracting an interval from 
a timestamp largely duplicates TimeAdd. We can replace it with TimeAdd(l, -r) 
since they are equivalent.

Suggestion from https://github.com/apache/spark/pull/28310#discussion_r414259239
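
A minimal sketch of the equivalence, assuming the existing Catalyst expressions
TimeAdd(start, interval) and UnaryMinus(child); where such a rewrite would
actually be wired in is an assumption, not part of this ticket:
{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Expression, TimeAdd, UnaryMinus}

// "ts - interval" can be constructed as "ts + (-interval)", so TimeAdd can
// cover both cases and TimeSub can be dropped.
def subtractInterval(ts: Expression, interval: Expression): Expression =
  TimeAdd(ts, UnaryMinus(interval))
{code}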



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31585) Support Z-order curve

2020-04-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31585:

Attachment: zorder-AUCT_END_DT.png

> Support Z-order curve
> -
>
> Key: SPARK-31585
> URL: https://issues.apache.org/jira/browse/SPARK-31585
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: lexicalorder.png, zorder-AUCT_END_DT.png, zorder.png
>
>
> Z-ordering is a technique that allows you to map multidimensional data to a 
> single dimension.  We can use this feature to improve query performance. 
> More details:
> https://en.wikipedia.org/wiki/Z-order_curve
> https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31585) Support Z-order curve

2020-04-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31585:

Attachment: lexicalorder.png

> Support Z-order curve
> -
>
> Key: SPARK-31585
> URL: https://issues.apache.org/jira/browse/SPARK-31585
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: lexicalorder.png, zorder.png
>
>
> Z-ordering is a technique that allows you to map multidimensional data to a 
> single dimension.  We can use this feature to improve query performance. 
> More details:
> https://en.wikipedia.org/wiki/Z-order_curve
> https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31585) Support Z-order curve

2020-04-27 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31585:

Attachment: zorder.png

> Support Z-order curve
> -
>
> Key: SPARK-31585
> URL: https://issues.apache.org/jira/browse/SPARK-31585
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: zorder.png
>
>
> Z-ordering is a technique that allows you to map multidimensional data to a 
> single dimension.  We can use this feature to improve query performance. 
> More details:
> https://en.wikipedia.org/wiki/Z-order_curve
> https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31581) paste0 is always better than paste(sep="")

2020-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31581:
-
Fix Version/s: (was: 2.4.6)

> paste0 is always better than paste(sep="")
> --
>
> Key: SPARK-31581
> URL: https://issues.apache.org/jira/browse/SPARK-31581
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 2.4.5
>Reporter: Michael Chirico
>Assignee: Michael Chirico
>Priority: Minor
> Fix For: 3.0.0
>
>
> paste0 is available in the stated R dependency (3.1), so we should use it 
> instead of paste(sep="")



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31585) Support Z-order curve

2020-04-27 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-31585:
---

 Summary: Support Z-order curve
 Key: SPARK-31585
 URL: https://issues.apache.org/jira/browse/SPARK-31585
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


Z-ordering is a technique that allows you to map multidimensional data to a 
single dimension.  We can use this feature to improve query performance. 

More details:
https://en.wikipedia.org/wiki/Z-order_curve
https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/
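
As a minimal illustration of what the mapping does, here is the standard
Morton/bit-interleaving construction for two 32-bit keys; this is illustrative
only, not an existing Spark API:
{code:scala}
// Interleave the bits of x and y so that nearby (x, y) pairs tend to stay
// nearby in the resulting one-dimensional key.
def zOrder(x: Int, y: Int): Long = {
  var key = 0L
  var i = 0
  while (i < 32) {
    key |= ((x.toLong >>> i) & 1L) << (2 * i)      // even bits come from x
    key |= ((y.toLong >>> i) & 1L) << (2 * i + 1)  // odd bits come from y
    i += 1
  }
  key
}
{code}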




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31578) Internal checkSchemaInArrow is inefficient

2020-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31578:


Assignee: Michael Chirico

> Internal checkSchemaInArrow is inefficient
> --
>
> Key: SPARK-31578
> URL: https://issues.apache.org/jira/browse/SPARK-31578
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Michael Chirico
>Assignee: Michael Chirico
>Priority: Minor
>
> Current implementation is doubly inefficient:
> Repeatedly doing the same (95%) sapply loop
> Doing scalar == on a vector (== should be done over the whole vector for 
> efficiency)
> See existing PR:
> https://github.com/apache/spark/pull/28372



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31578) Internal checkSchemaInArrow is inefficient

2020-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31578.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28372
[https://github.com/apache/spark/pull/28372]

> Internal checkSchemaInArrow is inefficient
> --
>
> Key: SPARK-31578
> URL: https://issues.apache.org/jira/browse/SPARK-31578
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Michael Chirico
>Assignee: Michael Chirico
>Priority: Minor
> Fix For: 3.0.0
>
>
> Current implementation is doubly inefficient:
> Repeatedly doing the same (95%) sapply loop
> Doing scalar == on a vector (== should be done over the whole vector for 
> efficiency)
> See existing PR:
> https://github.com/apache/spark/pull/28372



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31578) Internal checkSchemaInArrow is inefficient

2020-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31578:
-
Affects Version/s: (was: 2.4.5)
   3.0.0

> Internal checkSchemaInArrow is inefficient
> --
>
> Key: SPARK-31578
> URL: https://issues.apache.org/jira/browse/SPARK-31578
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Michael Chirico
>Priority: Minor
>
> Current implementation is doubly inefficient:
> Repeatedly doing the same (95%) sapply loop
> Doing scalar == on a vector (== should be done over the whole vector for 
> efficiency)
> See existing PR:
> https://github.com/apache/spark/pull/28372



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31572) Improve task logs at executor side

2020-04-27 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi resolved SPARK-31572.
--
Resolution: Won't Fix

We shall use debug-level logging to improve the logs instead.

> Improve task logs at executor side
> --
>
> Key: SPARK-31572
> URL: https://issues.apache.org/jira/browse/SPARK-31572
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> In some places, task names have different formats between the driver and the 
> executor, which makes it harder for users to debug task-level slowness. We 
> can also add more logs to help with debugging.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31581) paste0 is always better than paste(sep="")

2020-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31581.
--
Fix Version/s: 3.0.0
   2.4.6
 Assignee: Michael Chirico
   Resolution: Fixed

> paste0 is always better than paste(sep="")
> --
>
> Key: SPARK-31581
> URL: https://issues.apache.org/jira/browse/SPARK-31581
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 2.4.5
>Reporter: Michael Chirico
>Assignee: Michael Chirico
>Priority: Minor
> Fix For: 2.4.6, 3.0.0
>
>
> paste0 is available in the stated R dependency (3.1), so we should use it 
> instead of paste(sep="")



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31581) paste0 is always better than paste(sep="")

2020-04-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094086#comment-17094086
 ] 

Hyukjin Kwon commented on SPARK-31581:
--

Fixed in https://github.com/apache/spark/pull/28374

> paste0 is always better than paste(sep="")
> --
>
> Key: SPARK-31581
> URL: https://issues.apache.org/jira/browse/SPARK-31581
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 2.4.5
>Reporter: Michael Chirico
>Priority: Minor
>
> paste0 is available in the stated R dependency (3.1), so we should use it 
> instead of paste(sep="")



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31580) Upgrade Apache ORC to 1.5.10

2020-04-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31580.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28373
[https://github.com/apache/spark/pull/28373]

> Upgrade Apache ORC to 1.5.10
> 
>
> Key: SPARK-31580
> URL: https://issues.apache.org/jira/browse/SPARK-31580
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31580) Upgrade Apache ORC to 1.5.10

2020-04-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31580:
-

Assignee: Dongjoon Hyun

> Upgrade Apache ORC to 1.5.10
> 
>
> Key: SPARK-31580
> URL: https://issues.apache.org/jira/browse/SPARK-31580
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31538) Backport SPARK-25338 Ensure to call super.beforeAll() and super.afterAll() in test cases

2020-04-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094081#comment-17094081
 ] 

Hyukjin Kwon commented on SPARK-31538:
--

Okay, I understood the intention, but Spark 3.0.0 is still an unreleased version. 
I am okay with porting this back - I just wanted to ask, out of curiosity, why 
you picked this one up.

> Backport SPARK-25338   Ensure to call super.beforeAll() and 
> super.afterAll() in test cases
> --
>
> Key: SPARK-31538
> URL: https://issues.apache.org/jira/browse/SPARK-31538
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-25338       Ensure to call super.beforeAll() and 
> super.afterAll() in test cases



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31584) NullPointerException when parsing event log with InMemoryStore

2020-04-27 Thread Baohe Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Baohe Zhang updated SPARK-31584:

Attachment: errorstack.txt

> NullPointerException when parsing event log with InMemoryStore
> --
>
> Key: SPARK-31584
> URL: https://issues.apache.org/jira/browse/SPARK-31584
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: Baohe Zhang
>Priority: Minor
> Fix For: 3.0.1
>
> Attachments: errorstack.txt
>
>
> I compiled the current branch-3.0 source and tested it on macOS. A 
> java.lang.NullPointerException will be thrown when the conditions below are met: 
>  # Using InMemoryStore as the kvstore when parsing the event log file (e.g., 
> when spark.history.store.path is unset). 
>  # At least one stage in this event log has more tasks than 
> spark.ui.retainedTasks (by default 10). In this case, the kvstore needs to 
> delete extra task records.
>  # The job has more than one stage, so parentToChildrenMap in 
> InMemoryStore.java will have more than one key.
> The java.lang.NullPointerException is thrown at InMemoryStore.java:296, in 
> the method deleteParentIndex().
> {code:java}
> private void deleteParentIndex(Object key) {
>   if (hasNaturalParentIndex) {
> for (NaturalKeys v : parentToChildrenMap.values()) {
>   if (v.remove(asKey(key))) {
> // `v` can be empty after removing the natural key and we can 
> remove it from
> // `parentToChildrenMap`. However, `parentToChildrenMap` is a 
> ConcurrentMap and such
> // checking and deleting can be slow.
> // This method is to delete one object with certain key, let's 
> make it simple here.
> break;
>   }
> }
>   }
> }{code}
> In "if (v.remove(asKey(key)))", if the key is not contained in v, 
> "v.remove(asKey(key))" will return null, and Java will throw a 
> NullPointerException when unboxing that null Boolean in the if condition.
> An exception stack trace is attached.
> This issue can be fixed by updating if statement to
> {code:java}
> if (v.remove(asKey(key)) != null){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31584) NullPointerException when parsing event log with InMemoryStore

2020-04-27 Thread Baohe Zhang (Jira)
Baohe Zhang created SPARK-31584:
---

 Summary: NullPointerException when parsing event log with 
InMemoryStore
 Key: SPARK-31584
 URL: https://issues.apache.org/jira/browse/SPARK-31584
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.0.1
Reporter: Baohe Zhang
 Fix For: 3.0.1


I compiled the current branch-3.0 source and tested it on macOS. A 
java.lang.NullPointerException will be thrown when the conditions below are met: 
 # Using InMemoryStore as the kvstore when parsing the event log file (e.g., 
when spark.history.store.path is unset). 
 # At least one stage in this event log has more tasks than 
spark.ui.retainedTasks (by default 10). In this case, the kvstore needs to 
delete extra task records.
 # The job has more than one stage, so parentToChildrenMap in 
InMemoryStore.java will have more than one key.

The java.lang.NullPointerException is thrown at InMemoryStore.java:296, in the 
method deleteParentIndex().
{code:java}
private void deleteParentIndex(Object key) {
  if (hasNaturalParentIndex) {
for (NaturalKeys v : parentToChildrenMap.values()) {
  if (v.remove(asKey(key))) {
// `v` can be empty after removing the natural key and we can 
remove it from
// `parentToChildrenMap`. However, `parentToChildrenMap` is a 
ConcurrentMap and such
// checking and deleting can be slow.
// This method is to delete one object with certain key, let's make 
it simple here.
break;
  }
}
  }
}{code}
In "if (v.remove(asKey(key)))", if the key is not contained in v, 
"v.remove(asKey(key))" will return null, and Java will throw a 
NullPointerException when unboxing that null Boolean in the if condition.

An exception stack trace is attached.

This issue can be fixed by updating if statement to
{code:java}
if (v.remove(asKey(key)) != null){code}
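
To illustrate the unboxing behaviour described above with plain JDK collections
(nothing Spark-specific, just a sketch):
{code:scala}
import java.util.concurrent.ConcurrentHashMap

val m = new ConcurrentHashMap[String, java.lang.Boolean]()
// remove() returns null when the key is absent; using that result directly as
// a boolean condition unboxes null and throws a NullPointerException.
// if (m.remove("missing")) { ... }          // NPE at runtime
if (m.remove("missing") != null) { /* key was present */ }
{code}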



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31525) Inconsistent result of df.head(1) and df.head()

2020-04-27 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094028#comment-17094028
 ] 

Holden Karau commented on SPARK-31525:
--

I agree it's inconsistent, and the docs are a little misleading. I think the 
root cause is that we're using head as both `peek` and `take`, which is why we've 
got mixed metaphors. cc [~davies], who worked on this code most recently (2015), 
for his thoughts.

> Inconsistent result of df.head(1) and df.head()
> ---
>
> Key: SPARK-31525
> URL: https://issues.apache.org/jira/browse/SPARK-31525
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Joshua Hendinata
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In this line 
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339],
>  if you call `df.head()` and the dataframe is empty, it will return *None*,
> but if you call `df.head(1)` on an empty dataframe, it will return an 
> *empty list* instead.
> This particular behaviour is not consistent and can create confusion, 
> especially when you call `len(df.head())`, which will throw an 
> exception for an empty dataframe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31577) fix various problems when check name conflicts of CTE relations

2020-04-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31577.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/28371

> fix various problems when check name conflicts of CTE relations
> ---
>
> Key: SPARK-31577
> URL: https://issues.apache.org/jira/browse/SPARK-31577
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31536) Backport SPARK-25407 Allow nested access for non-existent field for Parquet file when nested pruning is enabled

2020-04-27 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-31536.
--
Resolution: Won't Fix

> Backport SPARK-25407   Allow nested access for non-existent field for 
> Parquet file when nested pruning is enabled
> -
>
> Key: SPARK-31536
> URL: https://issues.apache.org/jira/browse/SPARK-31536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Consider backporting SPARK-25407       Allow nested access for non-existent 
> field for Parquet file when nested pruning is enabled to 2.4.6



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31536) Backport SPARK-25407 Allow nested access for non-existent field for Parquet file when nested pruning is enabled

2020-04-27 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094024#comment-17094024
 ] 

Holden Karau commented on SPARK-31536:
--

If the code path is already disabled by default then yeah let's skip the 
backport.

> Backport SPARK-25407   Allow nested access for non-existent field for 
> Parquet file when nested pruning is enabled
> -
>
> Key: SPARK-31536
> URL: https://issues.apache.org/jira/browse/SPARK-31536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Consider backporting SPARK-25407       Allow nested access for non-existent 
> field for Parquet file when nested pruning is enabled to 2.4.6



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31537) Backport SPARK-25559 Remove the unsupported predicates in Parquet when possible

2020-04-27 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-31537.
--
Resolution: Won't Fix

per [~hyukjin.kwon]

> Backport SPARK-25559  Remove the unsupported predicates in Parquet when 
> possible
> 
>
> Key: SPARK-31537
> URL: https://issues.apache.org/jira/browse/SPARK-31537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Assignee: DB Tsai
>Priority: Major
>
> Consider backporting SPARK-25559       Remove the unsupported predicates in 
> Parquet when possible to 2.4.6



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31538) Backport SPARK-25338 Ensure to call super.beforeAll() and super.afterAll() in test cases

2020-04-27 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094022#comment-17094022
 ] 

Holden Karau commented on SPARK-31538:
--

[~hyukjin.kwon] if the Jira is already closed, setting the fix version implies 
it's already fixed for that version.

> Backport SPARK-25338   Ensure to call super.beforeAll() and 
> super.afterAll() in test cases
> --
>
> Key: SPARK-31538
> URL: https://issues.apache.org/jira/browse/SPARK-31538
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-25338       Ensure to call super.beforeAll() and 
> super.afterAll() in test cases



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31545) Backport SPARK-27676 InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles

2020-04-27 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094020#comment-17094020
 ] 

Holden Karau commented on SPARK-31545:
--

Sounds reasonable, I'll close this as won't fix.

> Backport SPARK-27676   InMemoryFileIndex should respect 
> spark.sql.files.ignoreMissingFiles
> --
>
> Key: SPARK-31545
> URL: https://issues.apache.org/jira/browse/SPARK-31545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-27676       InMemoryFileIndex should respect 
> spark.sql.files.ignoreMissingFiles
> cc [~joshrosen] I think backporting this has been asked in the original 
> ticket, do you have any objections?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31545) Backport SPARK-27676 InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles

2020-04-27 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-31545.
--
Resolution: Won't Fix

> Backport SPARK-27676   InMemoryFileIndex should respect 
> spark.sql.files.ignoreMissingFiles
> --
>
> Key: SPARK-31545
> URL: https://issues.apache.org/jira/browse/SPARK-31545
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Holden Karau
>Priority: Major
>
> Backport SPARK-27676       InMemoryFileIndex should respect 
> spark.sql.files.ignoreMissingFiles
> cc [~joshrosen] I think backporting this has been asked in the original 
> ticket, do you have any objections?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31583) grouping_id calculation should be improved

2020-04-27 Thread Costas Piliotis (Jira)
Costas Piliotis created SPARK-31583:
---

 Summary: grouping_id calculation should be improved
 Key: SPARK-31583
 URL: https://issues.apache.org/jira/browse/SPARK-31583
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.5
Reporter: Costas Piliotis


Unrelated to SPARK-21858, which identifies that grouping_id is determined by 
exclusion from a grouping set rather than inclusion: when performing complex 
grouping sets that are not in the order of the base select statement, flipping 
the bit in the grouping_id seems to happen when the grouping set is 
identified rather than when the columns are selected in the SQL. I will of 
course use the exclusion strategy identified in SPARK-21858 as the baseline for 
this.

 
{code:scala}
import spark.implicits._
val df= Seq(
 ("a","b","c","d"),
 ("a","b","c","d"),
 ("a","b","c","d"),
 ("a","b","c","d")
).toDF("a","b","c","d").createOrReplaceTempView("abc")
{code}
expected to have these references in the grouping_id:
 d=1
 c=2
 b=4
 a=8
{code:scala}
spark.sql("""
 select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
 from abc
 group by GROUPING SETS (
 (),
 (a,b,d),
 (a,c),
 (a,d)
 )
 """).show(false)
{code}
This returns:
{noformat}
+----+----+----+----+--------+---+-------+
|a   |b   |c   |d   |count(1)|gid|gid_bin|
+----+----+----+----+--------+---+-------+
|a   |null|c   |null|4       |6  |110    |
|null|null|null|null|4       |15 |1111   |
|a   |null|null|d   |4       |5  |101    |
|a   |b   |null|d   |4       |1  |1      |
+----+----+----+----+--------+---+-------+
{noformat}
 

 In other words, I would have expected the excluded values one way, but I 
received them excluded in the order they were first seen in the specified 
grouping sets.

 a,b,d included = excludes c = 2; expected gid=2, received gid=1
 a,d included = excludes b=4, c=2; expected gid=6, received gid=5
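
To make the mismatch concrete, a worked tally of the expected bits under the
select-list ordering assumed above (a=8, b=4, c=2, d=1; a set bit means the
column is excluded). This is illustrative arithmetic only, not Spark code:
{code:scala}
// Expected grouping_id per grouping set, using select-list ordinal bits.
val expectedGid = Map(
  Seq("a", "b", "d") -> 2,  // only c excluded      -> 2
  Seq("a", "c")      -> 5,  // b and d excluded     -> 4 + 1
  Seq("a", "d")      -> 6,  // b and c excluded     -> 4 + 2
  Seq()              -> 15  // everything excluded  -> 8 + 4 + 2 + 1
)
// Observed instead: bits are assigned in the order the columns first appear
// across the grouping sets (a=8, b=4, d=2, c=1), giving 1, 6, 5 and 15.
{code}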

The column order that actually matches the returned grouping_id is (a,b,d,c):

{code:scala}
spark.sql("""
 select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, 
bin(grouping_id(a,b,d,c)) as gid_bin
 from abc
 group by GROUPING SETS (
 (),
 (a,b,d),
 (a,c),
 (a,d)
 )
 """).show(false)
{code}


 The columns forming the grouping_id seem to be created as the grouping sets are 
identified, rather than by ordinal position in the parent query.

I'd like to at least point out that grouping_id is documented in many other 
RDBMSs, and I believe the Spark project should use a policy of flipping the bits 
so that 1=inclusion; 0=exclusion in the grouping set.

However, many RDBMSs that do have a grouping_id implement it by the ordinal 
position of the fields in the select clause, rather than allocating the bits as 
the columns are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27340) Alias on TimeWIndow expression may cause watermark metadata lost

2020-04-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27340.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28326
[https://github.com/apache/spark/pull/28326]

> Alias on TimeWIndow expression may cause watermark metadata lost 
> -
>
> Key: SPARK-27340
> URL: https://issues.apache.org/jira/browse/SPARK-27340
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Kevin Zhang
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> When we use the DataFrame/Dataset API to write a structured streaming query 
> job, we usually specify a watermark on the event time column. If we define a 
> window on the event time column, the delayKey metadata of the event time column 
> is supposed to be propagated to the new column generated by the time window 
> expression. But if we add an additional alias on the time window column, the 
> delayKey metadata is lost.
> Currently I find the bug only affects stream-stream joins with equal 
> window join keys. In terms of aggregation, the grouping expressions can be 
> trimmed (in the CleanupAliases rule), so additional aliases are removed and the 
> metadata is kept.
> Here is an example:
> {code:scala}
>   val sparkSession = SparkSession
> .builder()
> .master("local")
> .getOrCreate()
>   val rateStream = sparkSession.readStream
> .format("rate")
> .option("rowsPerSecond", 10)
> .load()
> val fooStream = rateStream
>   .select(
> col("value").as("fooId"),
> col("timestamp").as("fooTime")
>   )
>   .withWatermark("fooTime", "2 seconds")
>   .select($"fooId", $"fooTime", window($"fooTime", "2 
> seconds").alias("fooWindow"))
> val barStream = rateStream
>   .where(col("value") % 2 === 0)
>   .select(
> col("value").as("barId"),
> col("timestamp").as("barTime")
>   )
>   .withWatermark("barTime", "2 seconds")
>   .select($"barId", $"barTime", window($"barTime", "2 
> seconds").alias("barWindow"))
> val joinedDf = fooStream
>   .join(
> barStream,
> $"fooId" === $"barId" &&
>   fooStream.col("fooWindow") === barStream.col("barWindow"),
> joinType = "LeftOuter"
>   )
>   val query = joinedDf
>   .writeStream
>   .format("console")
>   .option("truncate", 100)
>   .trigger(Trigger.ProcessingTime("5 seconds"))
>   .start()
> query.awaitTermination()
> {code}
> this program will end with an exception, and from the analyzed plan we can 
> see there is no delayKey metadata on 'fooWindow'
> {code:java}
> org.apache.spark.sql.AnalysisException: Stream-stream outer join between two 
> streaming DataFrame/Datasets is not supported without a watermark in the join 
> keys, or a watermark on the nullable side and an appropriate range condition;;
> Join LeftOuter, ((fooId#4L = barId#14L) && (fooWindow#9 = barWindow#19))
> :- Project [fooId#4L, fooTime#5-T2000ms, window#10-T2000ms AS fooWindow#9]
> :  +- Filter isnotnull(fooTime#5-T2000ms)
> : +- Project [named_struct(start, precisetimestampconversion(CASE 
> WHEN (cast(CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, 
> TimestampType, LongType) - 0) as double) / cast(200 as double))) as 
> double) = (cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, 
> LongType) - 0) as double) / cast(200 as double))) THEN 
> (CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, 
> LongType) - 0) as double) / cast(200 as double))) + cast(1 as bigint)) 
> ELSE CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, 
> LongType) - 0) as double) / cast(200 as double))) END + cast(0 as 
> bigint)) - cast(1 as bigint)) * 200) + 0), LongType, TimestampType), end, 
> precisetimestampconversion((CASE WHEN 
> (cast(CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, 
> TimestampType, LongType) - 0) as double) / cast(200 as double))) as 
> double) = (cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, 
> LongType) - 0) as double) / cast(200 as double))) THEN 
> (CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, 
> LongType) - 0) as double) / cast(200 as double))) + cast(1 as bigint)) 
> ELSE CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, 
> LongType) - 0) as double) / cast(200 as double))) END + cast(0 as 
> bigint)) - cast(1 as bigint)) * 200) + 0) + 200), LongType, 
> TimestampType)) AS window#10-T2000ms, fooId#4L, fooTime#5-T2000ms]
> :+- EventTimeWatermark fooTime#5: timestamp, interval 2 seconds
> :   +- Project [value#1L 

[jira] [Assigned] (SPARK-27340) Alias on TimeWIndow expression may cause watermark metadata lost

2020-04-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-27340:
-

Assignee: Yuanjian Li

> Alias on TimeWIndow expression may cause watermark metadata lost 
> -
>
> Key: SPARK-27340
> URL: https://issues.apache.org/jira/browse/SPARK-27340
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Kevin Zhang
>Assignee: Yuanjian Li
>Priority: Major
>
> When we use the DataFrame/Dataset API to write a structured streaming query, we usually 
> specify a watermark on the event time column. If we define a window on the event 
> time column, the delayKey metadata of the event time column is supposed to be 
> propagated to the new column generated by the time window expression. But if we 
> add an additional alias on the time window column, the delayKey metadata is lost.
> So far I have only found that the bug affects stream-stream joins with equal 
> window join keys. For aggregation, the grouping expression can be 
> trimmed (in the CleanupAliases rule), so additional aliases are removed and the 
> metadata is kept.
> Here is an example:
> {code:scala}
>   val sparkSession = SparkSession
> .builder()
> .master("local")
> .getOrCreate()
>   val rateStream = sparkSession.readStream
> .format("rate")
> .option("rowsPerSecond", 10)
> .load()
> val fooStream = rateStream
>   .select(
> col("value").as("fooId"),
> col("timestamp").as("fooTime")
>   )
>   .withWatermark("fooTime", "2 seconds")
>   .select($"fooId", $"fooTime", window($"fooTime", "2 
> seconds").alias("fooWindow"))
> val barStream = rateStream
>   .where(col("value") % 2 === 0)
>   .select(
> col("value").as("barId"),
> col("timestamp").as("barTime")
>   )
>   .withWatermark("barTime", "2 seconds")
>   .select($"barId", $"barTime", window($"barTime", "2 
> seconds").alias("barWindow"))
> val joinedDf = fooStream
>   .join(
> barStream,
> $"fooId" === $"barId" &&
>   fooStream.col("fooWindow") === barStream.col("barWindow"),
> joinType = "LeftOuter"
>   )
>   val query = joinedDf
>   .writeStream
>   .format("console")
>   .option("truncate", 100)
>   .trigger(Trigger.ProcessingTime("5 seconds"))
>   .start()
> query.awaitTermination()
> {code}
> this program will end with an exception, and from the analyzed plan we can 
> see there is no delayKey metadata on 'fooWindow'
> {code:java}
> org.apache.spark.sql.AnalysisException: Stream-stream outer join between two 
> streaming DataFrame/Datasets is not supported without a watermark in the join 
> keys, or a watermark on the nullable side and an appropriate range condition;;
> Join LeftOuter, ((fooId#4L = barId#14L) && (fooWindow#9 = barWindow#19))
> :- Project [fooId#4L, fooTime#5-T2000ms, window#10-T2000ms AS fooWindow#9]
> :  +- Filter isnotnull(fooTime#5-T2000ms)
> : +- Project [named_struct(start, precisetimestampconversion(CASE 
> WHEN (cast(CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, 
> TimestampType, LongType) - 0) as double) / cast(200 as double))) as 
> double) = (cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, 
> LongType) - 0) as double) / cast(200 as double))) THEN 
> (CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, 
> LongType) - 0) as double) / cast(200 as double))) + cast(1 as bigint)) 
> ELSE CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, 
> LongType) - 0) as double) / cast(200 as double))) END + cast(0 as 
> bigint)) - cast(1 as bigint)) * 200) + 0), LongType, TimestampType), end, 
> precisetimestampconversion((CASE WHEN 
> (cast(CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, 
> TimestampType, LongType) - 0) as double) / cast(200 as double))) as 
> double) = (cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, 
> LongType) - 0) as double) / cast(200 as double))) THEN 
> (CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, 
> LongType) - 0) as double) / cast(200 as double))) + cast(1 as bigint)) 
> ELSE CEIL((cast((precisetimestampconversion(fooTime#5-T2000ms, TimestampType, 
> LongType) - 0) as double) / cast(200 as double))) END + cast(0 as 
> bigint)) - cast(1 as bigint)) * 200) + 0) + 200), LongType, 
> TimestampType)) AS window#10-T2000ms, fooId#4L, fooTime#5-T2000ms]
> :+- EventTimeWatermark fooTime#5: timestamp, interval 2 seconds
> :   +- Project [value#1L AS fooId#4L, timestamp#0 AS fooTime#5]
> :  +- StreamingRelationV2 
> 

[jira] [Created] (SPARK-31582) Be able to not populate Hadoop classpath

2020-04-27 Thread DB Tsai (Jira)
DB Tsai created SPARK-31582:
---

 Summary: Be able to not populate Hadoop classpath
 Key: SPARK-31582
 URL: https://issues.apache.org/jira/browse/SPARK-31582
 Project: Spark
  Issue Type: New Feature
  Components: YARN
Affects Versions: 2.4.5
Reporter: DB Tsai


The Spark YARN client populates the Hadoop classpath from 
`yarn.application.classpath` and `mapreduce.application.classpath`. However, 
for a Spark build with embedded Hadoop, this can result in jar conflicts because 
the Spark distribution may contain a different version of the Hadoop jars.

We are adding a new YARN configuration to not populate the Hadoop classpath from 
`yarn.application.classpath` and `mapreduce.application.classpath`.
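
A minimal sketch of how the opt-out might look from user code (the configuration key name below is only an assumption for illustration; the actual name is to be decided in the PR):

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical key name, shown only to illustrate opting out of populating the
// classpath from yarn.application.classpath and mapreduce.application.classpath
// when running a with-Hadoop Spark build on YARN.
val spark = SparkSession.builder()
  .config("spark.yarn.populateHadoopClasspath", "false") // assumed key
  .getOrCreate()
{code}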



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26631) Issue while reading Parquet data from Hadoop Archive files (.har)

2020-04-27 Thread Tien Dat (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093781#comment-17093781
 ] 

Tien Dat commented on SPARK-26631:
--

Dear experts, this also happens on Spark 2.4.0.

Could you please give an update on this issue, or at least explain why it has 
not been tackled?

> Issue while reading Parquet data from Hadoop Archive files (.har)
> -
>
> Key: SPARK-26631
> URL: https://issues.apache.org/jira/browse/SPARK-26631
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sathish
>Priority: Minor
>
> While reading Parquet file from Hadoop Archive file Spark is failing with 
> below exception
>  
> {code:java}
> scala> val hardf = 
> sqlContext.read.parquet("har:///tmp/testarchive.har/userdata1.parquet") 
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
>    at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
>    at scala.Option.getOrElse(Option.scala:121)   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207)
>    at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393)
>    at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)  
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)   ... 
> 49 elided
> {code}
>  
> Whereas the same parquet file can be read normally without any issues
> {code:java}
> scala> val df = 
> sqlContext.read.parquet("hdfs:///tmp/testparquet/userdata1.parquet")
> df: org.apache.spark.sql.DataFrame = [registration_dttm: timestamp, id: int 
> ... 11 more fields]{code}
>  
> +Here are the steps to reproduce the issue+
>  
> a) hadoop fs -mkdir /tmp/testparquet
> b) Get sample parquet data and rename the file to userdata1.parquet
> wget 
> [https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet?raw=true]
> c) hadoop fs -put userdata1.parquet /tmp/testparquet
> d) hadoop archive -archiveName testarchive.har -p /tmp/testparquet /tmp
> e) We should be able to see the file under har file
> hadoop fs -ls har:///tmp/testarchive.har
> f) Launch spark2 / spark shell
> g)
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>     val df = 
> sqlContext.read.parquet("har:///tmp/testarchive.har/userdata1.parquet"){code}
> Is there anything I am missing here?
>  
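
A possible workaround sketch (an untested assumption, not from this ticket): skip schema inference over the har:// filesystem by reusing the schema inferred from the plain HDFS copy of the same file.

{code:scala}
// Untested sketch: infer the schema where inference is known to work, then
// pass it explicitly so the har:// read does not need to infer it.
val schema = sqlContext.read
  .parquet("hdfs:///tmp/testparquet/userdata1.parquet")
  .schema

val hardf = sqlContext.read
  .schema(schema)
  .parquet("har:///tmp/testarchive.har/userdata1.parquet")
{code}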



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31580) Upgrade Apache ORC to 1.5.10

2020-04-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31580:
--
Issue Type: Bug  (was: Improvement)

> Upgrade Apache ORC to 1.5.10
> 
>
> Key: SPARK-31580
> URL: https://issues.apache.org/jira/browse/SPARK-31580
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31580) Upgrade Apache ORC to 1.5.10

2020-04-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31580:
--
Affects Version/s: (was: 3.1.0)
   3.0.0

> Upgrade Apache ORC to 1.5.10
> 
>
> Key: SPARK-31580
> URL: https://issues.apache.org/jira/browse/SPARK-31580
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31581) paste0 is always better than paste(sep="")

2020-04-27 Thread Michael Chirico (Jira)
Michael Chirico created SPARK-31581:
---

 Summary: paste0 is always better than paste(sep="")
 Key: SPARK-31581
 URL: https://issues.apache.org/jira/browse/SPARK-31581
 Project: Spark
  Issue Type: Documentation
  Components: R
Affects Versions: 2.4.5
Reporter: Michael Chirico


paste0 is available in the stated R dependency (3.1), so we should use it 
instead of paste(sep="")



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31580) Upgrade Apache ORC to 1.5.10

2020-04-27 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31580:
-

 Summary: Upgrade Apache ORC to 1.5.10
 Key: SPARK-31580
 URL: https://issues.apache.org/jira/browse/SPARK-31580
 Project: Spark
  Issue Type: Improvement
  Components: Build, SQL
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()

2020-04-27 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31579:
--

 Summary: Replace floorDiv by / in 
localRebaseGregorianToJulianDays()
 Key: SPARK-31579
 URL: https://issues.apache.org/jira/browse/SPARK-31579
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0, but we need to check that 
for all available time zones over the range of years [0001, 2100] with a step 
of 1 hour or smaller. If this hypothesis is confirmed, floorDiv can be 
replaced by /, which should improve the performance of 
RebaseDateTime.localRebaseGregorianToJulianDays.
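
For context, a small standalone sketch (not Spark code) of the difference the check has to rule out: floorDiv and / agree except for negative dividends that are not exact multiples of the divisor.

{code:scala}
// Standalone illustration of Math.floorDiv vs integer division.
val MILLIS_PER_DAY = 24L * 60 * 60 * 1000

val exact = -3L * MILLIS_PER_DAY
assert(Math.floorDiv(exact, MILLIS_PER_DAY) == exact / MILLIS_PER_DAY) // both -3

val notExact = -3L * MILLIS_PER_DAY - 1
assert(Math.floorDiv(notExact, MILLIS_PER_DAY) == -4) // floors toward negative infinity
assert(notExact / MILLIS_PER_DAY == -3)               // truncates toward zero
{code}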



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31578) Internal checkSchemaInArrow is inefficient

2020-04-27 Thread Michael Chirico (Jira)
Michael Chirico created SPARK-31578:
---

 Summary: Internal checkSchemaInArrow is inefficient
 Key: SPARK-31578
 URL: https://issues.apache.org/jira/browse/SPARK-31578
 Project: Spark
  Issue Type: Documentation
  Components: R
Affects Versions: 2.4.5
Reporter: Michael Chirico


The current implementation is doubly inefficient:

- it repeatedly runs what is essentially the same (roughly 95% overlapping) sapply loop, and
- it applies a scalar == inside that loop instead of applying == over the whole vector at once, which is what vectorized comparison is for.

See existing PR:

https://github.com/apache/spark/pull/28372



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31569) Add links to subsections in SQL Reference main page

2020-04-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31569.
--
Fix Version/s: 3.0.0
 Assignee: Huaxin Gao
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28360

> Add links to subsections in SQL Reference main page
> ---
>
> Key: SPARK-31569
> URL: https://issues.apache.org/jira/browse/SPARK-31569
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add links to subsections in SQL Reference main page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31577) fix various problems when check name conflicts of CTE relations

2020-04-27 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31577:
---

 Summary: fix various problems when check name conflicts of CTE 
relations
 Key: SPARK-31577
 URL: https://issues.apache.org/jira/browse/SPARK-31577
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31576) Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2020-04-27 Thread liuzhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuzhang updated SPARK-31576:
-
Description: 
I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
Unfortunately, when I try to query data that resides in every column I get the 
following error:

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)
 at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
 at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)

1)  On Hive create a simple table,its name is "test",it have three 
column(aname,score,banji),their type both are "String"

2)important code:

object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2") || url.contains("hive2")
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

---

object callOffRun {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    JdbcDialects.registerDialect(HiveDialect)
    val props = new Properties()
    props.put("driver", "org.apache.hive.jdbc.HiveDriver")
    props.put("user", "username")
    props.put("password", "password")
    props.put("fetchsize", "20")
    val table = spark.read.jdbc("jdbc:hive2://:1", "test", props)
    table.show()
  }
}

3)spark-submit ,After running,it have error

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)

4)table.count() have result 

5) I try some method to print result,They all reported the same error

 

  was:
I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
Unfortunately, when I try to query data that resides in every column I get the 
following error:

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)
 at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
 at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)

1)  On Hive create a simple table,its name is "test",it have three 
column(aname,score,banji),their type both are "String"

2)important code:

object HiveDialect extends JdbcDialect

{ override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")|| 
url.contains("hive2")                                                           
                                                     override def 
quoteIdentifier(colName: String): String = s"`$colName`" }

---

object callOffRun {
 def main(args: Array[String]): Unit =

{ val spark = SparkSession.builder().enableHiveSupport().getOrCreate() 
JdbcDialects.registerDialect(HiveDialect)   val props = new Properties()   
props.put("driver","org.apache.hive.jdbc.HiveDriver")   
props.put("user","username")   props.put("password","password")   
props.put("fetchsize","20")   val table=spark.read 
.jdbc("jdbc:hive2://:1","test",props)   table.show() }

}

3)spark-submit ,After running,it have error

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)

4)table.count() have result 

5) I try some method to print result,They all reported the same error

 


> Unable to return Hive data into Spark via Hive JDBC driver Caused by:  
> org.apache.hive.service.cli.HiveSQLException: Error while compiling 
> statement: FAILED
> 
>
> Key: 

[jira] [Updated] (SPARK-31576) Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2020-04-27 Thread liuzhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuzhang updated SPARK-31576:
-
Description: 
I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
Unfortunately, when I try to query data that resides in every column I get the 
following error:

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)
 at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
 at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)

1)  On Hive create a simple table,its name is "test",it have three 
column(aname,score,banji),their type both are "String"

2)important code:

object HiveDialect extends JdbcDialect

{ override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")|| 
url.contains("hive2")                                                           
                                                     override def 
quoteIdentifier(colName: String): String = s"`$colName`" }

---

object callOffRun {
 def main(args: Array[String]): Unit =

{ val spark = SparkSession.builder().enableHiveSupport().getOrCreate() 
JdbcDialects.registerDialect(HiveDialect)   val props = new Properties()   
props.put("driver","org.apache.hive.jdbc.HiveDriver")   
props.put("user","username")   props.put("password","password")   
props.put("fetchsize","20")   val table=spark.read 
.jdbc("jdbc:hive2://:1","test",props)   table.show() }

}

3)spark-submit ,After running,it have error

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)

4)table.count() have result 

5) I try some method to print result,They all reported the same error

 

  was:
I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
Unfortunately, when I try to query data that resides in every column I get the 
following error:

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)
 at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
 at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)

1)  On Hive create a simple table,its name is "test",it have three 
column(aname,score,banji),their type both are "String"

2)important code:

object HiveDialect extends JdbcDialect

{ override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")|| 
url.contains("hive2")

 

override def quoteIdentifier(colName: String): String = s"`$colName`" }

---

object callOffRun {
 def main(args: Array[String]): Unit =

{ val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

JdbcDialects.registerDialect(HiveDialect)

 

val props = new Properties()

 

props.put("driver","org.apache.hive.jdbc.HiveDriver")

 

props.put("user","username")

 

props.put("password","password")

 

props.put("fetchsize","20")

 

val table=spark.read .jdbc("jdbc:hive2://:1","test",props)

 

table.show() }

}

3)spark-submit ,After running,it have error

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)

4)table.count() have result 

5) I try some method to print result,They all reported the same error

 


> Unable to return Hive data into Spark via Hive JDBC driver Caused by:  
> org.apache.hive.service.cli.HiveSQLException: Error while compiling 
> statement: FAILED
> 
>
> Key: SPARK-31576
> URL: https://issues.apache.org/jira/browse/SPARK-31576
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.3.1
> Environment: hdp 3.0,hadoop 3.1.1,spark 2.3.1
>Reporter: liuzhang
>Priority: Major
>
> I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
> Unfortunately, when I try to query data that resides in every column I get 
> the following error:
> Caused by: 

[jira] [Updated] (SPARK-31576) Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2020-04-27 Thread liuzhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuzhang updated SPARK-31576:
-
Description: 
I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
Unfortunately, when I try to query data that resides in every column I get the 
following error:

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)
 at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
 at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)

1)  On Hive create a simple table,its name is "test",it have three 
column(aname,score,banji),their type both are "String"

2)important code:

object HiveDialect extends JdbcDialect

{ override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")|| 
url.contains("hive2")

 

override def quoteIdentifier(colName: String): String = s"`$colName`" }

---

object callOffRun {
 def main(args: Array[String]): Unit =

{ val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

JdbcDialects.registerDialect(HiveDialect)

 

val props = new Properties()

 

props.put("driver","org.apache.hive.jdbc.HiveDriver")

 

props.put("user","username")

 

props.put("password","password")

 

props.put("fetchsize","20")

 

val table=spark.read .jdbc("jdbc:hive2://:1","test",props)

 

table.show() }

}

3)spark-submit ,After running,it have error

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)

4)table.count() have result 

5) I try some method to print result,They all reported the same error

 

  was:
I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
Unfortunately, when I try to query data that resides in every column I get the 
following error:

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)
 at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
 at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)

1)  On Hive create a simple table,its name is "test",it have three 
column(aname,score,banji),their type both are "String"

2)important code:

object HiveDialect extends JdbcDialect

{ override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")|| 
url.contains("hive2")

override def quoteIdentifier(colName: String): String = s"`$colName`" }

---

object callOffRun {
 def main(args: Array[String]): Unit =

{ val spark = SparkSession.builder().enableHiveSupport().getOrCreate() 
JdbcDialects.registerDialect(HiveDialect) val props = new Properties() 
props.put("driver","org.apache.hive.jdbc.HiveDriver") 
props.put("user","username") props.put("password","password") 
props.put("fetchsize","20") val table=spark.read 
.jdbc("jdbc:hive2://:1","test",props) table.show() }

}

3)spark-submit ,After running,it have error

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)

4)table.count() have result 

5) I try some method to print result,They all reported the same error

 


> Unable to return Hive data into Spark via Hive JDBC driver Caused by:  
> org.apache.hive.service.cli.HiveSQLException: Error while compiling 
> statement: FAILED
> 
>
> Key: SPARK-31576
> URL: https://issues.apache.org/jira/browse/SPARK-31576
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.3.1
> Environment: hdp 3.0,hadoop 3.1.1,spark 2.3.1
>Reporter: liuzhang
>Priority: Major
>
> I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
> Unfortunately, when I try to query data that resides in every column I get 
> the following error:
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error while 
> compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 
> Invalid table 

[jira] [Updated] (SPARK-31576) Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2020-04-27 Thread liuzhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuzhang updated SPARK-31576:
-
Description: 
I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
Unfortunately, when I try to query data that resides in every column I get the 
following error:

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)
 at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
 at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)

1)  On Hive create a simple table,its name is "test",it have three 
column(aname,score,banji),their type both are "String"

2)important code:

object HiveDialect extends JdbcDialect

{ override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")|| 
url.contains("hive2")

override def quoteIdentifier(colName: String): String = s"`$colName`" }

---

object callOffRun {
 def main(args: Array[String]): Unit =

{ val spark = SparkSession.builder().enableHiveSupport().getOrCreate() 
JdbcDialects.registerDialect(HiveDialect) val props = new Properties() 
props.put("driver","org.apache.hive.jdbc.HiveDriver") 
props.put("user","username") props.put("password","password") 
props.put("fetchsize","20") val table=spark.read 
.jdbc("jdbc:hive2://:1","test",props) table.show() }

}

3)spark-submit ,After running,it have error

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)

4)table.count() have result 

5) I try some method to print result,They all reported the same error

 

  was:
I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
Unfortunately, when I try to query data that resides in every column I get the 
following error:

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)
 at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
 at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)

1)  On Hive create a simple table,its name is "test",it have three 
column(aname,score,banji),their type both are "String"

2)important code:

object HiveDialect extends JdbcDialect {
 override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")|| 
url.contains("hive2")
 override def quoteIdentifier(colName: String): String = s"`$colName`"
 }

---

object callOffRun {
 def main(args: Array[String]): Unit =

{ val spark = SparkSession.builder().enableHiveSupport().getOrCreate() 
JdbcDialects.registerDialect(HiveDialect)

val props = new Properties()

props.put("driver","org.apache.hive.jdbc.HiveDriver")

props.put("user","username")

props.put("password","password")

props.put("fetchsize","20")

val table=spark.read .jdbc("jdbc:hive2://:1","test",props)

table.show() }

}

3)spark-submit ,After running,it have error

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)

4)table.count() have result 

5) I try some method to print result,They all reported the same error

 


> Unable to return Hive data into Spark via Hive JDBC driver Caused by:  
> org.apache.hive.service.cli.HiveSQLException: Error while compiling 
> statement: FAILED
> 
>
> Key: SPARK-31576
> URL: https://issues.apache.org/jira/browse/SPARK-31576
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.3.1
> Environment: hdp 3.0,hadoop 3.1.1,spark 2.3.1
>Reporter: liuzhang
>Priority: Major
>
> I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
> Unfortunately, when I try to query data that resides in every column I get 
> the following error:
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error while 
> compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 
> Invalid table alias or column 

[jira] [Updated] (SPARK-31576) Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2020-04-27 Thread liuzhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuzhang updated SPARK-31576:
-
Description: 
I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
Unfortunately, when I try to query data that resides in every column I get the 
following error:

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)
 at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
 at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)

1)  On Hive create a simple table,its name is "test",it have three 
column(aname,score,banji),their type both are "String"

2)important code:

object HiveDialect extends JdbcDialect {
 override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")|| 
url.contains("hive2")
 override def quoteIdentifier(colName: String): String = s"`$colName`"
 }

---

object callOffRun {
 def main(args: Array[String]): Unit =

{ val spark = SparkSession.builder().enableHiveSupport().getOrCreate() 
JdbcDialects.registerDialect(HiveDialect)

val props = new Properties()

props.put("driver","org.apache.hive.jdbc.HiveDriver")

props.put("user","username")

props.put("password","password")

props.put("fetchsize","20")

val table=spark.read .jdbc("jdbc:hive2://:1","test",props)

table.show() }

}

3)spark-submit ,After running,it have error

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)

4)table.count() have result 

5) I try some method to print result,They all reported the same error

 

  was:
I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
Unfortunately, when I try to query data that resides in every column I get the 
following error:

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)
 at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
 at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)

1)  On Hive create a simple table,its name is "test",it have three 
column(aname,score,banji),their type both are "String"

2)important code:

object HiveDialect extends JdbcDialect {
 override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")|| 
url.contains("hive2")
 override def quoteIdentifier(colName: String): String = s"`$colName`"
}

---

object callOffRun {
def main(args: Array[String]): Unit = {

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

JdbcDialects.registerDialect(HiveDialect)
val props = new Properties()
props.put("driver","org.apache.hive.jdbc.HiveDriver")
props.put("user","username")
props.put("password","password")
props.put("fetchsize","20")

val table=spark.read

.jdbc("jdbc:hive2://:1","test",props)

table.show()

}

}

3)spark-submit ,After running,it have error

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)

4)table.count() have result 

5) i try some method to print result,They all reported the same error

 


> Unable to return Hive data into Spark via Hive JDBC driver Caused by:  
> org.apache.hive.service.cli.HiveSQLException: Error while compiling 
> statement: FAILED
> 
>
> Key: SPARK-31576
> URL: https://issues.apache.org/jira/browse/SPARK-31576
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.3.1
> Environment: hdp 3.0,hadoop 3.1.1,spark 2.3.1
>Reporter: liuzhang
>Priority: Major
>
> I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
> Unfortunately, when I try to query data that resides in every column I get 
> the following error:
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error while 
> compiling statement: FAILED: SemanticException [Error 10004]: Line 1:7 
> Invalid table alias or column 

[jira] [Created] (SPARK-31576) Unable to return Hive data into Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2020-04-27 Thread liuzhang (Jira)
liuzhang created SPARK-31576:


 Summary: Unable to return Hive data into Spark via Hive JDBC 
driver Caused by:  org.apache.hive.service.cli.HiveSQLException: Error while 
compiling statement: FAILED
 Key: SPARK-31576
 URL: https://issues.apache.org/jira/browse/SPARK-31576
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Spark Submit
Affects Versions: 2.3.1
 Environment: hdp 3.0,hadoop 3.1.1,spark 2.3.1
Reporter: liuzhang


I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
Unfortunately, when I try to query data that resides in every column I get the 
following error:

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)
 at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:335)
 at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:199)

1) On Hive, create a simple table named "test" with three columns (aname, score, banji), all of type String.

2) Important code:

object HiveDialect extends JdbcDialect {
 override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")|| 
url.contains("hive2")
 override def quoteIdentifier(colName: String): String = s"`$colName`"
}

---

object callOffRun {
def main(args: Array[String]): Unit = {

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

JdbcDialects.registerDialect(HiveDialect)
val props = new Properties()
props.put("driver","org.apache.hive.jdbc.HiveDriver")
props.put("user","username")
props.put("password","password")
props.put("fetchsize","20")

val table = spark.read.jdbc("jdbc:hive2://:1", "test", props)

table.show()

}

}

3) After running with spark-submit, it produces this error:

Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling 
statement: FAILED: SemanticException [Error 10004]: Line 1:7 Invalid table 
alias or column reference 'test.aname': (possible column names are: aname, 
score, banji)

4) table.count() does return a result.

5) I tried several methods to print the result; they all reported the same error.
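
A possible avenue to investigate (an assumption on my side, not verified against this report): HiveServer2 returns result-set column labels qualified as "table.column" when hive.resultset.use.unique.column.names is true (the Hive default), which can confuse consumers that feed those labels back into generated SQL. A sketch of passing the property as a Hive conf in the JDBC URL:

{code:scala}
// Assumption, not a verified fix: turn off qualified column labels on the
// Hive side and see whether Spark's generated query still fails.
val table = spark.read
  .jdbc("jdbc:hive2://:1/default?hive.resultset.use.unique.column.names=false",
        "test", props)
table.show()
{code}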

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31527) date add/subtract interval only allow those day precision in ansi mode

2020-04-27 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093462#comment-17093462
 ] 

Kent Yao edited comment on SPARK-31527 at 4/27/20, 12:44 PM:
-

Added a followup PR for benchmark [https://github.com/apache/spark/pull/28369]


was (Author: qin yao):
Add a followup PR for benchmark [https://github.com/apache/spark/pull/28369]

> date add/subtract interval only allow those day precision in ansi mode
> --
>
> Key: SPARK-31527
> URL: https://issues.apache.org/jira/browse/SPARK-31527
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Under ANSI mode, we should not allow date add interval with hours, minutes... 
> microseconds.
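
For illustration, a hedged sketch of the behaviour described above (assuming ANSI mode is enabled via spark.sql.ansi.enabled; not the output of an actual run):

{code:scala}
// Sketch only: under ANSI mode, adding a day-precision interval to a DATE is
// allowed, while sub-day precision (hours, minutes, ..., microseconds) should fail.
spark.conf.set("spark.sql.ansi.enabled", "true")

spark.sql("select date '2020-04-27' + interval 1 day").show()  // allowed
spark.sql("select date '2020-04-27' + interval 1 hour").show() // expected to be rejected
{code}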



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31527) date add/subtract interval only allow those day precision in ansi mode

2020-04-27 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093462#comment-17093462
 ] 

Kent Yao commented on SPARK-31527:
--

Add a followup PR for benchmark [https://github.com/apache/spark/pull/28369]

> date add/subtract interval only allow those day precision in ansi mode
> --
>
> Key: SPARK-31527
> URL: https://issues.apache.org/jira/browse/SPARK-31527
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Under ANSI mode, we should not allow date add interval with hours, minutes... 
> microseconds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31575) Synchronise global JVM security configuration modification

2020-04-27 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-31575:
-

 Summary: Synchronise global JVM security configuration modification
 Key: SPARK-31575
 URL: https://issues.apache.org/jira/browse/SPARK-31575
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Gabor Somogyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31529) Remove extra whitespaces in the formatted explain

2020-04-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31529.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28315
[https://github.com/apache/spark/pull/28315]

> Remove extra whitespaces in the formatted explain
> -
>
> Key: SPARK-31529
> URL: https://issues.apache.org/jira/browse/SPARK-31529
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> The formatted explain included extra whitespaces. And even the number of 
> spaces are different between master and branch-3.0, which leads to failed 
> explain tests if we backport to branch-3.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31529) Remove extra whitespaces in the formatted explain

2020-04-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31529:
---

Assignee: wuyi

> Remove extra whitespaces in the formatted explain
> -
>
> Key: SPARK-31529
> URL: https://issues.apache.org/jira/browse/SPARK-31529
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> The formatted explain included extra whitespaces. And even the number of 
> spaces are different between master and branch-3.0, which leads to failed 
> explain tests if we backport to branch-3.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2020-04-27 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

胡振宇 updated SPARK-14850:

Comment: was deleted

(was: /* code is for Spark 1.6.1 */
object Example {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Example")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val count = sqlContext.sparkContext.parallelize(0 until 1e4.toInt, 1).map {
      i => (i, Vectors.dense(Array.fill(1e6.toInt)(1.0)))
    }.toDF().rdd.count() // at this step toDF can be used on Spark 1.6.1
  }
}

so I am not able to test the simple serialization example
)

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> In SPARK-9390, we switched to use GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specialize GenericArrayData or use a different container.
> cc: [~cloud_fan] [~yhuai]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing

2020-04-27 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

胡振宇 updated SPARK-14850:

Comment: was deleted

(was: I try to run your code on spark1.6.1 but i found that "toDF" cannot be 
used in this example
Here are my code 
object Example{
def main (args:Array[String]){
  case class Test(num:Int,vector:Vector)
  val conf = new SparkConf.setAppname("Example")
  val sqlContext=new SQLContext(sc)
  import sqlContext.implicts._
  val temp=sqlContext.sparkContext.parallelize(0,until 
1e4.toInt,1).map(i=>Test(i,Vectors.dense(Array.fill(1e6.toInt)(1.0.toDF() 
//at this step toDF can be used I do
 }
}



sc.parallelize(0 until 1e4.toInt, 1).map { i =>
  (i, Vectors.dense(Array.fill(1e6.toInt)(1.0)))
}.toDF.rdd.count()

I even use sparkcontext but toDF cannot be used too

Do you have a solution to run the example on spark1.6.1? Thank you 

} )

> VectorUDT/MatrixUDT should take primitive arrays without boxing
> ---
>
> Key: SPARK-14850
> URL: https://issues.apache.org/jira/browse/SPARK-14850
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> In SPARK-9390, we switched to use GenericArrayData to store indices and 
> values in vector/matrix UDTs. However, GenericArrayData is not specialized 
> for primitive types. This might hurt MLlib performance badly. We should 
> consider either specialize GenericArrayData or use a different container.
> cc: [~cloud_fan] [~yhuai]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17604) Support purging aged file entry for FileStreamSource metadata log

2020-04-27 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-17604:
-
Affects Version/s: 3.1.0
   Labels:   (was: bulk-closed)
 Priority: Major  (was: Minor)

> Support purging aged file entry for FileStreamSource metadata log
> -
>
> Key: SPARK-17604
> URL: https://issues.apache.org/jira/browse/SPARK-17604
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Saisai Shao
>Priority: Major
>
> Currently, with SPARK-15698, the FileStreamSource metadata log is compacted 
> periodically (every 10 batches by default), which means the compacted batch file 
> contains all file entries that have been processed. As time passes, the 
> compacted batch file accumulates into a relatively large file. 
> With SPARK-17165, {{FileStreamSource}} no longer tracks aged file entries, 
> but the log still keeps the full records, which is unnecessary and 
> quite time-consuming during recovery. So I propose to also add file-entry 
> purging to the {{FileStreamSource}} metadata log.
> This is pending on SPARK-15698.
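
For context, the compaction cadence mentioned above is governed by an existing setting (sketch only; it controls how often the log is compacted and does not purge aged entries, which is what this ticket proposes to add):

{code:scala}
import org.apache.spark.sql.SparkSession

// Existing knob: how often the file-source metadata log is compacted.
// It does not remove aged entries from the compacted log.
val spark = SparkSession.builder()
  .config("spark.sql.streaming.fileSource.log.compactInterval", "10") // default value
  .getOrCreate()
{code}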



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31550) nondeterministic configurations with general meanings in sql configuration doc

2020-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31550:


Assignee: Kent Yao

> nondeterministic configurations with general meanings in sql configuration doc
> --
>
> Key: SPARK-31550
> URL: https://issues.apache.org/jira/browse/SPARK-31550
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> spark.sql.session.timeZone
> spark.sql.warehouse.dir
>  
> these 2 configs are nondeterministic and vary with environments



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17604) Support purging aged file entry for FileStreamSource metadata log

2020-04-27 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-17604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reopened SPARK-17604:
--

Reopening this, as end user reported this in user mailing list recently.

https://lists.apache.org/thread.html/r897771f5526d10d0b13da9177a6b7d2e37b22823c839cceea457%40%3Cuser.spark.apache.org%3E

> Support purging aged file entry for FileStreamSource metadata log
> -
>
> Key: SPARK-17604
> URL: https://issues.apache.org/jira/browse/SPARK-17604
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Saisai Shao
>Priority: Minor
>  Labels: bulk-closed
>
> Currently, with SPARK-15698, the FileStreamSource metadata log is compacted 
> periodically (every 10 batches by default), which means the compacted batch file 
> contains all file entries that have been processed. As time passes, the 
> compacted batch file accumulates into a relatively large file. 
> With SPARK-17165, {{FileStreamSource}} no longer tracks aged file entries, 
> but the log still keeps the full records, which is unnecessary and 
> quite time-consuming during recovery. So I propose to also add file-entry 
> purging to the {{FileStreamSource}} metadata log.
> This is pending on SPARK-15698.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31550) nondeterministic configurations with general meanings in sql configuration doc

2020-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31550.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28322
[https://github.com/apache/spark/pull/28322]

> nondeterministic configurations with general meanings in sql configuration doc
> --
>
> Key: SPARK-31550
> URL: https://issues.apache.org/jira/browse/SPARK-31550
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> spark.sql.session.timeZone
> spark.sql.warehouse.dir
>  
> These two configs are nondeterministic: their default values are derived from the local environment, so they vary across deployments.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31574) Schema evolution in spark while using the storage format as parquet

2020-04-27 Thread sharad Gupta (Jira)
sharad Gupta created SPARK-31574:


 Summary: Schema evolution in spark while using the storage format 
as parquet
 Key: SPARK-31574
 URL: https://issues.apache.org/jira/browse/SPARK-31574
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: sharad Gupta


Hi Team,

Use case:

Suppose there is a table T1 with a column C1 of datatype int in schema 
version 1. When first onboarding table T1, I wrote a couple of Parquet files 
with this schema version 1, using Parquet as the underlying file format.

In schema version 2, the datatype of C1 changed from int to string, so new 
data is now written to Parquet with schema version 2.

As a result, some Parquet files are written with schema version 1 and some with 
schema version 2.

Problem statement:

1. We are not able to execute the command below from Spark SQL:
```ALTER TABLE T1 CHANGE C1 C1 string```

2. As a workaround, I went to Hive and altered the column datatype there (this 
is supported in Hive), then tried to read the data in Spark, which gives this error:



```

Caused by: java.lang.UnsupportedOperationException: 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary

  at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)

  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)

  at 
org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372)

  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)

  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)

  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)

  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)

  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)

  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)

  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)

  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

  at org.apache.spark.scheduler.Task.run(Task.scala:109)

  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)

  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

  at java.lang.Thread.run(Thread.java:745)```

 

3. I suspect this happens because the underlying Parquet files were written with 
an integer type, while we are reading them through a table whose column has been 
changed to string.

How to reproduce this:

Spark SQL
1. Create a table from Spark SQL with one column of datatype int, stored as Parquet.
2. Put some data into the table.
3. You can see the data if you select from the table.

Hive
1. Change the column datatype from int to string with an ALTER command.
2. Try to read the data; it is still readable here, even after the datatype change.

Spark SQL
1. Try to read the data again; now you will see the error.

So the question is how to handle schema evolution in Spark when using Parquet 
as the storage format.
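
For reference, a minimal SparkR sketch of the reproduction above (a sketch only; the table and column names are illustrative, and it assumes a session with Hive support):

```
library(SparkR)
sparkR.session(enableHiveSupport = TRUE)

# Steps 1-3 (Spark SQL): create an int column backed by Parquet and read it back.
sql("CREATE TABLE t1 (c1 INT) STORED AS PARQUET")
sql("INSERT INTO t1 VALUES (1), (2)")
head(sql("SELECT * FROM t1"))   # reads fine while the table schema matches the files

# After running `ALTER TABLE t1 CHANGE c1 c1 STRING` in Hive, the same read fails
# in Spark, because the existing Parquet footers still describe c1 as INT32:
# head(sql("SELECT * FROM t1")) # -> PlainIntegerDictionary error shown above
```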



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31568) R: gapply documentation could be clearer about what the func argument is

2020-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31568.
--
Fix Version/s: 3.0.0
   2.4.6
 Assignee: Michael Chirico
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/28350

> R: gapply documentation could be clearer about what the func argument is
> 
>
> Key: SPARK-31568
> URL: https://issues.apache.org/jira/browse/SPARK-31568
> Project: Spark
>  Issue Type: Documentation
>  Components: R
>Affects Versions: 2.4.5
>Reporter: Michael Chirico
>Assignee: Michael Chirico
>Priority: Minor
> Fix For: 2.4.6, 3.0.0
>
>
> copied from pre-existing GH PR:
> https://github.com/apache/spark/pull/28350
> Spent a long time this weekend trying to figure out just what exactly key is 
> in gapply's func. I had assumed it would be a named list, but apparently not 
> -- the examples are working because schema is applying the name and the names 
> of the output data.frame don't matter.
> As near as I can tell the description I've added is correct, namely, that key 
> is an unnamed list.
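
A minimal sketch of the behaviour described above (assuming an active SparkR session; column names are illustrative): key arrives as an unnamed list, and the output column names come from schema, not from the data.frame returned by func.

{code}
library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

schema <- structType(structField("cyl", "double"),
                     structField("avg_mpg", "double"))

result <- gapply(df, "cyl", function(key, x) {
  # key is an *unnamed* list of the grouping values: key[[1]] is this group's
  # cyl value, and names(key) is NULL.
  data.frame(cyl = key[[1]], avg_mpg = mean(x$mpg))
}, schema)

head(result)
{code}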



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2020-04-27 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093151#comment-17093151
 ] 

Gabor Somogyi commented on SPARK-12312:
---

Not sure what you mean by renewal. Renewal of what?
 * When the keytab becomes invalid because the password has changed, the same thing must 
happen as in all other use cases: re-distribute the keytab file, which will 
be picked up properly when a new connection is initiated.
 * When the TGT is in question, the solution re-obtains the TGT automatically each 
and every time a new connection is created.

I'm not aware of anything else that could be renewed.

 

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from the JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection due to the lack of a Kerberos ticket or the ability to generate one. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environments where exposing simple 
> authentication access is not an option due to IT policy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31573) Use fixed=TRUE where possible for internal efficiency

2020-04-27 Thread Michael Chirico (Jira)
Michael Chirico created SPARK-31573:
---

 Summary: Use fixed=TRUE where possible for internal efficiency
 Key: SPARK-31573
 URL: https://issues.apache.org/jira/browse/SPARK-31573
 Project: Spark
  Issue Type: Documentation
  Components: R
Affects Versions: 2.4.5
Reporter: Michael Chirico


gsub('_', '', x) is more efficient if we signal that the pattern is a fixed string 
rather than a regex: gsub('_', '', x, fixed = TRUE)
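
A minimal sketch of the idea (standalone base R, not Spark-specific): the two calls below return the same result, but the fixed = TRUE variant skips the regex engine entirely, which adds up on hot internal code paths.

{code}
x <- c("a_b_c", "no_underscores_here")

gsub("_", "", x)                # pattern is compiled as a regex, unnecessarily
gsub("_", "", x, fixed = TRUE)  # literal match, no regex compilation

# The same applies to other base functions that take a pattern, e.g.:
grepl(".", "file.parquet", fixed = TRUE)  # TRUE: matches the literal dot
strsplit("a_b_c", "_", fixed = TRUE)      # splits on the literal underscore
{code}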



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31448) Difference in Storage Levels used in cache() and persist() for pyspark dataframes

2020-04-27 Thread Abhishek Dixit (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093079#comment-17093079
 ] 

Abhishek Dixit commented on SPARK-31448:


Any update on this?

> Difference in Storage Levels used in cache() and persist() for pyspark 
> dataframes
> -
>
> Key: SPARK-31448
> URL: https://issues.apache.org/jira/browse/SPARK-31448
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Abhishek Dixit
>Priority: Major
>
> There is a difference in the default storage level *MEMORY_AND_DISK* between 
> PySpark and Scala.
> *Scala*: StorageLevel(true, true, false, true)
> *Pyspark:* StorageLevel(True, True, False, False)
>  
> *Problem Description:* 
> Calling *df.cache()* on a PySpark dataframe directly invokes the Scala method 
> cache(), so the storage level used is StorageLevel(true, true, false, true).
> But calling *df.persist()* on a PySpark dataframe sets 
> newStorageLevel=StorageLevel(true, true, false, false) inside PySpark and 
> then invokes the Scala function persist(newStorageLevel).
> *Possible Fix:*
> Invoke the PySpark function persist inside the PySpark function cache instead of 
> calling the Scala function directly.
> I can raise a PR for this fix if someone can confirm that this is a bug and 
> that the possible fix is the correct approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31572) Improve task logs at executor side

2020-04-27 Thread wuyi (Jira)
wuyi created SPARK-31572:


 Summary: Improve task logs at executor side
 Key: SPARK-31572
 URL: https://issues.apache.org/jira/browse/SPARK-31572
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.0.0
Reporter: wuyi


In some places, task names have different formats on the driver and the executor, 
which makes it harder for users to debug task-level slowness. We can also add more 
logs to help with debugging.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24194) HadoopFsRelation cannot overwrite a path that is also being read from

2020-04-27 Thread philipse (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093055#comment-17093055
 ] 

philipse commented on SPARK-24194:
--

Hi,

Is the issue closed? Can I try it in a production environment?

Thanks

> HadoopFsRelation cannot overwrite a path that is also being read from
> -
>
> Key: SPARK-24194
> URL: https://issues.apache.org/jira/browse/SPARK-24194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: spark master
>Reporter: yangz
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When running
> {code:java}
> INSERT OVERWRITE TABLE territory_count_compare select * from 
> territory_count_compare where shop_count!=real_shop_count
> {code}
> and territory_count_compare is a Parquet table, there is an error: 
> Cannot overwrite a path that is also being read from.
>  
> The file MetastoreDataSourceSuite.scala has a test case
> {code:java}
> table(tableName).write.mode(SaveMode.Overwrite).insertInto(tableName)
> {code}
> but when territory_count_compare is a plain Hive table, there is 
> no error. 
> So I think the reason is that, when doing an insert overwrite into a HadoopFsRelation 
> with a static partition, the partition in the output is deleted first, whereas it 
> should only be deleted when the job is committed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31571) don't use stop(paste to build R errors

2020-04-27 Thread Michael Chirico (Jira)
Michael Chirico created SPARK-31571:
---

 Summary: don't use stop(paste to build R errors
 Key: SPARK-31571
 URL: https://issues.apache.org/jira/browse/SPARK-31571
 Project: Spark
  Issue Type: Documentation
  Components: R
Affects Versions: 2.4.5
Reporter: Michael Chirico


I notice, for example, this:

stop(paste0("Arrow optimization does not support 'dapplyCollect' yet. Please 
disable ",
"Arrow optimization or use 'collect' and 'dapply' APIs instead."))

paste0 is totally unnecessary here -- stop itself takes ... (varargs) and 
concatenates them with no separator, i.e., the above is equivalent to:

stop("Arrow optimization does not support 'dapplyCollect' yet. Please disable ",
  "Arrow optimization or use 'collect' and 'dapply' APIs instead.")

More generally, this hurts portability: it makes user-contributed translations 
more difficult, because the standard tooling for extracting messages (namely 
tools::update_pkg_po('.')) would fail to capture these strings as candidates 
for translation.

In fact, it's preferable IMO to keep the entire stop() message as a single 
string -- I've found that breaking the string across multiple lines makes 
translation across languages with different grammars quite difficult. I 
understand there are lint style constraints, however, so I wouldn't press on 
that for now.

If formatting is needed, I recommend using stop(gettextf(...)) instead.
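
A small sketch of the suggested styles (the wrapper function is hypothetical, not Spark code):

{code}
raise_unsupported <- function(api = "dapplyCollect") {
  # stop() pastes its ... arguments together with no separator, so a paste0()
  # wrapper adds nothing:
  #   stop("Arrow optimization does not support '", api, "' yet. ",
  #        "Please disable Arrow optimization or use 'collect' and 'dapply' APIs instead.")

  # Keeping the whole message as one gettextf() template leaves it as a single
  # translatable string that tools::update_pkg_po() can extract:
  stop(gettextf("Arrow optimization does not support '%s' yet. Please disable Arrow optimization or use 'collect' and 'dapply' APIs instead.",
                api))
}

# Demonstration only: catch and print the generated message.
tryCatch(raise_unsupported(), error = function(e) conditionMessage(e))
{code}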



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31485) Barrier stage can hang if only partial tasks launched

2020-04-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31485:
---

Assignee: wuyi

> Barrier stage can hang if only partial tasks launched
> -
>
> Key: SPARK-31485
> URL: https://issues.apache.org/jira/browse/SPARK-31485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> The issue can be reproduced by the following test:
>  
> {code:java}
> initLocalClusterSparkContext(2)
> val rdd0 = sc.parallelize(Seq(0, 1, 2, 3), 2)
> val dep = new OneToOneDependency[Int](rdd0)
> val rdd = new MyRDD(sc, 2, List(dep), 
> Seq(Seq("executor_h_0"),Seq("executor_h_0")))
> rdd.barrier().mapPartitions { iter =>
>   BarrierTaskContext.get().barrier()
>   iter
> }.collect()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31485) Barrier stage can hang if only partial tasks launched

2020-04-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31485.
-
Fix Version/s: 2.4.6
   Resolution: Fixed

Issue resolved by pull request 28357
[https://github.com/apache/spark/pull/28357]

> Barrier stage can hang if only partial tasks launched
> -
>
> Key: SPARK-31485
> URL: https://issues.apache.org/jira/browse/SPARK-31485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 2.4.6
>
>
> The issue can be reproduced by the following test:
>  
> {code:java}
> initLocalClusterSparkContext(2)
> val rdd0 = sc.parallelize(Seq(0, 1, 2, 3), 2)
> val dep = new OneToOneDependency[Int](rdd0)
> val rdd = new MyRDD(sc, 2, List(dep), 
> Seq(Seq("executor_h_0"),Seq("executor_h_0")))
> rdd.barrier().mapPartitions { iter =>
>   BarrierTaskContext.get().barrier()
>   iter
> }.collect()
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org