[jira] [Assigned] (SPARK-33108) Remove sbt-dependency-graph SBT plugin

2020-10-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33108:
-

Assignee: Dongjoon Hyun

> Remove sbt-dependency-graph SBT plugin
> --
>
> Key: SPARK-33108
> URL: https://issues.apache.org/jira/browse/SPARK-33108
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33108) Remove sbt-dependency-graph SBT plugin

2020-10-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33108.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29997
[https://github.com/apache/spark/pull/29997]

> Remove sbt-dependency-graph SBT plugin
> --
>
> Key: SPARK-33108
> URL: https://issues.apache.org/jira/browse/SPARK-33108
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33104) Fix `YarnClusterSuite.yarn-cluster should respect conf overrides in SparkHadoopUtil`

2020-10-09 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211563#comment-17211563
 ] 

Yang Jie commented on SPARK-33104:
--

Is this problem always reproducible? `mvn test` passes, so we may need to add some logging to 
determine which `core-site.xml` file is actually being loaded.
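
As a rough sketch of the kind of logging suggested above (the object name is made up and this 
is not the suite's actual code), one could print every `core-site.xml` visible to the class 
loader to see which one Hadoop's `Configuration` would pick up:

{code:scala}
// Hypothetical debugging helper (not part of the suite): list every
// core-site.xml visible to the current class loader, to see which one
// Hadoop's Configuration would actually load.
import scala.collection.JavaConverters._

object CoreSiteDebug {
  def main(args: Array[String]): Unit = {
    val urls =
      getClass.getClassLoader.getResources("core-site.xml").asScala.toList
    if (urls.isEmpty) {
      println("No core-site.xml found on the classpath")
    } else {
      urls.foreach(url => println(s"core-site.xml found at: $url"))
    }
  }
}
{code}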

> Fix `YarnClusterSuite.yarn-cluster should respect conf overrides in 
> SparkHadoopUtil`
> 
>
> Key: SPARK-33104
> URL: https://issues.apache.org/jira/browse/SPARK-33104
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/1377/testReport/org.apache.spark.deploy.yarn/YarnClusterSuite/yarn_cluster_should_respect_conf_overrides_in_SparkHadoopUtil__SPARK_16414__SPARK_23630_/
> {code}
> 20/10/09 05:18:13.211 ContainersLauncher #0 WARN DefaultContainerExecutor: 
> Exit code from container container_1602245728426_0006_02_01 is : 15
> 20/10/09 05:18:13.211 ContainersLauncher #0 WARN DefaultContainerExecutor: 
> Exception from container-launch with container ID: 
> container_1602245728426_0006_02_01 and exit code: 15
> ExitCodeException exitCode=15: 
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
>   at org.apache.hadoop.util.Shell.run(Shell.java:482)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/09 05:18:13.211 ContainersLauncher #0 WARN ContainerLaunch: Container 
> exited with a non-zero exit code 15
> 20/10/09 05:18:13.237 AsyncDispatcher event handler WARN NMAuditLogger: 
> USER=jenkins  OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1602245728426_0006
> CONTAINERID=container_1602245728426_0006_02_01
> 20/10/09 05:18:13.244 Socket Reader #1 for port 37112 INFO Server: Auth 
> successful for appattempt_1602245728426_0006_02 (auth:SIMPLE)
> 20/10/09 05:18:13.326 IPC Parameter Sending Thread #0 DEBUG Client: IPC 
> Client (1123559518) connection to 
> amp-jenkins-worker-04.amp/192.168.10.24:43090 from jenkins sending #37
> 20/10/09 05:18:13.327 IPC Client (1123559518) connection to 
> amp-jenkins-worker-04.amp/192.168.10.24:43090 from jenkins DEBUG Client: IPC 
> Client (1123559518) connection to 
> amp-jenkins-worker-04.amp/192.168.10.24:43090 from jenkins got value #37
> 20/10/09 05:18:13.328 main DEBUG ProtobufRpcEngine: Call: 
> getApplicationReport took 2ms
> 20/10/09 05:18:13.328 main INFO Client: Application report for 
> application_1602245728426_0006 (state: FINISHED)
> 20/10/09 05:18:13.328 main DEBUG Client: 
>client token: N/A
>diagnostics: User class threw exception: 
> org.scalatest.exceptions.TestFailedException: null was not equal to 
> "testvalue"
>   at 
> org.scalatest.matchers.MatchersHelper$.indicateFailure(MatchersHelper.scala:344)
>   at 
> org.scalatest.matchers.should.Matchers$ShouldMethodHelperClass.shouldMatcher(Matchers.scala:6778)
>   at 
> org.scalatest.matchers.should.Matchers$AnyShouldWrapper.should(Matchers.scala:6822)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf$.$anonfun$main$2(YarnClusterSuite.scala:383)
>   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf$.main(YarnClusterSuite.scala:382)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf.main(YarnClusterSuite.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMe

[jira] [Created] (SPARK-33109) Upgrade to SBT 1.4 and support `dependencyTree` back

2020-10-09 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33109:
-

 Summary: Upgrade to SBT 1.4 and support `dependencyTree` back
 Key: SPARK-33109
 URL: https://issues.apache.org/jira/browse/SPARK-33109
 Project: Spark
  Issue Type: Task
  Components: Build
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33094) ORC format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33094:
-
Fix Version/s: 2.4.8

> ORC format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> --
>
> Key: SPARK-33094
> URL: https://issues.apache.org/jira/browse/SPARK-33094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("orc").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.
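
For illustration, a minimal sketch of the expected behavior, assuming an active SparkSession 
named `spark`; the option key and path below are made-up examples, not from the report:

{code:scala}
// Minimal sketch, assuming an active SparkSession `spark`; the option key and
// path are made-up examples. Hadoop settings passed as data source options are
// expected to be merged into the Hadoop Configuration used to resolve and open
// the files below.
val hadoopOpts = Map(
  "fs.defaultFS" -> "hdfs://namenode:8020"   // any Hadoop fs.* key, for example
)

val df = spark.read
  .format("orc")
  .options(hadoopOpts)        // without the fix, these never reach the
  .load("/data/events.orc")   // underlying HDFS file system
{code}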



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33079) Replace the existing Maven job for Scala 2.13 in Github Actions with SBT job

2020-10-09 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-33079:
---
Summary: Replace the existing Maven job for Scala 2.13 in Github Actions 
with SBT job  (was: Add Scala 2.13 build test in GitHub Action for SBT)

> Replace the existing Maven job for Scala 2.13 in Github Actions with SBT job
> 
>
> Key: SPARK-33079
> URL: https://issues.apache.org/jira/browse/SPARK-33079
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> SPARK-32926 added a build test to GitHub Actions for Scala 2.13, but only with 
> Maven.
> As SPARK-32873 reported, some compilation errors happen only with SBT, so I 
> think we need to add another build test to GitHub Actions for SBT.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32082) Project Zen: Improving Python usability

2020-10-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211560#comment-17211560
 ] 

Hyukjin Kwon commented on SPARK-32082:
--

Nope, these are all the JIRAs linked here. I still need to collect feedback and 
investigate with a proper design for that. Feel free to send an email (cc'ing 
me) or file a JIRA if you have a concrete idea.

> Project Zen: Improving Python usability
> ---
>
> Key: SPARK-32082
> URL: https://issues.apache.org/jira/browse/SPARK-32082
> Project: Spark
>  Issue Type: Epic
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> The importance of Python and PySpark has grown radically in the last few 
> years. The number of PySpark downloads reached [more than 1.3 million _every 
> week_|https://pypistats.org/packages/pyspark] when we count them _only_ in 
> PyPI. Nevertheless, PySpark is still not very Pythonic: it exposes many JVM 
> error messages, for example, and the API documentation is poorly written.
> This epic ticket aims to improve the usability of PySpark and make it more 
> Pythonic. To be more explicit, this JIRA targets the four bullet points below. 
> Each includes examples:
>  * Being Pythonic
>  ** Pandas UDF enhancements and type hints
>  ** Avoid dynamic function definitions, for example, in {{functions.py}}, 
> which IDEs are unable to detect.
>  * Better and easier usability in PySpark
>  ** User-facing error messages and warnings
>  ** Documentation
>  ** User guide
>  ** Better examples and API documentation, e.g. 
> [Koalas|https://koalas.readthedocs.io/en/latest/] and 
> [pandas|https://pandas.pydata.org/docs/]
>  * Better interoperability with other Python libraries
>  ** Visualization and plotting
>  ** Potentially better interface by leveraging Arrow
>  ** Compatibility with other libraries such as NumPy universal functions or 
> pandas possibly by leveraging Koalas
>  * PyPI Installation
>  ** PySpark with Hadoop 3 support on PyPI
>  ** Better error handling



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32082) Project Zen: Improving Python usability

2020-10-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211560#comment-17211560
 ] 

Hyukjin Kwon edited comment on SPARK-32082 at 10/10/20, 4:55 AM:
-

Nope, these are all the JIRAs linked here. I still need to collect feedback and 
investigate with a proper design for that. Feel free to send an email to the dev 
mailing list (cc'ing me) or file a JIRA if you have a concrete idea.


was (Author: hyukjin.kwon):
Nope, these are all the JIRAs linked here. I still need to collect feedback and 
investigate with a proper design for that. Feel free to send an email (cc'ing 
me) or file a JIRA if you have a concrete idea.

> Project Zen: Improving Python usability
> ---
>
> Key: SPARK-32082
> URL: https://issues.apache.org/jira/browse/SPARK-32082
> Project: Spark
>  Issue Type: Epic
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> The importance of Python and PySpark has grown radically in the last few 
> years. The number of PySpark downloads reached [more than 1.3 million _every 
> week_|https://pypistats.org/packages/pyspark] when we count them _only_ in 
> PyPI. Nevertheless, PySpark is still not very Pythonic: it exposes many JVM 
> error messages, for example, and the API documentation is poorly written.
> This epic ticket aims to improve the usability of PySpark and make it more 
> Pythonic. To be more explicit, this JIRA targets the four bullet points below. 
> Each includes examples:
>  * Being Pythonic
>  ** Pandas UDF enhancements and type hints
>  ** Avoid dynamic function definitions, for example, in {{functions.py}}, 
> which IDEs are unable to detect.
>  * Better and easier usability in PySpark
>  ** User-facing error messages and warnings
>  ** Documentation
>  ** User guide
>  ** Better examples and API documentation, e.g. 
> [Koalas|https://koalas.readthedocs.io/en/latest/] and 
> [pandas|https://pandas.pydata.org/docs/]
>  * Better interoperability with other Python libraries
>  ** Visualization and plotting
>  ** Potentially better interface by leveraging Arrow
>  ** Compatibility with other libraries such as NumPy universal functions or 
> pandas possibly by leveraging Koalas
>  * PyPI Installation
>  ** PySpark with Hadoop 3 support on PyPI
>  ** Better error handling



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33102) Use stringToSeq on SQL list typed parameters

2020-10-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33102:


Assignee: Gabor Somogyi

> Use stringToSeq on SQL list typed parameters
> 
>
> Key: SPARK-33102
> URL: https://issues.apache.org/jira/browse/SPARK-33102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33102) Use stringToSeq on SQL list typed parameters

2020-10-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33102.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29989
[https://github.com/apache/spark/pull/29989]

> Use stringToSeq on SQL list typed parameters
> 
>
> Key: SPARK-33102
> URL: https://issues.apache.org/jira/browse/SPARK-33102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Minor
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33105) Broken installation of source packages on AppVeyor

2020-10-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33105.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/29991

> Broken installation of source packages on AppVeyor
> --
>
> Key: SPARK-33105
> URL: https://issues.apache.org/jira/browse/SPARK-33105
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, R
>Affects Versions: 3.1.0
> Environment: *strong text*
>Reporter: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> It looks like the AppVeyor configuration is broken, which leads to failures when 
> installing source packages (this became a problem when {{rlang}} was updated 
> from 0.4.7 to 0.4.8, with the latter available only as a source package).
> {code}
> [00:01:48] trying URL
> 'https://cloud.r-project.org/src/contrib/rlang_0.4.8.tar.gz'
> [00:01:48] Content type 'application/x-gzip' length 847517 bytes (827 KB)
> [00:01:48] ==
> [00:01:48] downloaded 827 KB
> [00:01:48] 
> [00:01:48] Warning in strptime(xx, f, tz = tz) :
> [00:01:48]   unable to identify current timezone 'C':
> [00:01:48] please set environment variable 'TZ'
> [00:01:49] * installing *source* package 'rlang' ...
> [00:01:49] ** package 'rlang' successfully unpacked and MD5 sums checked
> [00:01:49] ** using staged installation
> [00:01:49] ** libs
> [00:01:49] 
> [00:01:49] *** arch - i386
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c capture.c -o capture.o
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c export.c -o export.o
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c internal.c -o internal.o
> [00:01:50] In file included from ./lib/rlang.h:74,
> [00:01:50]  from internal/arg.c:1,
> [00:01:50]  from internal.c:1:
> [00:01:50] internal/eval-tidy.c: In function 'rlang_tilde_eval':
> [00:01:50] ./lib/env.h:33:10: warning: 'top' may be used uninitialized
> in this function [-Wmaybe-uninitialized]
> [00:01:50]return ENCLOS(env);
> [00:01:50]   ^~~
> [00:01:50] In file included from internal.c:8:
> [00:01:50] internal/eval-tidy.c:406:9: note: 'top' was declared here
> [00:01:50]sexp* top;
> [00:01:50]  ^~~
> [00:01:50] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c lib.c -o lib.o
> [00:01:51] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c version.c -o version.o
> [00:01:52] C:/Rtools40/mingw64/bin/gcc -shared -s -static-libgcc -o
> rlang.dll tmp.def capture.o export.o internal.o lib.o version.o
> -LC:/R/bin/i386 -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> cannot find -lR
> [00:01:52] collect2.exe: error: ld returned 1 exit status
> [00:01:52] no DLL was created
> [00:01:52] ERROR: compilation failed for package 'rlang'
> [00:01:52] * removing 'C:/RLibrary/rlang'
> [00:01:52] 
> [00:01:52] The downloaded source packages are in
> [00:01:52]
> 'C:\Users\appveyor\AppData\Local\Temp\1\Rtmp8qrryA\downloaded_packages'
> [00:01:52] Warning message:
> [00:01:52] In install.packages(c("knitr", "rmarkdown", "testthat",
> "e1071",  :
> [00:01:52]   installation of package 'rlang' had non-zero exit status 
> {code}
> This leads to failures to install {{devtools}} and generate Rd files and, as 
> a result, CRAN check failure.
> There are some discrepancies in the 
> {{dev/appveyor-install-dependencies.ps1}}, but the direct source of this 
> issue seems to be {{$env:BINPREF}}, which forces usage of 64 bit mingw, even 
> if packages are compiled for 32 bit. 
> Modifying the variable to include current architecture:
> {code}
> $env:BINPREF=$RtoolsDrive + '/Rtools40/mingw$(WIN)/bin/'
> {code}
> (as proposed [here|https://stackoverflow.com/a/44035904] by R Yoda) looks 
> like a valid fix, though we might want to clean remaining issues as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (SPARK-32907) adaptively blockify instances

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211550#comment-17211550
 ] 

Apache Spark commented on SPARK-32907:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/29998

> adaptively blockify instances
> -
>
> Key: SPARK-32907
> URL: https://issues.apache.org/jira/browse/SPARK-32907
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: blockify_svc_perf_20201010.xlsx
>
>
> According to the performance test in 
> https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is 
> mainly related to the nnz (number of non-zero values) of each block, so it is 
> reasonable to control the block size.
>  
> I had some offline discussion with [~weichenxu123], and we think the following 
> changes are worthwhile:
> 1. infer an appropriate blockSize (in MB) based on numFeatures and nnz by 
> default;
> 2. implementations should use a relatively small memory footprint when 
> processing one block and should not use a large pre-allocated buffer, so we 
> need to revert GMM;
> 3. use the new blockify strategy in LinearSVC/LoR/LiR/AFT.
>  
>  
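
A rough sketch of the idea in point 1 above (the helper name, byte estimates, and the 1 MB 
budget are illustrative assumptions, not the actual implementation in the PR): choose how 
many rows go into one block so that the block's sparse payload stays near a fixed memory 
budget, based on the average nnz per row.

{code:scala}
// Illustrative sketch only: estimate rows per block from the average number of
// non-zeros (nnz) per row and a target memory budget per block.
def inferRowsPerBlock(avgNnzPerRow: Double, budgetMB: Double = 1.0): Int = {
  val bytesPerNonZero = 12.0                      // ~8-byte value + 4-byte index
  val budgetBytes = budgetMB * 1024 * 1024
  val rows = budgetBytes / math.max(1.0, avgNnzPerRow * bytesPerNonZero)
  math.max(1, rows.toInt)                         // at least one row per block
}

// Example: very sparse rows (~20 non-zeros each) allow large blocks,
// while dense rows (~10,000 non-zeros each) force much smaller ones.
println(inferRowsPerBlock(avgNnzPerRow = 20))     // ~4369 rows per block
println(inferRowsPerBlock(avgNnzPerRow = 10000))  // ~8 rows per block
{code}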



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32907) adaptively blockify instances

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211549#comment-17211549
 ] 

Apache Spark commented on SPARK-32907:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/29998

> adaptively blockify instances
> -
>
> Key: SPARK-32907
> URL: https://issues.apache.org/jira/browse/SPARK-32907
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: blockify_svc_perf_20201010.xlsx
>
>
> According to the performance test in 
> https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is 
> mainly related to the nnz (number of non-zero values) of each block, so it is 
> reasonable to control the block size.
>  
> I had some offline discussion with [~weichenxu123], and we think the following 
> changes are worthwhile:
> 1. infer an appropriate blockSize (in MB) based on numFeatures and nnz by 
> default;
> 2. implementations should use a relatively small memory footprint when 
> processing one block and should not use a large pre-allocated buffer, so we 
> need to revert GMM;
> 3. use the new blockify strategy in LinearSVC/LoR/LiR/AFT.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33108) Remove sbt-dependency-graph SBT plugin

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211547#comment-17211547
 ] 

Apache Spark commented on SPARK-33108:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/29997

> Remove sbt-dependency-graph SBT plugin
> --
>
> Key: SPARK-33108
> URL: https://issues.apache.org/jira/browse/SPARK-33108
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33108) Remove sbt-dependency-graph SBT plugin

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33108:


Assignee: Apache Spark

> Remove sbt-dependency-graph SBT plugin
> --
>
> Key: SPARK-33108
> URL: https://issues.apache.org/jira/browse/SPARK-33108
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33108) Remove sbt-dependency-graph SBT plugin

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33108:


Assignee: (was: Apache Spark)

> Remove sbt-dependency-graph SBT plugin
> --
>
> Key: SPARK-33108
> URL: https://issues.apache.org/jira/browse/SPARK-33108
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32907) adaptively blockify instances

2020-10-09 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-32907:
-
Attachment: blockify_svc_perf_20201010.xlsx

> adaptively blockify instances
> -
>
> Key: SPARK-32907
> URL: https://issues.apache.org/jira/browse/SPARK-32907
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Major
> Attachments: blockify_svc_perf_20201010.xlsx
>
>
> According to the performance test in 
> https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is 
> mainly related to the nnz (number of non-zero values) of each block, so it is 
> reasonable to control the block size.
>  
> I had some offline discussion with [~weichenxu123], and we think the following 
> changes are worthwhile:
> 1. infer an appropriate blockSize (in MB) based on numFeatures and nnz by 
> default;
> 2. implementations should use a relatively small memory footprint when 
> processing one block and should not use a large pre-allocated buffer, so we 
> need to revert GMM;
> 3. use the new blockify strategy in LinearSVC/LoR/LiR/AFT.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33108) Remove sbt-dependency-graph SBT plugin

2020-10-09 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33108:
-

 Summary: Remove sbt-dependency-graph SBT plugin
 Key: SPARK-33108
 URL: https://issues.apache.org/jira/browse/SPARK-33108
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33107) Remove hive-2.3 workaround code

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33107:


Assignee: (was: Apache Spark)

> Remove hive-2.3 workaround code
> ---
>
> Key: SPARK-33107
> URL: https://issues.apache.org/jira/browse/SPARK-33107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> We can make the code clearer and more readable after SPARK-33082.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33107) Remove hive-2.3 workaround code

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33107:


Assignee: Apache Spark

> Remove hive-2.3 workaround code
> ---
>
> Key: SPARK-33107
> URL: https://issues.apache.org/jira/browse/SPARK-33107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> We can make the code clearer and more readable after SPARK-33082.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33107) Remove hive-2.3 workaround code

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211526#comment-17211526
 ] 

Apache Spark commented on SPARK-33107:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/29996

> Remove hive-2.3 workaround code
> ---
>
> Key: SPARK-33107
> URL: https://issues.apache.org/jira/browse/SPARK-33107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> We can make the code clearer and more readable after SPARK-33082.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33107) Remove hive-2.3 workaround code

2020-10-09 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33107:
---

 Summary: Remove hive-2.3 workaround code
 Key: SPARK-33107
 URL: https://issues.apache.org/jira/browse/SPARK-33107
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


We can make the code clearer and more readable after SPARK-33082.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33045) Implement built-in LIKE ANY and LIKE ALL UDF

2020-10-09 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211522#comment-17211522
 ] 

jiaan.geng commented on SPARK-33045:


I'm working on this.

> Implement built-in LIKE ANY and LIKE ALL UDF
> 
>
> Key: SPARK-33045
> URL: https://issues.apache.org/jira/browse/SPARK-33045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> We already support the LIKE ANY / SOME / ALL syntax, but it throws a 
> {{StackOverflowError}} if there are many elements (more than 14378 elements). 
> We should implement built-in LIKE ANY and LIKE ALL UDFs to fix this issue.
> {noformat}
> java.lang.StackOverflowError
>   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
>   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>   at 
> scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:184)
>   at 
> scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:47)
>   at 
> scala.collection.generic.GenericCompanion.apply(GenericCompanion.scala:53)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.children(Expression.scala:549)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:175)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1(TreeNode.scala:175)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreach$1$adapted(TreeNode.scala:175)
>   at scala.collection.immutable.List.foreach(List.scala:392)
> {noformat}
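
For illustration only (the table name and pattern count below are assumptions, and an active 
SparkSession `spark` is assumed), a query like the following is the kind of input the report 
describes: a very large LIKE ANY pattern list that is currently expanded into a deeply nested 
tree of binary expressions.

{code:scala}
// Illustration only: build a LIKE ANY query with a huge pattern list.
val patterns = (1 to 20000).map(i => s"'%value$i%'").mkString(", ")
val query = s"SELECT * FROM events WHERE name LIKE ANY ($patterns)"

// Reported to throw StackOverflowError today; a built-in LIKE ANY / LIKE ALL
// expression would evaluate the pattern list iteratively instead.
spark.sql(query)
{code}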



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33080) Replace compiler reporter with more robust and maintainable solution

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33080:


Assignee: Apache Spark

> Replace compiler reporter with more robust and maintainable solution
> 
>
> Key: SPARK-33080
> URL: https://issues.apache.org/jira/browse/SPARK-33080
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Assignee: Apache Spark
>Priority: Minor
>
> The existing solution for failing the build on any warning except deprecation 
> warnings 
> ([https://github.com/apache/spark/blob/v3.0.1/project/SparkBuild.scala#L285]) 
> is not very maintainable with respect to upgrading the build to the latest sbt. 
> When upgrading to sbt 1.4.0, this snippet breaks the build import entirely.
> Implement a new solution that switches on the compiler version: the silencer 
> compiler plugin for Scala prior to 2.13.2 and the built-in warning 
> configuration since Scala 2.13.2.
> Depends on the changes for SPARK-21708.
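
A hedged sbt sketch of the approach described above, not the actual Spark build change: 
select the warning mechanism based on the Scala compiler version, using `-Wconf` on 2.13 and 
leaving older versions to the silencer plugin (added separately via `addCompilerPlugin`).

{code:scala}
// Hypothetical sketch (not the actual SparkBuild.scala change): choose warning
// flags per compiler version.
scalacOptions ++= {
  CrossVersion.partialVersion(scalaVersion.value) match {
    case Some((2, minor)) if minor >= 13 =>
      // Scala 2.13.2+ ships configurable warnings: keep deprecations as
      // (verbose) warnings, escalate every other warning to an error.
      Seq("-Wconf:cat=deprecation:wv,any:e")
    case _ =>
      // Older Scala versions: filter warnings with the silencer compiler
      // plugin instead (added elsewhere via addCompilerPlugin).
      Seq.empty[String]
  }
}
{code}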



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33080) Replace compiler reporter with more robust and maintainable solution

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33080:


Assignee: (was: Apache Spark)

> Replace compiler reporter with more robust and maintainable solution
> 
>
> Key: SPARK-33080
> URL: https://issues.apache.org/jira/browse/SPARK-33080
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
>
> The existing solution for failing the build on any warning except deprecation 
> warnings 
> ([https://github.com/apache/spark/blob/v3.0.1/project/SparkBuild.scala#L285]) 
> is not very maintainable with respect to upgrading the build to the latest sbt. 
> When upgrading to sbt 1.4.0, this snippet breaks the build import entirely.
> Implement a new solution that switches on the compiler version: the silencer 
> compiler plugin for Scala prior to 2.13.2 and the built-in warning 
> configuration since Scala 2.13.2.
> Depends on the changes for SPARK-21708.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33080) Replace compiler reporter with more robust and maintainable solution

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211494#comment-17211494
 ] 

Apache Spark commented on SPARK-33080:
--

User 'gemelen' has created a pull request for this issue:
https://github.com/apache/spark/pull/29995

> Replace compiler reporter with more robust and maintainable solution
> 
>
> Key: SPARK-33080
> URL: https://issues.apache.org/jira/browse/SPARK-33080
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
>
> The existing solution for failing the build on any warning except deprecation 
> warnings 
> ([https://github.com/apache/spark/blob/v3.0.1/project/SparkBuild.scala#L285]) 
> is not very maintainable with respect to upgrading the build to the latest sbt. 
> When upgrading to sbt 1.4.0, this snippet breaks the build import entirely.
> Implement a new solution that switches on the compiler version: the silencer 
> compiler plugin for Scala prior to 2.13.2 and the built-in warning 
> configuration since Scala 2.13.2.
> Depends on the changes for SPARK-21708.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33106) Fix sbt resolvers clash

2020-10-09 Thread Denis Pyshev (Jira)
Denis Pyshev created SPARK-33106:


 Summary: Fix sbt resolvers clash
 Key: SPARK-33106
 URL: https://issues.apache.org/jira/browse/SPARK-33106
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.1.0
Reporter: Denis Pyshev


During the sbt upgrade from 0.13 to 1.x, the exact resolvers list was used as is.

This leads to a name clash between local resolvers, which is observed as a 
warning from sbt:


{code:java}
[warn] Multiple resolvers having different access mechanism configured with 
same name 'local'. To avoid conflict, Remove duplicate project resolvers 
(`resolvers`) or rename publishing resolver (`publishTo`).
{code}
This needs to be fixed to avoid potential errors and reduce log noise.
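
A minimal sbt sketch of one way to resolve the clash (the resolver name and target directory 
are assumptions, not the actual fix): rename the publishing resolver so it no longer collides 
with sbt's default `local` resolver.

{code:scala}
// Hypothetical build.sbt snippet: publish to a repository whose name is not
// "local", so it cannot clash with sbt's built-in local Ivy resolver.
publishTo := Some(
  "local-publish" at "file://" + (baseDirectory.value / "local-publish-repo").getAbsolutePath
)
{code}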



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33104) Fix `YarnClusterSuite.yarn-cluster should respect conf overrides in SparkHadoopUtil`

2020-10-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33104:
--
Summary: Fix `YarnClusterSuite.yarn-cluster should respect conf overrides 
in SparkHadoopUtil`  (was: Fix YarnClusterSuite.yarn-cluster should respect 
conf overrides in SparkHadoopUtil)

> Fix `YarnClusterSuite.yarn-cluster should respect conf overrides in 
> SparkHadoopUtil`
> 
>
> Key: SPARK-33104
> URL: https://issues.apache.org/jira/browse/SPARK-33104
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/1377/testReport/org.apache.spark.deploy.yarn/YarnClusterSuite/yarn_cluster_should_respect_conf_overrides_in_SparkHadoopUtil__SPARK_16414__SPARK_23630_/
> {code}
> 20/10/09 05:18:13.211 ContainersLauncher #0 WARN DefaultContainerExecutor: 
> Exit code from container container_1602245728426_0006_02_01 is : 15
> 20/10/09 05:18:13.211 ContainersLauncher #0 WARN DefaultContainerExecutor: 
> Exception from container-launch with container ID: 
> container_1602245728426_0006_02_01 and exit code: 15
> ExitCodeException exitCode=15: 
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
>   at org.apache.hadoop.util.Shell.run(Shell.java:482)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/09 05:18:13.211 ContainersLauncher #0 WARN ContainerLaunch: Container 
> exited with a non-zero exit code 15
> 20/10/09 05:18:13.237 AsyncDispatcher event handler WARN NMAuditLogger: 
> USER=jenkins  OPERATION=Container Finished - Failed   TARGET=ContainerImpl
> RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE  
>   APPID=application_1602245728426_0006
> CONTAINERID=container_1602245728426_0006_02_01
> 20/10/09 05:18:13.244 Socket Reader #1 for port 37112 INFO Server: Auth 
> successful for appattempt_1602245728426_0006_02 (auth:SIMPLE)
> 20/10/09 05:18:13.326 IPC Parameter Sending Thread #0 DEBUG Client: IPC 
> Client (1123559518) connection to 
> amp-jenkins-worker-04.amp/192.168.10.24:43090 from jenkins sending #37
> 20/10/09 05:18:13.327 IPC Client (1123559518) connection to 
> amp-jenkins-worker-04.amp/192.168.10.24:43090 from jenkins DEBUG Client: IPC 
> Client (1123559518) connection to 
> amp-jenkins-worker-04.amp/192.168.10.24:43090 from jenkins got value #37
> 20/10/09 05:18:13.328 main DEBUG ProtobufRpcEngine: Call: 
> getApplicationReport took 2ms
> 20/10/09 05:18:13.328 main INFO Client: Application report for 
> application_1602245728426_0006 (state: FINISHED)
> 20/10/09 05:18:13.328 main DEBUG Client: 
>client token: N/A
>diagnostics: User class threw exception: 
> org.scalatest.exceptions.TestFailedException: null was not equal to 
> "testvalue"
>   at 
> org.scalatest.matchers.MatchersHelper$.indicateFailure(MatchersHelper.scala:344)
>   at 
> org.scalatest.matchers.should.Matchers$ShouldMethodHelperClass.shouldMatcher(Matchers.scala:6778)
>   at 
> org.scalatest.matchers.should.Matchers$AnyShouldWrapper.should(Matchers.scala:6822)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf$.$anonfun$main$2(YarnClusterSuite.scala:383)
>   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf$.main(YarnClusterSuite.scala:382)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf.main(YarnClusterSuite.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.D

[jira] [Commented] (SPARK-32082) Project Zen: Improving Python usability

2020-10-09 Thread Andrew Malone Melo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211474#comment-17211474
 ] 

Andrew Malone Melo commented on SPARK-32082:


> Potentially better interface by leveraging Arrow

Is there an open Jira for this?

> Project Zen: Improving Python usability
> ---
>
> Key: SPARK-32082
> URL: https://issues.apache.org/jira/browse/SPARK-32082
> Project: Spark
>  Issue Type: Epic
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> The importance of Python and PySpark has grown radically in the last few 
> years. The number of PySpark downloads reached [more than 1.3 million _every 
> week_|https://pypistats.org/packages/pyspark] when we count them _only_ in 
> PyPI. Nevertheless, PySpark is still not very Pythonic: it exposes many JVM 
> error messages, for example, and the API documentation is poorly written.
> This epic ticket aims to improve the usability of PySpark and make it more 
> Pythonic. To be more explicit, this JIRA targets the four bullet points below. 
> Each includes examples:
>  * Being Pythonic
>  ** Pandas UDF enhancements and type hints
>  ** Avoid dynamic function definitions, for example, in {{functions.py}}, 
> which IDEs are unable to detect.
>  * Better and easier usability in PySpark
>  ** User-facing error messages and warnings
>  ** Documentation
>  ** User guide
>  ** Better examples and API documentation, e.g. 
> [Koalas|https://koalas.readthedocs.io/en/latest/] and 
> [pandas|https://pandas.pydata.org/docs/]
>  * Better interoperability with other Python libraries
>  ** Visualization and plotting
>  ** Potentially better interface by leveraging Arrow
>  ** Compatibility with other libraries such as NumPy universal functions or 
> pandas possibly by leveraging Koalas
>  * PyPI Installation
>  ** PySpark with Hadoop 3 support on PyPI
>  ** Better error handling



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33105) Broken installation of source packages on AppVeyor

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33105:


Assignee: (was: Apache Spark)

> Broken installation of source packages on AppVeyor
> --
>
> Key: SPARK-33105
> URL: https://issues.apache.org/jira/browse/SPARK-33105
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, R
>Affects Versions: 3.1.0
> Environment: *strong text*
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> It looks like the AppVeyor configuration is broken, which leads to failures when 
> installing source packages (this became a problem when {{rlang}} was updated 
> from 0.4.7 to 0.4.8, with the latter available only as a source package).
> {code}
> [00:01:48] trying URL
> 'https://cloud.r-project.org/src/contrib/rlang_0.4.8.tar.gz'
> [00:01:48] Content type 'application/x-gzip' length 847517 bytes (827 KB)
> [00:01:48] ==
> [00:01:48] downloaded 827 KB
> [00:01:48] 
> [00:01:48] Warning in strptime(xx, f, tz = tz) :
> [00:01:48]   unable to identify current timezone 'C':
> [00:01:48] please set environment variable 'TZ'
> [00:01:49] * installing *source* package 'rlang' ...
> [00:01:49] ** package 'rlang' successfully unpacked and MD5 sums checked
> [00:01:49] ** using staged installation
> [00:01:49] ** libs
> [00:01:49] 
> [00:01:49] *** arch - i386
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c capture.c -o capture.o
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c export.c -o export.o
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c internal.c -o internal.o
> [00:01:50] In file included from ./lib/rlang.h:74,
> [00:01:50]  from internal/arg.c:1,
> [00:01:50]  from internal.c:1:
> [00:01:50] internal/eval-tidy.c: In function 'rlang_tilde_eval':
> [00:01:50] ./lib/env.h:33:10: warning: 'top' may be used uninitialized
> in this function [-Wmaybe-uninitialized]
> [00:01:50]return ENCLOS(env);
> [00:01:50]   ^~~
> [00:01:50] In file included from internal.c:8:
> [00:01:50] internal/eval-tidy.c:406:9: note: 'top' was declared here
> [00:01:50]sexp* top;
> [00:01:50]  ^~~
> [00:01:50] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c lib.c -o lib.o
> [00:01:51] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c version.c -o version.o
> [00:01:52] C:/Rtools40/mingw64/bin/gcc -shared -s -static-libgcc -o
> rlang.dll tmp.def capture.o export.o internal.o lib.o version.o
> -LC:/R/bin/i386 -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> cannot find -lR
> [00:01:52] collect2.exe: error: ld returned 1 exit status
> [00:01:52] no DLL was created
> [00:01:52] ERROR: compilation failed for package 'rlang'
> [00:01:52] * removing 'C:/RLibrary/rlang'
> [00:01:52] 
> [00:01:52] The downloaded source packages are in
> [00:01:52]
> 'C:\Users\appveyor\AppData\Local\Temp\1\Rtmp8qrryA\downloaded_packages'
> [00:01:52] Warning message:
> [00:01:52] In install.packages(c("knitr", "rmarkdown", "testthat",
> "e1071",  :
> [00:01:52]   installation of package 'rlang' had non-zero exit status 
> {code}
> This leads to failures to install {{devtools}} and generate Rd files and, as 
> a result, CRAN check failure.
> There are some discrepancies in the 
> {{dev/appveyor-install-dependencies.ps1}}, but the direct source of this 
> issue seems to be {{$env:BINPREF}}, which forces usage of 64 bit mingw, even 
> if packages are compiled for 32 bit. 
> Modifying the variable to include current architecture:
> {code}
> $env:BINPREF=$RtoolsDrive + '/Rtools40/mingw$(WIN)/bin/'
> {code}
> (as proposed [here|https://stackoverflow.com/a/44035904] by R Yoda) looks 
> like a valid fix, though we might want to clean remaining issues as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (SPARK-33105) Broken installation of source packages on AppVeyor

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33105:


Assignee: Apache Spark

> Broken installation of source packages on AppVeyor
> --
>
> Key: SPARK-33105
> URL: https://issues.apache.org/jira/browse/SPARK-33105
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, R
>Affects Versions: 3.1.0
> Environment: *strong text*
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> It looks like the AppVeyor configuration is broken, which leads to failures when 
> installing source packages (this became a problem when {{rlang}} was updated 
> from 0.4.7 to 0.4.8, with the latter available only as a source package).
> {code}
> [00:01:48] trying URL
> 'https://cloud.r-project.org/src/contrib/rlang_0.4.8.tar.gz'
> [00:01:48] Content type 'application/x-gzip' length 847517 bytes (827 KB)
> [00:01:48] ==
> [00:01:48] downloaded 827 KB
> [00:01:48] 
> [00:01:48] Warning in strptime(xx, f, tz = tz) :
> [00:01:48]   unable to identify current timezone 'C':
> [00:01:48] please set environment variable 'TZ'
> [00:01:49] * installing *source* package 'rlang' ...
> [00:01:49] ** package 'rlang' successfully unpacked and MD5 sums checked
> [00:01:49] ** using staged installation
> [00:01:49] ** libs
> [00:01:49] 
> [00:01:49] *** arch - i386
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c capture.c -o capture.o
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c export.c -o export.o
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c internal.c -o internal.o
> [00:01:50] In file included from ./lib/rlang.h:74,
> [00:01:50]  from internal/arg.c:1,
> [00:01:50]  from internal.c:1:
> [00:01:50] internal/eval-tidy.c: In function 'rlang_tilde_eval':
> [00:01:50] ./lib/env.h:33:10: warning: 'top' may be used uninitialized
> in this function [-Wmaybe-uninitialized]
> [00:01:50]return ENCLOS(env);
> [00:01:50]   ^~~
> [00:01:50] In file included from internal.c:8:
> [00:01:50] internal/eval-tidy.c:406:9: note: 'top' was declared here
> [00:01:50]sexp* top;
> [00:01:50]  ^~~
> [00:01:50] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c lib.c -o lib.o
> [00:01:51] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c version.c -o version.o
> [00:01:52] C:/Rtools40/mingw64/bin/gcc -shared -s -static-libgcc -o
> rlang.dll tmp.def capture.o export.o internal.o lib.o version.o
> -LC:/R/bin/i386 -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> cannot find -lR
> [00:01:52] collect2.exe: error: ld returned 1 exit status
> [00:01:52] no DLL was created
> [00:01:52] ERROR: compilation failed for package 'rlang'
> [00:01:52] * removing 'C:/RLibrary/rlang'
> [00:01:52] 
> [00:01:52] The downloaded source packages are in
> [00:01:52]
> 'C:\Users\appveyor\AppData\Local\Temp\1\Rtmp8qrryA\downloaded_packages'
> [00:01:52] Warning message:
> [00:01:52] In install.packages(c("knitr", "rmarkdown", "testthat",
> "e1071",  :
> [00:01:52]   installation of package 'rlang' had non-zero exit status 
> {code}
> This leads to failures to install {{devtools}} and generate Rd files and, as 
> a result, CRAN check failure.
> There are some discrepancies in the 
> {{dev/appveyor-install-dependencies.ps1}}, but the direct source of this 
> issue seems to be {{$env:BINPREF}}, which forces usage of 64 bit mingw, even 
> if packages are compiled for 32 bit. 
> Modifying the variable to include current architecture:
> {code}
> $env:BINPREF=$RtoolsDrive + '/Rtools40/mingw$(WIN)/bin/'
> {code}
> (as proposed [here|https://stackoverflow.com/a/44035904] by R Yoda) looks 
> like a valid fix, though we might want to clean remaining issues as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---

[jira] [Commented] (SPARK-33105) Broken installation of source packages on AppVeyor

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211456#comment-17211456
 ] 

Apache Spark commented on SPARK-33105:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/29991

> Broken installation of source packages on AppVeyor
> --
>
> Key: SPARK-33105
> URL: https://issues.apache.org/jira/browse/SPARK-33105
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, R
>Affects Versions: 3.1.0
> Environment: *strong text*
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> It looks like the AppVeyor configuration is broken, which leads to failures when 
> installing source packages (this became a problem when {{rlang}} was updated 
> from 0.4.7 to 0.4.8, with the latter available only as a source package).
> {code}
> [00:01:48] trying URL
> 'https://cloud.r-project.org/src/contrib/rlang_0.4.8.tar.gz'
> [00:01:48] Content type 'application/x-gzip' length 847517 bytes (827 KB)
> [00:01:48] ==
> [00:01:48] downloaded 827 KB
> [00:01:48] 
> [00:01:48] Warning in strptime(xx, f, tz = tz) :
> [00:01:48]   unable to identify current timezone 'C':
> [00:01:48] please set environment variable 'TZ'
> [00:01:49] * installing *source* package 'rlang' ...
> [00:01:49] ** package 'rlang' successfully unpacked and MD5 sums checked
> [00:01:49] ** using staged installation
> [00:01:49] ** libs
> [00:01:49] 
> [00:01:49] *** arch - i386
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c capture.c -o capture.o
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c export.c -o export.o
> [00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c internal.c -o internal.o
> [00:01:50] In file included from ./lib/rlang.h:74,
> [00:01:50]  from internal/arg.c:1,
> [00:01:50]  from internal.c:1:
> [00:01:50] internal/eval-tidy.c: In function 'rlang_tilde_eval':
> [00:01:50] ./lib/env.h:33:10: warning: 'top' may be used uninitialized
> in this function [-Wmaybe-uninitialized]
> [00:01:50]return ENCLOS(env);
> [00:01:50]   ^~~
> [00:01:50] In file included from internal.c:8:
> [00:01:50] internal/eval-tidy.c:406:9: note: 'top' was declared here
> [00:01:50]sexp* top;
> [00:01:50]  ^~~
> [00:01:50] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c lib.c -o lib.o
> [00:01:51] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
> -I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
> -mstackrealign -c version.c -o version.o
> [00:01:52] C:/Rtools40/mingw64/bin/gcc -shared -s -static-libgcc -o
> rlang.dll tmp.def capture.o export.o internal.o lib.o version.o
> -LC:/R/bin/i386 -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
> [00:01:52]
> c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
> cannot find -lR
> [00:01:52] collect2.exe: error: ld returned 1 exit status
> [00:01:52] no DLL was created
> [00:01:52] ERROR: compilation failed for package 'rlang'
> [00:01:52] * removing 'C:/RLibrary/rlang'
> [00:01:52] 
> [00:01:52] The downloaded source packages are in
> [00:01:52]
> 'C:\Users\appveyor\AppData\Local\Temp\1\Rtmp8qrryA\downloaded_packages'
> [00:01:52] Warning message:
> [00:01:52] In install.packages(c("knitr", "rmarkdown", "testthat",
> "e1071",  :
> [00:01:52]   installation of package 'rlang' had non-zero exit status 
> {code}
> This leads to failures to install {{devtools}} and generate Rd files and, as 
> a result, to a CRAN check failure.
> There are some discrepancies in 
> {{dev/appveyor-install-dependencies.ps1}}, but the direct source of this 
> issue seems to be {{$env:BINPREF}}, which forces use of the 64-bit mingw 
> toolchain even when packages are compiled for 32-bit.
> Modifying the variable to include the current architecture:
> {code}
> $env:BINPREF=$RtoolsDrive + '/Rtools40/mingw$(WIN)/bin/'
> {code}
> (as proposed [here|https://stackoverflow.com/a/44035904] by R Yoda) looks 
> like a valid fix, though we might want to clean up the remaining issues as well.


[jira] [Commented] (SPARK-33098) Exception when using 'in' to compare a partition column to a literal with the wrong type

2020-10-09 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211453#comment-17211453
 ] 

Bruce Robbins commented on SPARK-33098:
---

I left out one case, which I added to the bottom of the description.

All the cases are covered by the PR for SPARK-25056, except for the last one, 
which still throws an exception ('Filtering is supported only on partition keys 
of type string').
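
For reference, a minimal sketch of the workaround named in the exception text quoted 
below: disabling managed file source partitions so no filter is pushed down to the 
metastore, at the cost of slower partition listing. The table name and query shape 
follow the repro in the description; this is only an illustration, not a fix for the 
underlying issue.

{code}
import org.apache.spark.sql.SparkSession

object PartitionFilterWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    // Static SQL config, so it has to be set before the session is created.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("partition-filter-workaround-sketch")
      .config("spark.sql.hive.manageFilesourcePartitions", "false")
      .enableHiveSupport()
      .getOrCreate()

    // The previously failing shape: IN with a string literal against an int partition column.
    spark.sql("select * from test where b in ('2')").show(false)
    spark.stop()
  }
}
{code}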

> Exception when using 'in' to compare a partition column to a literal with the 
> wrong type
> 
>
> Key: SPARK-33098
> URL: https://issues.apache.org/jira/browse/SPARK-33098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Comparing a partition column against a literal with the wrong type works if 
> you use equality ('='). However, if you use 'in', you get:
> {noformat}
> MetaException(message:Filtering is supported only on partition keys of type 
> string)
> {noformat}
> For example:
> {noformat}
> spark-sql> create table test (a int) partitioned by (b int) stored as parquet;
> Time taken: 0.323 seconds
> spark-sql> insert into test values (1, 1), (1, 2), (2, 2);
> 20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
> 20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
> 20/10/08 19:57:14 WARN log: Updated size to 418
> 20/10/08 19:57:14 WARN log: Updated size to 836
> Time taken: 2.124 seconds
> spark-sql> -- this works, of course
> spark-sql> select * from test where b in (2);
> 1 2
> 2 2
> Time taken: 0.13 seconds, Fetched 2 row(s)
> spark-sql> -- this also works (equals with wrong type)
> spark-sql> select * from test where b = '2';
> 1 2
> 2 2
> Time taken: 0.132 seconds, Fetched 2 row(s)
> spark-sql> -- this does not work ('in' with wrong type)
> spark-sql> select * from test where b in ('2');
> 20/10/08 19:58:30 ERROR SparkSQLDriver: Failed in [select * from test where b 
> in ('2')]
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
> -
> -
> -
> Caused by: MetaException(message:Filtering is supported only on partition 
> keys of type string)
> {noformat}
> There are also interesting variations of this using the dataframe API:
> {noformat}
> scala> sql("select cast(b as string) as b from test where b in 
> (2)").show(false)
> +---+
> |b  |
> +---+
> |2  |
> |2  |
> +---+
> scala> sql("select cast(b as string) as b from test").filter("b in 
> (2)").show(false)
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
> -
> -
> Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is 
> supported only on partition keys of type string
> {noformat}
> Also this:
> {noformat}
> scala> sql("select cast(b as string) as b from test").filter("b in 
> ('2')").show(false)
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
> -
> -
> Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is 
> supported only on partition keys of type string
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33098) Exception when using 'in' to compare a partition column to a literal with the wrong type

2020-10-09 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-33098:
--
Description: 
Comparing a partition column against a literal with the wrong type works if you 
use equality ('='). However, if you use 'in', you get:
{noformat}
MetaException(message:Filtering is supported only on partition keys of type 
string)
{noformat}
For example:
{noformat}
spark-sql> create table test (a int) partitioned by (b int) stored as parquet;
Time taken: 0.323 seconds
spark-sql> insert into test values (1, 1), (1, 2), (2, 2);
20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
20/10/08 19:57:14 WARN log: Updated size to 418
20/10/08 19:57:14 WARN log: Updated size to 836
Time taken: 2.124 seconds

spark-sql> -- this works, of course
spark-sql> select * from test where b in (2);
1   2
2   2
Time taken: 0.13 seconds, Fetched 2 row(s)

spark-sql> -- this also works (equals with wrong type)
spark-sql> select * from test where b = '2';
1   2
2   2
Time taken: 0.132 seconds, Fetched 2 row(s)

spark-sql> -- this does not work ('in' with wrong type)
spark-sql> select * from test where b in ('2');
20/10/08 19:58:30 ERROR SparkSQLDriver: Failed in [select * from test where b 
in ('2')]
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
-
-
-
Caused by: MetaException(message:Filtering is supported only on partition keys 
of type string)
{noformat}
There are also interesting variations of this using the dataframe API:
{noformat}
scala> sql("select cast(b as string) as b from test where b in (2)").show(false)
+---+
|b  |
+---+
|2  |
|2  |
+---+


scala> sql("select cast(b as string) as b from test").filter("b in 
(2)").show(false)
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
  at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
-
-
Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is 
supported only on partition keys of type string
{noformat}
Also this:
{noformat}
scala> sql("select cast(b as string) as b from test").filter("b in 
('2')").show(false)
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
-
-
Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is 
supported only on partition keys of type string
{noformat}


  was:
Comparing a partition column against a literal with the wrong type works if you 
use equality ('='). However, if you use 'in', you get:
{noformat}
MetaException(message:Filtering is supported only on partition keys of type 
string)
{noformat}
For example:
{noformat}
spark-sql> create table test (a int) partitioned by (b int) stored as parquet;
Time taken: 0.323 seconds
spark-sql> insert into test values (1, 1), (1, 2), (2, 2);
20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
20/10/08 19:57:14 WARN log: Updated size to 418
20/10/08 19:57:14 WARN log: Updated size to 836
Time taken: 2.124 seconds

spark-sql> -- this works, of course
spark-sql> select * from test where b in (2);
1   2
2   2
Time taken: 0.13 seconds, Fetched 2 row(s)

spark-sql> -- this also works (equals with wrong type)
spark-sql> select * from test where b = '2';
1   2
2   2
Time taken: 0.132 seconds, Fetched 2 row(s)

spark-sql> -- this does not work ('in' with wrong type)
spark-sql> select * from test where b in ('2');
20/10/08 19:58:30 ERROR SparkSQLDriver: Failed in [select * from test where b 
in ('2')]
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/ji

[jira] [Updated] (SPARK-33105) Broken installation of source packages on AppVeyor

2020-10-09 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-33105:
---
Description: 
It looks like the AppVeyor configuration is broken, which leads to failures when 
installing source packages (this became a problem when {{rlang}} was 
updated from 0.4.7 to 0.4.8, with the latter available only as a source package).

{code}

[00:01:48] trying URL
'https://cloud.r-project.org/src/contrib/rlang_0.4.8.tar.gz'
[00:01:48] Content type 'application/x-gzip' length 847517 bytes (827 KB)
[00:01:48] ==
[00:01:48] downloaded 827 KB
[00:01:48] 
[00:01:48] Warning in strptime(xx, f, tz = tz) :
[00:01:48]   unable to identify current timezone 'C':
[00:01:48] please set environment variable 'TZ'
[00:01:49] * installing *source* package 'rlang' ...
[00:01:49] ** package 'rlang' successfully unpacked and MD5 sums checked
[00:01:49] ** using staged installation
[00:01:49] ** libs
[00:01:49] 
[00:01:49] *** arch - i386
[00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c capture.c -o capture.o
[00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c export.c -o export.o
[00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c internal.c -o internal.o
[00:01:50] In file included from ./lib/rlang.h:74,
[00:01:50]  from internal/arg.c:1,
[00:01:50]  from internal.c:1:
[00:01:50] internal/eval-tidy.c: In function 'rlang_tilde_eval':
[00:01:50] ./lib/env.h:33:10: warning: 'top' may be used uninitialized
in this function [-Wmaybe-uninitialized]
[00:01:50]return ENCLOS(env);
[00:01:50]   ^~~
[00:01:50] In file included from internal.c:8:
[00:01:50] internal/eval-tidy.c:406:9: note: 'top' was declared here
[00:01:50]sexp* top;
[00:01:50]  ^~~
[00:01:50] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c lib.c -o lib.o
[00:01:51] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c version.c -o version.o
[00:01:52] C:/Rtools40/mingw64/bin/gcc -shared -s -static-libgcc -o
rlang.dll tmp.def capture.o export.o internal.o lib.o version.o
-LC:/R/bin/i386 -lR
[00:01:52]
c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
[00:01:52]
c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
[00:01:52]
c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
cannot find -lR
[00:01:52] collect2.exe: error: ld returned 1 exit status
[00:01:52] no DLL was created
[00:01:52] ERROR: compilation failed for package 'rlang'
[00:01:52] * removing 'C:/RLibrary/rlang'
[00:01:52] 
[00:01:52] The downloaded source packages are in
[00:01:52]
'C:\Users\appveyor\AppData\Local\Temp\1\Rtmp8qrryA\downloaded_packages'
[00:01:52] Warning message:
[00:01:52] In install.packages(c("knitr", "rmarkdown", "testthat",
"e1071",  :
[00:01:52]   installation of package 'rlang' had non-zero exit status 
{code}

This leads to failures to install {{devtools}} and generate Rd files and, as a 
result, to a CRAN check failure.

There are some discrepancies in {{dev/appveyor-install-dependencies.ps1}}, 
but the direct source of this issue seems to be {{$env:BINPREF}}, which forces 
use of the 64-bit mingw toolchain even when packages are compiled for 32-bit.

Modifying the variable to include the current architecture:

{code}
$env:BINPREF=$RtoolsDrive + '/Rtools40/mingw$(WIN)/bin/'
{code}

(as proposed [here|https://stackoverflow.com/a/44035904] by R Yoda) looks like 
a valid fix, though we might want to clean up the remaining issues as well.

  was:
It looks like AppVeyor configuration is broken, which leads to failure of 
installation of  source packages (become a problem when {{rlang}} has been 
updated from 0.4.7 and 0.4.8, with latter available only as a source package).

{code}

[00:01:48] trying URL
'https://cloud.r-project.org/src/contrib/rlang_0.4.8.tar.gz'
[00:01:48] Content type 'application/x-gzip' length 847517 bytes (827 KB)
[00:01:48] ==
[00:01:48] downloaded 827 KB
[00:01:48] 
[00:01:48] Warning in strptime(xx, f, tz = tz) :
[00:01:48]   unable to identify current timezone 'C':
[00:01:48] please set environment variable 'TZ'
[00:01:49] * installing *source* package 'rlang' ...
[00:01:49] ** package 'rlang' s

[jira] [Created] (SPARK-33105) Broken installation of source packages on AppVeyor

2020-10-09 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-33105:
--

 Summary: Broken installation of source packages on AppVeyor
 Key: SPARK-33105
 URL: https://issues.apache.org/jira/browse/SPARK-33105
 Project: Spark
  Issue Type: Bug
  Components: Project Infra, R
Affects Versions: 3.1.0
Reporter: Maciej Szymkiewicz


It looks like the AppVeyor configuration is broken, which leads to failures when 
installing source packages (this became a problem when {{rlang}} was 
updated from 0.4.7 to 0.4.8, with the latter available only as a source package).

{code}

[00:01:48] trying URL
'https://cloud.r-project.org/src/contrib/rlang_0.4.8.tar.gz'
[00:01:48] Content type 'application/x-gzip' length 847517 bytes (827 KB)
[00:01:48] ==
[00:01:48] downloaded 827 KB
[00:01:48] 
[00:01:48] Warning in strptime(xx, f, tz = tz) :
[00:01:48]   unable to identify current timezone 'C':
[00:01:48] please set environment variable 'TZ'
[00:01:49] * installing *source* package 'rlang' ...
[00:01:49] ** package 'rlang' successfully unpacked and MD5 sums checked
[00:01:49] ** using staged installation
[00:01:49] ** libs
[00:01:49] 
[00:01:49] *** arch - i386
[00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c capture.c -o capture.o
[00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c export.c -o export.o
[00:01:49] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c internal.c -o internal.o
[00:01:50] In file included from ./lib/rlang.h:74,
[00:01:50]  from internal/arg.c:1,
[00:01:50]  from internal.c:1:
[00:01:50] internal/eval-tidy.c: In function 'rlang_tilde_eval':
[00:01:50] ./lib/env.h:33:10: warning: 'top' may be used uninitialized
in this function [-Wmaybe-uninitialized]
[00:01:50]return ENCLOS(env);
[00:01:50]   ^~~
[00:01:50] In file included from internal.c:8:
[00:01:50] internal/eval-tidy.c:406:9: note: 'top' was declared here
[00:01:50]sexp* top;
[00:01:50]  ^~~
[00:01:50] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c lib.c -o lib.o
[00:01:51] C:/Rtools40/mingw64/bin/gcc  -I"C:/R/include" -DNDEBUG
-I./lib/ -O2 -Wall  -std=gnu99 -mfpmath=sse -msse2
-mstackrealign -c version.c -o version.o
[00:01:52] C:/Rtools40/mingw64/bin/gcc -shared -s -static-libgcc -o
rlang.dll tmp.def capture.o export.o internal.o lib.o version.o
-LC:/R/bin/i386 -lR
[00:01:52]
c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
[00:01:52]
c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
skipping incompatible C:/R/bin/i386/R.dll when searching for -lR
[00:01:52]
c:/Rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe:
cannot find -lR
[00:01:52] collect2.exe: error: ld returned 1 exit status
[00:01:52] no DLL was created
[00:01:52] ERROR: compilation failed for package 'rlang'
[00:01:52] * removing 'C:/RLibrary/rlang'
[00:01:52] 
[00:01:52] The downloaded source packages are in
[00:01:52]
'C:\Users\appveyor\AppData\Local\Temp\1\Rtmp8qrryA\downloaded_packages'
[00:01:52] Warning message:
[00:01:52] In install.packages(c("knitr", "rmarkdown", "testthat",
"e1071",  :
[00:01:52]   installation of package 'rlang' had non-zero exit status 
{code}

There are some discrepancies in {{dev/appveyor-install-dependencies.ps1}}, 
but the direct source of this issue seems to be {{$env:BINPREF}}, which forces 
use of the 64-bit mingw toolchain even when packages are compiled for 32-bit.

Modifying the variable to include the current architecture:

{code}
$env:BINPREF=$RtoolsDrive + '/Rtools40/mingw$(WIN)/bin/'
{code}

(as proposed [here|https://stackoverflow.com/a/44035904] by R Yoda) looks like 
a valid fix, though we might want to clean up the remaining issues as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33098) Exception when using 'in' to compare a partition column to a literal with the wrong type

2020-10-09 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-33098:
--
Description: 
Comparing a partition column against a literal with the wrong type works if you 
use equality ('='). However, if you use 'in', you get:
{noformat}
MetaException(message:Filtering is supported only on partition keys of type 
string)
{noformat}
For example:
{noformat}
spark-sql> create table test (a int) partitioned by (b int) stored as parquet;
Time taken: 0.323 seconds
spark-sql> insert into test values (1, 1), (1, 2), (2, 2);
20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
20/10/08 19:57:14 WARN log: Updated size to 418
20/10/08 19:57:14 WARN log: Updated size to 836
Time taken: 2.124 seconds

spark-sql> -- this works, of course
spark-sql> select * from test where b in (2);
1   2
2   2
Time taken: 0.13 seconds, Fetched 2 row(s)

spark-sql> -- this also works (equals with wrong type)
spark-sql> select * from test where b = '2';
1   2
2   2
Time taken: 0.132 seconds, Fetched 2 row(s)

spark-sql> -- this does not work ('in' with wrong type)
spark-sql> select * from test where b in ('2');
20/10/08 19:58:30 ERROR SparkSQLDriver: Failed in [select * from test where b 
in ('2')]
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
-
-
-
Caused by: MetaException(message:Filtering is supported only on partition keys 
of type string)
{noformat}
There are also interesting variations of this using the dataframe API:
{noformat}
scala> sql("select cast(b as string) as b from test where b in (2)").show(false)
+---+
|b  |
+---+
|2  |
|2  |
+---+


scala> sql("select cast(b as string) as b from test").filter("b in 
(2)").show(false)
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
  at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
-
-
Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is 
supported only on partition keys of type string

scala> sql("select cast(b as string) as b from test").filter("b in 
('2')").show(false)
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
-
-
Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is 
supported only on partition keys of type string
{noformat}


  was:
Comparing a partition column against a literal with the wrong type works if you 
use equality ('='). However, if you use 'in', you get:
{noformat}
MetaException(message:Filtering is supported only on partition keys of type 
string)
{noformat}
For example:
{noformat}
spark-sql> create table test (a int) partitioned by (b int) stored as parquet;
Time taken: 0.323 seconds
spark-sql> insert into test values (1, 1), (1, 2), (2, 2);
20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
20/10/08 19:57:14 WARN log: Updated size to 418
20/10/08 19:57:14 WARN log: Updated size to 836
Time taken: 2.124 seconds

spark-sql> -- this works, of course
spark-sql> select * from test where b in (2);
1   2
2   2
Time taken: 0.13 seconds, Fetched 2 row(s)

spark-sql> -- this also works (equals with wrong type)
spark-sql> select * from test where b = '2';
1   2
2   2
Time taken: 0.132 seconds, Fetched 2 row(s)

spark-sql> -- this does not work ('in' with wrong type)
spark-sql> select * from test where b in ('2');
20/10/08 19:58:30 ERROR SparkSQLDriver: Failed in [select * from test where b 
in ('2')]
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
at 
org.

[jira] [Updated] (SPARK-33098) Exception when using 'in' to compare a partition column to a literal with the wrong type

2020-10-09 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-33098:
--
Description: 
Comparing a partition column against a literal with the wrong type works if you 
use equality ('='). However, if you use 'in', you get:
{noformat}
MetaException(message:Filtering is supported only on partition keys of type 
string)
{noformat}
For example:
{noformat}
spark-sql> create table test (a int) partitioned by (b int) stored as parquet;
Time taken: 0.323 seconds
spark-sql> insert into test values (1, 1), (1, 2), (2, 2);
20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
20/10/08 19:57:14 WARN log: Updated size to 418
20/10/08 19:57:14 WARN log: Updated size to 836
Time taken: 2.124 seconds

spark-sql> -- this works, of course
spark-sql> select * from test where b in (2);
1   2
2   2
Time taken: 0.13 seconds, Fetched 2 row(s)

spark-sql> -- this also works (equals with wrong type)
spark-sql> select * from test where b = '2';
1   2
2   2
Time taken: 0.132 seconds, Fetched 2 row(s)

spark-sql> -- this does not work ('in' with wrong type)
spark-sql> select * from test where b in ('2');
20/10/08 19:58:30 ERROR SparkSQLDriver: Failed in [select * from test where b 
in ('2')]
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
-
-
-
Caused by: MetaException(message:Filtering is supported only on partition keys 
of type string)
{noformat}
There are also interesting variations of this using the dataframe API:
{noformat}
scala> sql("select cast(b as string) as b from test where b in (2)").show(false)
+---+
|b  |
+---+
|2  |
|2  |
+---+


scala> sql("select cast(b as string) as b from test").filter("b in 
(2)").show(false)
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
  at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
-
-
Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is 
supported only on partition keys of type string

scala> sql("select cast(b as string) as b from test").filter("b in 
('2')").show(false)
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is 
supported only on partition keys of type string
{noformat}


  was:
Comparing a partition column against a literal with the wrong type works if you 
use equality ('='). However, if you use 'in', you get:
{noformat}
MetaException(message:Filtering is supported only on partition keys of type 
string)
{noformat}
For example:
{noformat}
spark-sql> create table test (a int) partitioned by (b int) stored as parquet;
Time taken: 0.323 seconds
spark-sql> insert into test values (1, 1), (1, 2), (2, 2);
20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
20/10/08 19:57:14 WARN log: Updated size to 418
20/10/08 19:57:14 WARN log: Updated size to 836
Time taken: 2.124 seconds

spark-sql> -- this works, of course
spark-sql> select * from test where b in (2);
1   2
2   2
Time taken: 0.13 seconds, Fetched 2 row(s)

spark-sql> -- this also works (equals with wrong type)
spark-sql> select * from test where b = '2';
1   2
2   2
Time taken: 0.132 seconds, Fetched 2 row(s)

spark-sql> -- this does not work ('in' with wrong type)
spark-sql> select * from test where b in ('2');
20/10/08 19:58:30 ERROR SparkSQLDriver: Failed in [select * from test where b 
in ('2')]
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
at 
org.apac

[jira] [Assigned] (SPARK-31972) Improve heuristic for selecting nodes for scale down to take into account graceful decommission cost

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31972:


Assignee: Apache Spark

> Improve heuristic for selecting nodes for scale down to take into account 
> graceful decommission cost
> 
>
> Key: SPARK-31972
> URL: https://issues.apache.org/jira/browse/SPARK-31972
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>
> Once SPARK-31198 is in we should see if we can come up with a better graceful 
> decommissioning aware heuristic to use for selecting nodes to scale down.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31972) Improve heuristic for selecting nodes for scale down to take into account graceful decommission cost

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211408#comment-17211408
 ] 

Apache Spark commented on SPARK-31972:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29993

> Improve heuristic for selecting nodes for scale down to take into account 
> graceful decommission cost
> 
>
> Key: SPARK-31972
> URL: https://issues.apache.org/jira/browse/SPARK-31972
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Major
>
> Once SPARK-31198 is in we should see if we can come up with a better graceful 
> decommissioning aware heuristic to use for selecting nodes to scale down.
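
Purely as an illustration of the direction (this is not Spark's implementation, and all 
per-executor metrics and weights below are hypothetical), a decommission-cost-aware 
selection could rank candidate executors by how much state a graceful decommission 
would have to migrate:

{code}
// Illustrative sketch only: hypothetical per-executor metrics and weights.
final case class ExecutorInfo(id: String, cachedBlocks: Int, shuffleFiles: Int, idleSeconds: Long)

object ScaleDownHeuristicSketch {
  // Lower cost first: prefer executors whose graceful decommission migrates the least state.
  def decommissionCost(e: ExecutorInfo): Long =
    e.cachedBlocks.toLong * 10 + e.shuffleFiles.toLong * 5 - e.idleSeconds.min(60)

  def selectForScaleDown(executors: Seq[ExecutorInfo], count: Int): Seq[String] =
    executors.sortBy(decommissionCost).take(count).map(_.id)

  def main(args: Array[String]): Unit = {
    val execs = Seq(
      ExecutorInfo("exec-1", cachedBlocks = 0, shuffleFiles = 2, idleSeconds = 120),
      ExecutorInfo("exec-2", cachedBlocks = 40, shuffleFiles = 10, idleSeconds = 5),
      ExecutorInfo("exec-3", cachedBlocks = 3, shuffleFiles = 0, idleSeconds = 30))
    println(selectForScaleDown(execs, 1)) // List(exec-1): the least state to migrate
  }
}
{code}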



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31972) Improve heuristic for selecting nodes for scale down to take into account graceful decommission cost

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31972:


Assignee: (was: Apache Spark)

> Improve heuristic for selecting nodes for scale down to take into account 
> graceful decommission cost
> 
>
> Key: SPARK-31972
> URL: https://issues.apache.org/jira/browse/SPARK-31972
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Major
>
> Once SPARK-31198 is in we should see if we can come up with a better graceful 
> decommissioning aware heuristic to use for selecting nodes to scale down.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32881) NoSuchElementException occurs during decommissioning

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211407#comment-17211407
 ] 

Apache Spark commented on SPARK-32881:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29992

> NoSuchElementException occurs during decommissioning
> 
>
> Key: SPARK-32881
> URL: https://issues.apache.org/jira/browse/SPARK-32881
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `BlockManagerMasterEndpoint` seems to fail at `getReplicateInfoForRDDBlocks` 
> due to `java.util.NoSuchElementException`. This happens during K8s IT testing, 
> but the main code seems to need graceful handling of 
> `NoSuchElementException` instead of showing a naive error message.
> {code}
> private def getReplicateInfoForRDDBlocks(blockManagerId: BlockManagerId): 
> Seq[ReplicateBlock] = {
> val info = blockManagerInfo(blockManagerId)
>...
> }
> {code}
> {code}
>   20/09/14 18:56:54 INFO ExecutorPodsAllocator: Going to request 1 executors 
> from Kubernetes.
>   20/09/14 18:56:54 INFO BasicExecutorFeatureStep: Adding decommission script 
> to lifecycle
>   20/09/14 18:56:55 ERROR TaskSchedulerImpl: Lost executor 1 on 172.17.0.4: 
> Executor decommission.
>   20/09/14 18:56:55 INFO BlockManagerMaster: Removal of executor 1 requested
>   20/09/14 18:56:55 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove 
> non-existent executor 1
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove 
> executor 1 from BlockManagerMaster.
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Removing block manager 
> BlockManagerId(1, 172.17.0.4, 41235, None)
>   20/09/14 18:56:55 INFO DAGScheduler: Executor lost: 1 (epoch 1)
>   20/09/14 18:56:55 ERROR Inbox: Ignoring error
>   java.util.NoSuchElementException
>   at scala.collection.concurrent.TrieMap.apply(TrieMap.scala:833)
>   at 
> org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$getReplicateInfoForRDDBlocks(BlockManagerMasterEndpoint.scala:383)
>   at 
> org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:171)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove 
> executor 1 from BlockManagerMaster.
>   20/09/14 18:56:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor
>   20/09/14 18:56:55 INFO DAGScheduler: Shuffle files lost for executor: 1 
> (epoch 1)
>   20/09/14 18:56:58 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered 
> executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.7:46674) with 
> ID 4,  ResourceProfileId 0
>   20/09/14 18:56:58 INFO BlockManagerMasterEndpoint: Registering block 
> manager 172.17.0.7:40495 with 593.9 MiB RAM, BlockManagerId(4, 172.17.0.7, 
> 40495, None)
>   20/09/14 18:57:23 INFO SparkContext: Starting job: count at 
> /opt/spark/tests/decommissioning.py:49
> {code}
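
The stack trace above points at `TrieMap.apply`, which throws for a missing key. A 
minimal, self-contained sketch of the kind of graceful handling the description asks 
for (not the actual Spark patch; the names here are illustrative only):

{code}
import scala.collection.concurrent.TrieMap

object SafeLookupSketch {
  // Stand-in for blockManagerInfo: apply() on a missing key throws NoSuchElementException,
  // while get() lets us treat an already-removed executor as "nothing to replicate".
  private val blockManagerInfo = TrieMap[String, Seq[String]]()

  def replicateInfoFor(executorId: String): Seq[String] =
    blockManagerInfo.get(executorId) match {
      case Some(blocks) => blocks    // executor still registered
      case None         => Seq.empty // already removed, e.g. during decommissioning
    }

  def main(args: Array[String]): Unit = {
    blockManagerInfo.put("exec-1", Seq("rdd_0_0"))
    println(replicateInfoFor("exec-1")) // List(rdd_0_0)
    println(replicateInfoFor("exec-2")) // List() instead of a NoSuchElementException
  }
}
{code}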



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32881) NoSuchElementException occurs during decommissioning

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32881:


Assignee: Apache Spark

> NoSuchElementException occurs during decommissioning
> 
>
> Key: SPARK-32881
> URL: https://issues.apache.org/jira/browse/SPARK-32881
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> `BlockManagerMasterEndpoint` seems to fail at `getReplicateInfoForRDDBlocks` 
> due to `java.util.NoSuchElementException`. This happens during K8s IT testing, 
> but the main code seems to need graceful handling of 
> `NoSuchElementException` instead of showing a naive error message.
> {code}
> private def getReplicateInfoForRDDBlocks(blockManagerId: BlockManagerId): 
> Seq[ReplicateBlock] = {
> val info = blockManagerInfo(blockManagerId)
>...
> }
> {code}
> {code}
>   20/09/14 18:56:54 INFO ExecutorPodsAllocator: Going to request 1 executors 
> from Kubernetes.
>   20/09/14 18:56:54 INFO BasicExecutorFeatureStep: Adding decommission script 
> to lifecycle
>   20/09/14 18:56:55 ERROR TaskSchedulerImpl: Lost executor 1 on 172.17.0.4: 
> Executor decommission.
>   20/09/14 18:56:55 INFO BlockManagerMaster: Removal of executor 1 requested
>   20/09/14 18:56:55 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove 
> non-existent executor 1
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove 
> executor 1 from BlockManagerMaster.
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Removing block manager 
> BlockManagerId(1, 172.17.0.4, 41235, None)
>   20/09/14 18:56:55 INFO DAGScheduler: Executor lost: 1 (epoch 1)
>   20/09/14 18:56:55 ERROR Inbox: Ignoring error
>   java.util.NoSuchElementException
>   at scala.collection.concurrent.TrieMap.apply(TrieMap.scala:833)
>   at 
> org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$getReplicateInfoForRDDBlocks(BlockManagerMasterEndpoint.scala:383)
>   at 
> org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:171)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove 
> executor 1 from BlockManagerMaster.
>   20/09/14 18:56:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor
>   20/09/14 18:56:55 INFO DAGScheduler: Shuffle files lost for executor: 1 
> (epoch 1)
>   20/09/14 18:56:58 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered 
> executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.7:46674) with 
> ID 4,  ResourceProfileId 0
>   20/09/14 18:56:58 INFO BlockManagerMasterEndpoint: Registering block 
> manager 172.17.0.7:40495 with 593.9 MiB RAM, BlockManagerId(4, 172.17.0.7, 
> 40495, None)
>   20/09/14 18:57:23 INFO SparkContext: Starting job: count at 
> /opt/spark/tests/decommissioning.py:49
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32881) NoSuchElementException occurs during decommissioning

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32881:


Assignee: (was: Apache Spark)

> NoSuchElementException occurs during decommissioning
> 
>
> Key: SPARK-32881
> URL: https://issues.apache.org/jira/browse/SPARK-32881
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `BlockManagerMasterEndpoint` seems to fail at `getReplicateInfoForRDDBlocks` 
> due to `java.util.NoSuchElementException`. This happens during K8s IT testing, 
> but the main code seems to need graceful handling of 
> `NoSuchElementException` instead of showing a naive error message.
> {code}
> private def getReplicateInfoForRDDBlocks(blockManagerId: BlockManagerId): 
> Seq[ReplicateBlock] = {
> val info = blockManagerInfo(blockManagerId)
>...
> }
> {code}
> {code}
>   20/09/14 18:56:54 INFO ExecutorPodsAllocator: Going to request 1 executors 
> from Kubernetes.
>   20/09/14 18:56:54 INFO BasicExecutorFeatureStep: Adding decommission script 
> to lifecycle
>   20/09/14 18:56:55 ERROR TaskSchedulerImpl: Lost executor 1 on 172.17.0.4: 
> Executor decommission.
>   20/09/14 18:56:55 INFO BlockManagerMaster: Removal of executor 1 requested
>   20/09/14 18:56:55 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove 
> non-existent executor 1
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove 
> executor 1 from BlockManagerMaster.
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Removing block manager 
> BlockManagerId(1, 172.17.0.4, 41235, None)
>   20/09/14 18:56:55 INFO DAGScheduler: Executor lost: 1 (epoch 1)
>   20/09/14 18:56:55 ERROR Inbox: Ignoring error
>   java.util.NoSuchElementException
>   at scala.collection.concurrent.TrieMap.apply(TrieMap.scala:833)
>   at 
> org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$getReplicateInfoForRDDBlocks(BlockManagerMasterEndpoint.scala:383)
>   at 
> org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:171)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>   20/09/14 18:56:55 INFO BlockManagerMasterEndpoint: Trying to remove 
> executor 1 from BlockManagerMaster.
>   20/09/14 18:56:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor
>   20/09/14 18:56:55 INFO DAGScheduler: Shuffle files lost for executor: 1 
> (epoch 1)
>   20/09/14 18:56:58 INFO 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered 
> executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.7:46674) with 
> ID 4,  ResourceProfileId 0
>   20/09/14 18:56:58 INFO BlockManagerMasterEndpoint: Registering block 
> manager 172.17.0.7:40495 with 593.9 MiB RAM, BlockManagerId(4, 172.17.0.7, 
> 40495, None)
>   20/09/14 18:57:23 INFO SparkContext: Starting job: count at 
> /opt/spark/tests/decommissioning.py:49
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33104) Fix YarnClusterSuite.yarn-cluster should respect conf overrides in SparkHadoopUtil

2020-10-09 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33104:
-

 Summary: Fix YarnClusterSuite.yarn-cluster should respect conf 
overrides in SparkHadoopUtil
 Key: SPARK-33104
 URL: https://issues.apache.org/jira/browse/SPARK-33104
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun


- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/1377/testReport/org.apache.spark.deploy.yarn/YarnClusterSuite/yarn_cluster_should_respect_conf_overrides_in_SparkHadoopUtil__SPARK_16414__SPARK_23630_/
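
The failing assertion ("null was not equal to \"testvalue\"", see the log below) checks 
that a configuration override is visible through SparkHadoopUtil in yarn-cluster mode. 
A minimal local sketch of the behaviour being asserted, assuming the usual 
spark.hadoop.* propagation and using a hypothetical key name:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object HadoopConfOverrideSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[1]")
      .setAppName("conf-override-sketch")
      // Hypothetical key, for illustration only; any spark.hadoop.* entry should be
      // copied into the driver's Hadoop Configuration.
      .set("spark.hadoop.test.override.key", "testvalue")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    // Expected: "testvalue"; the Jenkins run below observed null on yarn-cluster instead.
    println(spark.sparkContext.hadoopConfiguration.get("test.override.key"))
    spark.stop()
  }
}
{code}

The Jenkins log from the failed run: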

{code}
20/10/09 05:18:13.211 ContainersLauncher #0 WARN DefaultContainerExecutor: Exit 
code from container container_1602245728426_0006_02_01 is : 15
20/10/09 05:18:13.211 ContainersLauncher #0 WARN DefaultContainerExecutor: 
Exception from container-launch with container ID: 
container_1602245728426_0006_02_01 and exit code: 15
ExitCodeException exitCode=15: 
at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
at org.apache.hadoop.util.Shell.run(Shell.java:482)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/10/09 05:18:13.211 ContainersLauncher #0 WARN ContainerLaunch: Container 
exited with a non-zero exit code 15
20/10/09 05:18:13.237 AsyncDispatcher event handler WARN NMAuditLogger: 
USER=jenkinsOPERATION=Container Finished - Failed   TARGET=ContainerImpl
RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE
APPID=application_1602245728426_0006
CONTAINERID=container_1602245728426_0006_02_01
20/10/09 05:18:13.244 Socket Reader #1 for port 37112 INFO Server: Auth 
successful for appattempt_1602245728426_0006_02 (auth:SIMPLE)
20/10/09 05:18:13.326 IPC Parameter Sending Thread #0 DEBUG Client: IPC Client 
(1123559518) connection to amp-jenkins-worker-04.amp/192.168.10.24:43090 from 
jenkins sending #37
20/10/09 05:18:13.327 IPC Client (1123559518) connection to 
amp-jenkins-worker-04.amp/192.168.10.24:43090 from jenkins DEBUG Client: IPC 
Client (1123559518) connection to amp-jenkins-worker-04.amp/192.168.10.24:43090 
from jenkins got value #37
20/10/09 05:18:13.328 main DEBUG ProtobufRpcEngine: Call: getApplicationReport 
took 2ms
20/10/09 05:18:13.328 main INFO Client: Application report for 
application_1602245728426_0006 (state: FINISHED)
20/10/09 05:18:13.328 main DEBUG Client: 
 client token: N/A
 diagnostics: User class threw exception: 
org.scalatest.exceptions.TestFailedException: null was not equal to "testvalue"
at 
org.scalatest.matchers.MatchersHelper$.indicateFailure(MatchersHelper.scala:344)
at 
org.scalatest.matchers.should.Matchers$ShouldMethodHelperClass.shouldMatcher(Matchers.scala:6778)
at 
org.scalatest.matchers.should.Matchers$AnyShouldWrapper.should(Matchers.scala:6822)
at 
org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf$.$anonfun$main$2(YarnClusterSuite.scala:383)
at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at 
org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf$.main(YarnClusterSuite.scala:382)
at 
org.apache.spark.deploy.yarn.YarnClusterDriverUseSparkHadoopUtilConf.main(YarnClusterSuite.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:732)

 ApplicationMaster host: amp-jenkins-worker-04.amp
 ApplicationMaster RPC port: 36200
 queue: default
 start time: 1602245859148
 final status: FAILED
 tracking URL: 
http://amp-jenkins-worker-04.amp:39546/proxy/application_1602245728426_0

[jira] [Resolved] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata

2020-10-09 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-9686.
-
Resolution: Duplicate

> Spark Thrift server doesn't return correct JDBC metadata 
> -
>
> Key: SPARK-9686
> URL: https://issues.apache.org/jira/browse/SPARK-9686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2
>Reporter: pin_zhang
>Priority: Critical
> Attachments: SPARK-9686.1.patch.txt
>
>
> 1. Start start-thriftserver.sh
> 2. Connect with beeline
> 3. Create a table
> 4. Show tables; the newly created table is returned
> 5. Run the following JDBC client code:
>   Class.forName("org.apache.hive.jdbc.HiveDriver");
>   String URL = "jdbc:hive2://localhost:1/default";
>   Properties info = new Properties();
>   Connection conn = DriverManager.getConnection(URL, info);
>   ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
>   null, null, null);
> Problem:
>   No tables are returned by this API, which worked in Spark 1.3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33103) Custom Schema with Custom RDD reorders columns when more than 4 added

2020-10-09 Thread Justin Mays (Jira)
Justin Mays created SPARK-33103:
---

 Summary: Custom Schema with Custom RDD reorders columns when more 
than 4 added
 Key: SPARK-33103
 URL: https://issues.apache.org/jira/browse/SPARK-33103
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
 Environment: Java Application
Reporter: Justin Mays


I have a custom RDD written in Java that uses a custom schema. Everything 
appears to work fine when using 4 columns, but when I add a 5th column, calling 
show() fails with 

java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Long is not a valid external type for schema of

Here is the schema definition in Java:

StructType schema = new StructType()
    .add("recordId", DataTypes.LongType, false)
    .add("col1", DataTypes.DoubleType, false)
    .add("col2", DataTypes.DoubleType, false)
    .add("col3", DataTypes.IntegerType, false)
    .add("col4", DataTypes.IntegerType, false);

 

Here is the printout of schema.printTreeString();

== Physical Plan ==
*(1) Scan dw [recordId#0L,col1#1,col2#2,col3#3,col4#4] PushedFilters: [], 
ReadSchema: struct

 

I hardcoded a return in my Row object with values matching the schema:

@Override
public Object get(int i) {
    switch (i) {
        case 0: return 0L;
        case 1: return 1.1911950001644689D;
        case 2: return 9.10949955666E9D;
        case 3: return 476;
        case 4: return 500;
    }
    return 0L;
}

 

Here is the output of the show command:

15:30:26.875 ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.Long is not a valid external type for schema of double
validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, col1), DoubleType) AS col1#30
validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 1, recordId), LongType) AS recordId#31L
validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 2, col2), DoubleType) AS col2#32
validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 3, col3), IntegerType) AS col3#33
validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 4, col4), IntegerType) AS col4#34
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:215)
 ~[spark-catalyst_2.12-3.0.1.jar:3.0.1] at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:197)
 ~[spark-catalyst_2.12-3.0.1.jar:3.0.1] at 
scala.collection.Iterator$$anon$10.next(Iterator.scala:459) 
~[scala-library-2.12.10.jar:?] at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) ~[?:?] at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 ~[spark-sql_2.12-3.0.1.jar:3.0.1] at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
 ~[spark-sql_2.12-3.0.1.jar:3.0.1] at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
 ~[spark-sql_2.12-3.0.1.jar:3.0.1] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872) 
~[spark-core_2.12-3.0.1.jar:3.0.1] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
 ~[spark-core_2.12-3.0.1.jar:3.0.1] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
~[spark-core_2.12-3.0.1.jar:3.0.1] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) 
~[spark-core_2.12-3.0.1.jar:3.0.1] at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) 
~[spark-core_2.12-3.0.1.jar:3.0.1] at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) 
~[spark-core_2.12-3.0.1.jar:3.0.1] at 
org.apache.spark.scheduler.Task.run(Task.scala:127) 
~[spark-core_2.12-3.0.1.jar:3.0.1] at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
 ~[spark-core_2.12-3.0.1.jar:3.0.1] at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) 
~[spark-core_2.12-3.0.1.jar:3.0.1] at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) 
[spark-core_2.12-3.0.1.jar:3.0.1] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_265] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_265] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_265]Caused by: 
java.lang.RuntimeException: java.lang.Long is not a valid external type for 
schema of double at 
org.apache.spark.sql.catalyst.expre
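
The failure above is the usual symptom of Row values not lining up positionally with the fields declared in the StructType. As a minimal Scala sketch (not the reporter's application; the column names and sample values only mirror the report), building the DataFrame from an RDD[Row] whose value order matches the schema avoids the "not a valid external type" error:

{code:scala}
// Minimal sketch: Row values must appear in the same order as the StructType fields
// handed to createDataFrame. Column names/values mirror the report above.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object SchemaOrderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val schema = new StructType()
      .add("recordId", LongType, nullable = false)
      .add("col1", DoubleType, nullable = false)
      .add("col2", DoubleType, nullable = false)
      .add("col3", IntegerType, nullable = false)
      .add("col4", IntegerType, nullable = false)

    // Long, Double, Double, Int, Int -- the same order as the schema fields above.
    val rows = spark.sparkContext.parallelize(Seq(
      Row(0L, 1.1911950001644689D, 9.10949955666E9D, 476, 500)
    ))

    // If a Long ends up in a slot the encoder expects to be a Double (for example
    // because the field order got shuffled), the job fails with
    // "java.lang.Long is not a valid external type for schema of double".
    spark.createDataFrame(rows, schema).show()
    spark.stop()
  }
}
{code}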

[jira] [Commented] (SPARK-32069) Improve error message on reading unexpected directory which is not a table partition

2020-10-09 Thread Aoyuan Liao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211312#comment-17211312
 ] 

Aoyuan Liao commented on SPARK-32069:
-

[~Gengliang.Wang] If no one is working on it, can I take this one?

> Improve error message on reading unexpected directory which is not a table 
> partition
> 
>
> Key: SPARK-32069
> URL: https://issues.apache.org/jira/browse/SPARK-32069
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>  Labels: starter
>
> To reproduce:
> {code:java}
> spark-sql> create table test(i long);
> spark-sql> insert into test values(1);
> {code}
> {code:java}
> bash $ mkdir ./spark-warehouse/test/data
> {code}
> There will be an error message like this:
> {code:java}
> java.io.IOException: Not a file: 
> file:/Users/gengliang.wang/projects/spark/spark-warehouse/test/data
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2173)
>   at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>   at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:282)
>   at 
> org.apache.spark.
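
One possible shape for the improved message (a sketch under the assumption that the check happens before the paths are handed to the input format; this is not the actual fix): name the table location and the unexpected sub-directories explicitly, using the Hadoop FileSystem API.

{code:scala}
// Hedged sketch only: surface a clearer error when a stray sub-directory (like the
// spark-warehouse/test/data directory above) sits under a non-partitioned table path.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def assertNoUnexpectedDirs(tableLocation: String, conf: Configuration): Unit = {
  val path = new Path(tableLocation)
  val fs = path.getFileSystem(conf)
  val dirs = fs.listStatus(path).filter(_.isDirectory)
  if (dirs.nonEmpty) {
    throw new IllegalArgumentException(
      s"Found unexpected directories under non-partitioned table location $tableLocation: " +
        dirs.map(_.getPath).mkString(", ") +
        ". Only data files are expected here; remove these directories or recreate " +
        "the table as a partitioned table.")
  }
}
{code}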

[jira] [Commented] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata

2020-10-09 Thread Aoyuan Liao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211305#comment-17211305
 ] 

Aoyuan Liao commented on SPARK-9686:


[~srowen] I think SPARK-28426 fixed this.

> Spark Thrift server doesn't return correct JDBC metadata 
> -
>
> Key: SPARK-9686
> URL: https://issues.apache.org/jira/browse/SPARK-9686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2
>Reporter: pin_zhang
>Priority: Critical
> Attachments: SPARK-9686.1.patch.txt
>
>
> 1. Start  start-thriftserver.sh
> 2. connect with beeline
> 3. create table
> 4. show tables; the newly created table is returned
> 5.
>   Class.forName("org.apache.hive.jdbc.HiveDriver");
>   String URL = "jdbc:hive2://localhost:1/default";
>Properties info = new Properties();
> Connection conn = DriverManager.getConnection(URL, info);
>   ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
>null, null, null);
> Problem:
> No tables are returned by this API, which worked in Spark 1.3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33039) Misleading watermark calculation in structure streaming

2020-10-09 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211294#comment-17211294
 ] 

Sean R. Owen commented on SPARK-33039:
--

Yeah, I think that's an OK resolution. Thanks for resolving, Sandish.

> Misleading watermark calculation in structure streaming
> ---
>
> Key: SPARK-33039
> URL: https://issues.apache.org/jira/browse/SPARK-33039
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4
>Reporter: Sandish Kumar HN
>Priority: Major
>
> source code:
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.hadoop.fs.Path
> import java.sql.Timestamp
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.streaming.{ProcessingTime, Trigger}
> object TestWaterMark extends App {
>  val spark = SparkSession.builder().master("local").getOrCreate()
>  val sc = spark.sparkContext
>  val dir = new Path("/tmp/test-structured-streaming")
>  val fs = dir.getFileSystem(sc.hadoopConfiguration)
>  fs.mkdirs(dir)
>  val schema = StructType(StructField("vilue", StringType) ::
>  StructField("timestamp", TimestampType) ::
>  Nil)
>  val eventStream = spark
>  .readStream
>  .option("sep", ";")
>  .option("header", "false")
>  .schema(schema)
>  .csv(dir.toString)
>  // Watermarked aggregation
>  val eventsCount = eventStream
>  .withWatermark("timestamp", "5 seconds")
>  .groupBy(window(col("timestamp"), "10 seconds"))
>  .count
>  def writeFile(path: Path, data: String) {
>  val file = fs.create(path)
>  file.writeUTF(data)
>  file.close()
>  }
>  // Debug query
>  val query = eventsCount.writeStream
>  .format("console")
>  .outputMode("complete")
>  .option("truncate", "false")
>  .trigger(Trigger.ProcessingTime("5 seconds"))
>  .start()
>  writeFile(new Path(dir, "file1"), """
>  |OLD;2019-08-09 10:05:00
>  |OLD;2019-08-09 10:10:00
>  |OLD;2019-08-09 10:15:00""".stripMargin)
>  query.processAllAvailable()
>  val lp1 = query.lastProgress
>  println(lp1.eventTime)
>  writeFile(new Path(dir, "file2"), """
>  |NEW;2020-08-29 10:05:00
>  |NEW;2020-08-29 10:10:00
>  |NEW;2020-08-29 10:15:00""".stripMargin)
>  query.processAllAvailable()
>  val lp2 = query.lastProgress
>  println(lp2.eventTime)
>  writeFile(new Path(dir, "file4"), """
>  |OLD;2017-08-10 10:05:00
>  |OLD;2017-08-10 10:10:00
>  |OLD;2017-08-10 10:15:00""".stripMargin)
>  writeFile(new Path(dir, "file3"), "")
>  query.processAllAvailable()
>  val lp3 = query.lastProgress
>  println(lp3.eventTime)
>  query.awaitTermination()
>  fs.delete(dir, true)
> }
> {code}
> OUTPUT:
>  
> {code:java}
> ---
> Batch: 0
> ---
> +--+-+
> |window |count|
> +--+-+
> |[2019-08-09 10:05:00, 2019-08-09 10:05:10]|1 |
> |[2019-08-09 10:15:00, 2019-08-09 10:15:10]|1 |
> |[2019-08-09 10:10:00, 2019-08-09 10:10:10]|1 |
> +--+-+
> {min=2019-08-09T17:05:00.000Z, avg=2019-08-09T17:10:00.000Z, 
> watermark=1970-01-01T00:00:00.000Z, max=2019-08-09T17:15:00.000Z}
> ---
> Batch: 1
> ---
> +--+-+
> |window |count|
> +--+-+
> |[2020-08-29 10:15:00, 2020-08-29 10:15:10]|1 |
> |[2020-08-29 10:10:00, 2020-08-29 10:10:10]|1 |
> |[2019-08-09 10:05:00, 2019-08-09 10:05:10]|1 |
> |[2020-08-29 10:05:00, 2020-08-29 10:05:10]|1 |
> |[2019-08-09 10:15:00, 2019-08-09 10:15:10]|1 |
> |[2019-08-09 10:10:00, 2019-08-09 10:10:10]|1 |
> +--+-+
> {min=2020-08-29T17:05:00.000Z, avg=2020-08-29T17:10:00.000Z, 
> watermark=2019-08-09T17:14:55.000Z, max=2020-08-29T17:15:00.000Z}
> ---
> Batch: 2
> ---
> +--+-+
> |window |count|
> +--+-+
> |[2017-08-10 10:15:00, 2017-08-10 10:15:10]|1 |
> |[2020-08-29 10:15:00, 2020-08-29 10:15:10]|1 |
> |[2017-08-10 10:05:00, 2017-08-10 10:05:10]|1 |
> |[2020-08-29 10:10:00, 2020-08-29 10:10:10]|1 |
> |[2019-08-09 10:05:00, 2019-08-09 10:05:10]|1 |
> |[2017-08-10 10:10:00, 2017-08-10 10:10:10]|1 |
> |[2020-08-29 10:05:00, 2020-08-29 10:05:10]|1 |
> |[2019-08-09 10:15:00, 2019-08-09 10:15:10]|1 |
> |[2019-08-09 10:10:00, 2019-08-09 10:10:10]|1 |
> +--+-+
> {min=2017-08-10T17:05:00.000Z, avg=2017-08-10T17:10:00.000Z, 
> watermark=2020-08-29T17:14:55.000Z, max=2017-08-10T17:15:00.000Z}
> {
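
For reference, the numbers above match the documented behaviour rather than a miscalculation: the watermark reported for a trigger is the maximum event time seen in previous triggers minus the delay, and because the query runs in complete output mode the aggregation state is never dropped, so old windows keep being re-emitted. A small sketch of the arithmetic (timestamps as rendered in the progress output; not part of the reported program):

{code:scala}
// Watermark arithmetic only, assuming the UTC rendering shown in lastProgress.
import java.sql.Timestamp

val delayMs = 5000L  // from .withWatermark("timestamp", "5 seconds")

def nextWatermark(maxEventTimeSoFar: Timestamp): Timestamp =
  new Timestamp(maxEventTimeSoFar.getTime - delayMs)

// Batch 0 saw a max event time of 2019-08-09 17:15:00, so the watermark reported
// with batch 1 is 2019-08-09 17:14:55 -- exactly the value printed above.
println(nextWatermark(Timestamp.valueOf("2019-08-09 17:15:00")))

// With outputMode("complete") every window is re-emitted on each trigger and no
// state is cleaned up, so windows older than the watermark still show up in batch 2.
{code}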

[jira] [Resolved] (SPARK-33039) Misleading watermark calculation in structure streaming

2020-10-09 Thread Sandish Kumar HN (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandish Kumar HN resolved SPARK-33039.
--
Resolution: Invalid

> Misleading watermark calculation in structure streaming
> ---
>
> Key: SPARK-33039
> URL: https://issues.apache.org/jira/browse/SPARK-33039
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4
>Reporter: Sandish Kumar HN
>Priority: Major
>
> source code:
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.hadoop.fs.Path
> import java.sql.Timestamp
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.streaming.{ProcessingTime, Trigger}
> object TestWaterMark extends App {
>  val spark = SparkSession.builder().master("local").getOrCreate()
>  val sc = spark.sparkContext
>  val dir = new Path("/tmp/test-structured-streaming")
>  val fs = dir.getFileSystem(sc.hadoopConfiguration)
>  fs.mkdirs(dir)
>  val schema = StructType(StructField("vilue", StringType) ::
>  StructField("timestamp", TimestampType) ::
>  Nil)
>  val eventStream = spark
>  .readStream
>  .option("sep", ";")
>  .option("header", "false")
>  .schema(schema)
>  .csv(dir.toString)
>  // Watermarked aggregation
>  val eventsCount = eventStream
>  .withWatermark("timestamp", "5 seconds")
>  .groupBy(window(col("timestamp"), "10 seconds"))
>  .count
>  def writeFile(path: Path, data: String) {
>  val file = fs.create(path)
>  file.writeUTF(data)
>  file.close()
>  }
>  // Debug query
>  val query = eventsCount.writeStream
>  .format("console")
>  .outputMode("complete")
>  .option("truncate", "false")
>  .trigger(Trigger.ProcessingTime("5 seconds"))
>  .start()
>  writeFile(new Path(dir, "file1"), """
>  |OLD;2019-08-09 10:05:00
>  |OLD;2019-08-09 10:10:00
>  |OLD;2019-08-09 10:15:00""".stripMargin)
>  query.processAllAvailable()
>  val lp1 = query.lastProgress
>  println(lp1.eventTime)
>  writeFile(new Path(dir, "file2"), """
>  |NEW;2020-08-29 10:05:00
>  |NEW;2020-08-29 10:10:00
>  |NEW;2020-08-29 10:15:00""".stripMargin)
>  query.processAllAvailable()
>  val lp2 = query.lastProgress
>  println(lp2.eventTime)
>  writeFile(new Path(dir, "file4"), """
>  |OLD;2017-08-10 10:05:00
>  |OLD;2017-08-10 10:10:00
>  |OLD;2017-08-10 10:15:00""".stripMargin)
>  writeFile(new Path(dir, "file3"), "")
>  query.processAllAvailable()
>  val lp3 = query.lastProgress
>  println(lp3.eventTime)
>  query.awaitTermination()
>  fs.delete(dir, true)
> }
> {code}
> OUTPUT:
>  
> {code:java}
> ---
> Batch: 0
> ---
> +--+-+
> |window |count|
> +--+-+
> |[2019-08-09 10:05:00, 2019-08-09 10:05:10]|1 |
> |[2019-08-09 10:15:00, 2019-08-09 10:15:10]|1 |
> |[2019-08-09 10:10:00, 2019-08-09 10:10:10]|1 |
> +--+-+
> {min=2019-08-09T17:05:00.000Z, avg=2019-08-09T17:10:00.000Z, 
> watermark=1970-01-01T00:00:00.000Z, max=2019-08-09T17:15:00.000Z}
> ---
> Batch: 1
> ---
> +--+-+
> |window |count|
> +--+-+
> |[2020-08-29 10:15:00, 2020-08-29 10:15:10]|1 |
> |[2020-08-29 10:10:00, 2020-08-29 10:10:10]|1 |
> |[2019-08-09 10:05:00, 2019-08-09 10:05:10]|1 |
> |[2020-08-29 10:05:00, 2020-08-29 10:05:10]|1 |
> |[2019-08-09 10:15:00, 2019-08-09 10:15:10]|1 |
> |[2019-08-09 10:10:00, 2019-08-09 10:10:10]|1 |
> +--+-+
> {min=2020-08-29T17:05:00.000Z, avg=2020-08-29T17:10:00.000Z, 
> watermark=2019-08-09T17:14:55.000Z, max=2020-08-29T17:15:00.000Z}
> ---
> Batch: 2
> ---
> +--+-+
> |window |count|
> +--+-+
> |[2017-08-10 10:15:00, 2017-08-10 10:15:10]|1 |
> |[2020-08-29 10:15:00, 2020-08-29 10:15:10]|1 |
> |[2017-08-10 10:05:00, 2017-08-10 10:05:10]|1 |
> |[2020-08-29 10:10:00, 2020-08-29 10:10:10]|1 |
> |[2019-08-09 10:05:00, 2019-08-09 10:05:10]|1 |
> |[2017-08-10 10:10:00, 2017-08-10 10:10:10]|1 |
> |[2020-08-29 10:05:00, 2020-08-29 10:05:10]|1 |
> |[2019-08-09 10:15:00, 2019-08-09 10:15:10]|1 |
> |[2019-08-09 10:10:00, 2019-08-09 10:10:10]|1 |
> +--+-+
> {min=2017-08-10T17:05:00.000Z, avg=2017-08-10T17:10:00.000Z, 
> watermark=2020-08-29T17:14:55.000Z, max=2017-08-10T17:15:00.000Z}
> {code}
> EXPECTED:
> expected to drop the last batch events to get dropped as the watermark i

[jira] [Commented] (SPARK-33039) Misleading watermark calculation in structure streaming

2020-10-09 Thread Aoyuan Liao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211288#comment-17211288
 ] 

Aoyuan Liao commented on SPARK-33039:
-

[~srowen] This is actually not a bug. The user didn't fully understand the 
documentation; the output is correct and matches what we intended.
Can we mark it as "not a problem"?

> Misleading watermark calculation in structure streaming
> ---
>
> Key: SPARK-33039
> URL: https://issues.apache.org/jira/browse/SPARK-33039
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4
>Reporter: Sandish Kumar HN
>Priority: Major
>
> source code:
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.hadoop.fs.Path
> import java.sql.Timestamp
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.streaming.{ProcessingTime, Trigger}
> object TestWaterMark extends App {
>  val spark = SparkSession.builder().master("local").getOrCreate()
>  val sc = spark.sparkContext
>  val dir = new Path("/tmp/test-structured-streaming")
>  val fs = dir.getFileSystem(sc.hadoopConfiguration)
>  fs.mkdirs(dir)
>  val schema = StructType(StructField("vilue", StringType) ::
>  StructField("timestamp", TimestampType) ::
>  Nil)
>  val eventStream = spark
>  .readStream
>  .option("sep", ";")
>  .option("header", "false")
>  .schema(schema)
>  .csv(dir.toString)
>  // Watermarked aggregation
>  val eventsCount = eventStream
>  .withWatermark("timestamp", "5 seconds")
>  .groupBy(window(col("timestamp"), "10 seconds"))
>  .count
>  def writeFile(path: Path, data: String) {
>  val file = fs.create(path)
>  file.writeUTF(data)
>  file.close()
>  }
>  // Debug query
>  val query = eventsCount.writeStream
>  .format("console")
>  .outputMode("complete")
>  .option("truncate", "false")
>  .trigger(Trigger.ProcessingTime("5 seconds"))
>  .start()
>  writeFile(new Path(dir, "file1"), """
>  |OLD;2019-08-09 10:05:00
>  |OLD;2019-08-09 10:10:00
>  |OLD;2019-08-09 10:15:00""".stripMargin)
>  query.processAllAvailable()
>  val lp1 = query.lastProgress
>  println(lp1.eventTime)
>  writeFile(new Path(dir, "file2"), """
>  |NEW;2020-08-29 10:05:00
>  |NEW;2020-08-29 10:10:00
>  |NEW;2020-08-29 10:15:00""".stripMargin)
>  query.processAllAvailable()
>  val lp2 = query.lastProgress
>  println(lp2.eventTime)
>  writeFile(new Path(dir, "file4"), """
>  |OLD;2017-08-10 10:05:00
>  |OLD;2017-08-10 10:10:00
>  |OLD;2017-08-10 10:15:00""".stripMargin)
>  writeFile(new Path(dir, "file3"), "")
>  query.processAllAvailable()
>  val lp3 = query.lastProgress
>  println(lp3.eventTime)
>  query.awaitTermination()
>  fs.delete(dir, true)
> }
> {code}
> OUTPUT:
>  
> {code:java}
> ---
> Batch: 0
> ---
> +--+-+
> |window |count|
> +--+-+
> |[2019-08-09 10:05:00, 2019-08-09 10:05:10]|1 |
> |[2019-08-09 10:15:00, 2019-08-09 10:15:10]|1 |
> |[2019-08-09 10:10:00, 2019-08-09 10:10:10]|1 |
> +--+-+
> {min=2019-08-09T17:05:00.000Z, avg=2019-08-09T17:10:00.000Z, 
> watermark=1970-01-01T00:00:00.000Z, max=2019-08-09T17:15:00.000Z}
> ---
> Batch: 1
> ---
> +--+-+
> |window |count|
> +--+-+
> |[2020-08-29 10:15:00, 2020-08-29 10:15:10]|1 |
> |[2020-08-29 10:10:00, 2020-08-29 10:10:10]|1 |
> |[2019-08-09 10:05:00, 2019-08-09 10:05:10]|1 |
> |[2020-08-29 10:05:00, 2020-08-29 10:05:10]|1 |
> |[2019-08-09 10:15:00, 2019-08-09 10:15:10]|1 |
> |[2019-08-09 10:10:00, 2019-08-09 10:10:10]|1 |
> +--+-+
> {min=2020-08-29T17:05:00.000Z, avg=2020-08-29T17:10:00.000Z, 
> watermark=2019-08-09T17:14:55.000Z, max=2020-08-29T17:15:00.000Z}
> ---
> Batch: 2
> ---
> +--+-+
> |window |count|
> +--+-+
> |[2017-08-10 10:15:00, 2017-08-10 10:15:10]|1 |
> |[2020-08-29 10:15:00, 2020-08-29 10:15:10]|1 |
> |[2017-08-10 10:05:00, 2017-08-10 10:05:10]|1 |
> |[2020-08-29 10:10:00, 2020-08-29 10:10:10]|1 |
> |[2019-08-09 10:05:00, 2019-08-09 10:05:10]|1 |
> |[2017-08-10 10:10:00, 2017-08-10 10:10:10]|1 |
> |[2020-08-29 10:05:00, 2020-08-29 10:05:10]|1 |
> |[2019-08-09 10:15:00, 2019-08-09 10:15:10]|1 |
> |[2019-08-09 10:10:00, 2019-08-09 10:10:10]|1 |
> +--+-+
> {min=2017-08-10T17:05:00.000Z, avg

[jira] [Resolved] (SPARK-31430) Bug in the approximate quantile computation.

2020-10-09 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31430.
--
Resolution: Duplicate

> Bug in the approximate quantile computation.
> 
>
> Key: SPARK-31430
> URL: https://issues.apache.org/jira/browse/SPARK-31430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Siddartha Naidu
>Priority: Major
> Attachments: approx_quantile_data.csv
>
>
> I am seeing a bug where passing lower relative error to the 
> {{approxQuantile}} function is leading to incorrect result in the presence of 
> partitions. Setting a relative error 1e-6 causes it to compute equal values 
> for 0.9 and 1.0 quantiles. Coalescing it back to 1 partition gives correct 
> results. This issue was not present in spark version 2.4.5, we noticed it 
> when testing 3.0.0-preview.
> {{>>> df = spark.read.csv('file:///tmp/approx_quantile_data.csv', 
> header=True, 
> schema=T.StructType([T.StructField('Store',T.StringType(),True),T.StructField('seconds',T.LongType(),True)]))}}
> {{>>> df = df.repartition(200, 'Store').localCheckpoint()}}
> {{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.0001)}}
> {{[1422576000.0, 1430352000.0, 1438300800.0]}}
> {{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.1)}}
> {{[1422576000.0, 1430524800.0, 1438300800.0]}}
> {color:#de350b}{{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 
> 0.01)}}{color}
> {color:#de350b}{{[1422576000.0, 1438300800.0, 1438300800.0]}}{color}
> {{>>> df.coalesce(1).approxQuantile('seconds', [0.8, 0.9, 1.0], 0.01)}}
> {{[1422576000.0, 1430524800.0, 1438300800.0]}}
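
For reference, the same call through the Scala API (a usage sketch only, assuming a local SparkSession; it is not the fix): the third argument is the target relative error, so a smaller value is supposed to tighten the estimate, never move it to a different quantile.

{code:scala}
// Usage sketch of approxQuantile; the CSV path, column names and repartitioning
// mirror the report, the returned values depend on the data.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.read
  .option("header", "true")
  .csv("file:///tmp/approx_quantile_data.csv")
  .selectExpr("Store", "CAST(seconds AS LONG) AS seconds")
  .repartition(200, col("Store"))

// probabilities = quantiles to estimate, relativeError = target precision (0.0 = exact).
val quantiles = df.stat.approxQuantile("seconds", Array(0.8, 0.9, 1.0), 1e-4)
println(quantiles.mkString(", "))
{code}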



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16859) History Server storage information is missing

2020-10-09 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-16859.
--
Resolution: Not A Problem

Sounds fine. It's very old in any event.

> History Server storage information is missing
> -
>
> Key: SPARK-16859
> URL: https://issues.apache.org/jira/browse/SPARK-16859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Andrei Ivanov
>Priority: Major
>  Labels: historyserver, newbie
>
> It looks like job history storage tab in history server is broken for 
> completed jobs since *1.6.2*. 
> More specifically it's broken since 
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed for my installation by effectively reverting the above patch 
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement 
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ making 
> sure it works from _ReplayListenerBus_.
> The downside will be that it will still work incorrectly with pre patch job 
> histories. But then, it doesn't work since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually. But I'm pretty new to 
> Apache Spark and missing hands on Scala experience. So  I'd prefer that it be 
> fixed by someone experienced with roadmap vision. If nobody volunteers I'll 
> try to patch myself.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31430) Bug in the approximate quantile computation.

2020-10-09 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211286#comment-17211286
 ] 

Sean R. Owen commented on SPARK-31430:
--

Sounds good, I usually mark as a Duplicate.

> Bug in the approximate quantile computation.
> 
>
> Key: SPARK-31430
> URL: https://issues.apache.org/jira/browse/SPARK-31430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Siddartha Naidu
>Priority: Major
> Attachments: approx_quantile_data.csv
>
>
> I am seeing a bug where passing lower relative error to the 
> {{approxQuantile}} function is leading to incorrect result in the presence of 
> partitions. Setting a relative error 1e-6 causes it to compute equal values 
> for 0.9 and 1.0 quantiles. Coalescing it back to 1 partition gives correct 
> results. This issue was not present in spark version 2.4.5, we noticed it 
> when testing 3.0.0-preview.
> {{>>> df = spark.read.csv('file:///tmp/approx_quantile_data.csv', 
> header=True, 
> schema=T.StructType([T.StructField('Store',T.StringType(),True),T.StructField('seconds',T.LongType(),True)]))}}
> {{>>> df = df.repartition(200, 'Store').localCheckpoint()}}
> {{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.0001)}}
> {{[1422576000.0, 1430352000.0, 1438300800.0]}}
> {{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.1)}}
> {{[1422576000.0, 1430524800.0, 1438300800.0]}}
> {color:#de350b}{{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 
> 0.01)}}{color}
> {color:#de350b}{{[1422576000.0, 1438300800.0, 1438300800.0]}}{color}
> {{>>> df.coalesce(1).approxQuantile('seconds', [0.8, 0.9, 1.0], 0.01)}}
> {{[1422576000.0, 1430524800.0, 1438300800.0]}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16859) History Server storage information is missing

2020-10-09 Thread Aoyuan Liao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211284#comment-17211284
 ] 

Aoyuan Liao commented on SPARK-16859:
-

[~srowen] After configuring "spark.eventLog.logBlockUpdates.enabled=true", it 
works on v3.0.1. Should we mark it as "not a problem"?
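
For anyone hitting the same gap, the setting is off by default; a minimal sketch of enabling it when building the session (the property names are the documented ones, the path and cached dataset are only illustrative):

{code:scala}
// Sketch: turn on event logging plus block-update events so the History Server's
// Storage tab has something to show. Note that logging block updates can grow the
// event logs considerably.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("storage-tab-demo")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs:///spark-events")          // illustrative path
  .config("spark.eventLog.logBlockUpdates.enabled", "true")      // the setting above
  .getOrCreate()

// Cache something so block-update events are actually written to the event log.
spark.range(0, 1000000).cache().count()
spark.stop()
{code}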

> History Server storage information is missing
> -
>
> Key: SPARK-16859
> URL: https://issues.apache.org/jira/browse/SPARK-16859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Andrei Ivanov
>Priority: Major
>  Labels: historyserver, newbie
>
> It looks like job history storage tab in history server is broken for 
> completed jobs since *1.6.2*. 
> More specifically it's broken since 
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed for my installation by effectively reverting the above patch 
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement 
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ making 
> sure it works from _ReplayListenerBus_.
> The downside will be that it will still work incorrectly with pre patch job 
> histories. But then, it doesn't work since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually. But I'm pretty new to 
> Apache Spark and missing hands on Scala experience. So  I'd prefer that it be 
> fixed by someone experienced with roadmap vision. If nobody volunteers I'll 
> try to patch myself.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata

2020-10-09 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211283#comment-17211283
 ] 

Sean R. Owen commented on SPARK-9686:
-

[~EveLiao] Probably - do you know which other issue or PR might have resolved it, 
so we can mark this as a Duplicate? If we don't know, I usually just mark it "Not a 
Problem" (anymore).

> Spark Thrift server doesn't return correct JDBC metadata 
> -
>
> Key: SPARK-9686
> URL: https://issues.apache.org/jira/browse/SPARK-9686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2
>Reporter: pin_zhang
>Priority: Critical
> Attachments: SPARK-9686.1.patch.txt
>
>
> 1. Start  start-thriftserver.sh
> 2. connect with beeline
> 3. create table
> 4. show tables; the newly created table is returned
> 5.
>   Class.forName("org.apache.hive.jdbc.HiveDriver");
>   String URL = "jdbc:hive2://localhost:1/default";
>Properties info = new Properties();
> Connection conn = DriverManager.getConnection(URL, info);
>   ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
>null, null, null);
> Problem:
> No tables are returned by this API, which worked in Spark 1.3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31430) Bug in the approximate quantile computation.

2020-10-09 Thread Aoyuan Liao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211282#comment-17211282
 ] 

Aoyuan Liao commented on SPARK-31430:
-

[~srowen] This is already fixed.

> Bug in the approximate quantile computation.
> 
>
> Key: SPARK-31430
> URL: https://issues.apache.org/jira/browse/SPARK-31430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Siddartha Naidu
>Priority: Major
> Attachments: approx_quantile_data.csv
>
>
> I am seeing a bug where passing lower relative error to the 
> {{approxQuantile}} function is leading to incorrect result in the presence of 
> partitions. Setting a relative error 1e-6 causes it to compute equal values 
> for 0.9 and 1.0 quantiles. Coalescing it back to 1 partition gives correct 
> results. This issue was not present in spark version 2.4.5, we noticed it 
> when testing 3.0.0-preview.
> {{>>> df = spark.read.csv('file:///tmp/approx_quantile_data.csv', 
> header=True, 
> schema=T.StructType([T.StructField('Store',T.StringType(),True),T.StructField('seconds',T.LongType(),True)]))}}
> {{>>> df = df.repartition(200, 'Store').localCheckpoint()}}
> {{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.0001)}}
> {{[1422576000.0, 1430352000.0, 1438300800.0]}}
> {{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.1)}}
> {{[1422576000.0, 1430524800.0, 1438300800.0]}}
> {color:#de350b}{{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 
> 0.01)}}{color}
> {color:#de350b}{{[1422576000.0, 1438300800.0, 1438300800.0]}}{color}
> {{>>> df.coalesce(1).approxQuantile('seconds', [0.8, 0.9, 1.0], 0.01)}}
> {{[1422576000.0, 1430524800.0, 1438300800.0]}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata

2020-10-09 Thread Aoyuan Liao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211280#comment-17211280
 ] 

Aoyuan Liao commented on SPARK-9686:


[~srowen] This is resolved on 3.0.1. Should we mark it as fixed?

> Spark Thrift server doesn't return correct JDBC metadata 
> -
>
> Key: SPARK-9686
> URL: https://issues.apache.org/jira/browse/SPARK-9686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2
>Reporter: pin_zhang
>Priority: Critical
> Attachments: SPARK-9686.1.patch.txt
>
>
> 1. Start  start-thriftserver.sh
> 2. connect with beeline
> 3. create table
> 4. show tables; the newly created table is returned
> 5.
>   Class.forName("org.apache.hive.jdbc.HiveDriver");
>   String URL = "jdbc:hive2://localhost:1/default";
>Properties info = new Properties();
> Connection conn = DriverManager.getConnection(URL, info);
>   ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
>null, null, null);
> Problem:
> No tables are returned by this API, which worked in Spark 1.3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33062) Make DataFrameReader.jdbc work for DataSource V2

2020-10-09 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-33062.

Resolution: Not A Problem

> Make DataFrameReader.jdbc work for DataSource V2 
> -
>
> Key: SPARK-33062
> URL: https://issues.apache.org/jira/browse/SPARK-33062
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Support multiple catalogs in DataFrameReader.jdbc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33010) Make DataFrameWriter.jdbc work for DataSource V2

2020-10-09 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-33010.

Resolution: Not A Problem

> Make DataFrameWriter.jdbc work for DataSource V2 
> -
>
> Key: SPARK-33010
> URL: https://issues.apache.org/jira/browse/SPARK-33010
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Support multiple catalogs in DataFrameWriter.jdbc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33081:


Assignee: Apache Spark

> Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of 
> columns (DB2 dialect)
> --
>
> Key: SPARK-33081
> URL: https://issues.apache.org/jira/browse/SPARK-33081
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> Override the default SQL strings for:
> * ALTER TABLE UPDATE COLUMN TYPE
> * ALTER TABLE UPDATE COLUMN NULLABILITY
> in the following DB2 JDBC dialect according to official documentation.
> Write DB2 integration tests for JDBC.
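
For context, the DB2 statements such an override would need to emit look roughly like the following (SQL sketched from DB2's documented ALTER TABLE syntax, with placeholder table and column names; the dialect wiring itself is not shown):

{code:scala}
// Placeholder names only; these are the raw statements the DB2 dialect would generate.
val updateColumnType = "ALTER TABLE my_table ALTER COLUMN my_col SET DATA TYPE DOUBLE"
val setNotNull       = "ALTER TABLE my_table ALTER COLUMN my_col SET NOT NULL"
val dropNotNull      = "ALTER TABLE my_table ALTER COLUMN my_col DROP NOT NULL"
{code}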



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33081:


Assignee: (was: Apache Spark)

> Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of 
> columns (DB2 dialect)
> --
>
> Key: SPARK-33081
> URL: https://issues.apache.org/jira/browse/SPARK-33081
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Override the default SQL strings for:
> * ALTER TABLE UPDATE COLUMN TYPE
> * ALTER TABLE UPDATE COLUMN NULLABILITY
> in the following DB2 JDBC dialect according to official documentation.
> Write DB2 integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33102) Use stringToSeq on SQL list typed parameters

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33102:


Assignee: (was: Apache Spark)

> Use stringToSeq on SQL list typed parameters
> 
>
> Key: SPARK-33102
> URL: https://issues.apache.org/jira/browse/SPARK-33102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33102) Use stringToSeq on SQL list typed parameters

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33102:


Assignee: Apache Spark

> Use stringToSeq on SQL list typed parameters
> 
>
> Key: SPARK-33102
> URL: https://issues.apache.org/jira/browse/SPARK-33102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33102) Use stringToSeq on SQL list typed parameters

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211166#comment-17211166
 ] 

Apache Spark commented on SPARK-33102:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/29989

> Use stringToSeq on SQL list typed parameters
> 
>
> Key: SPARK-33102
> URL: https://issues.apache.org/jira/browse/SPARK-33102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)

2020-10-09 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao reopened SPARK-33081:


> Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of 
> columns (DB2 dialect)
> --
>
> Key: SPARK-33081
> URL: https://issues.apache.org/jira/browse/SPARK-33081
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Override the default SQL strings for:
> * ALTER TABLE UPDATE COLUMN TYPE
> * ALTER TABLE UPDATE COLUMN NULLABILITY
> in the following DB2 JDBC dialect according to official documentation.
> Write DB2 integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)

2020-10-09 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-33081:
---
Comment: was deleted

(was: This is done by smaller subtasks)

> Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of 
> columns (DB2 dialect)
> --
>
> Key: SPARK-33081
> URL: https://issues.apache.org/jira/browse/SPARK-33081
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Override the default SQL strings for:
> * ALTER TABLE UPDATE COLUMN TYPE
> * ALTER TABLE UPDATE COLUMN NULLABILITY
> in the following DB2 JDBC dialect according to official documentation.
> Write DB2 integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)

2020-10-09 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211144#comment-17211144
 ] 

Huaxin Gao commented on SPARK-33081:


This is done by smaller subtasks

> Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of 
> columns (DB2 dialect)
> --
>
> Key: SPARK-33081
> URL: https://issues.apache.org/jira/browse/SPARK-33081
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Override the default SQL strings for:
> * ALTER TABLE UPDATE COLUMN TYPE
> * ALTER TABLE UPDATE COLUMN NULLABILITY
> in the following DB2 JDBC dialect according to official documentation.
> Write DB2 integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33081) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)

2020-10-09 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-33081.

Resolution: Not A Problem

> Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of 
> columns (DB2 dialect)
> --
>
> Key: SPARK-33081
> URL: https://issues.apache.org/jira/browse/SPARK-33081
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Override the default SQL strings for:
> * ALTER TABLE UPDATE COLUMN TYPE
> * ALTER TABLE UPDATE COLUMN NULLABILITY
> in the following DB2 JDBC dialect according to official documentation.
> Write DB2 integration tests for JDBC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33102) Use stringToSeq on SQL list typed parameters

2020-10-09 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1723#comment-1723
 ] 

Gabor Somogyi commented on SPARK-33102:
---

Filing a PR soon...

> Use stringToSeq on SQL list typed parameters
> 
>
> Key: SPARK-33102
> URL: https://issues.apache.org/jira/browse/SPARK-33102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33102) Use stringToSeq on SQL list typed parameters

2020-10-09 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-33102:
-

 Summary: Use stringToSeq on SQL list typed parameters
 Key: SPARK-33102
 URL: https://issues.apache.org/jira/browse/SPARK-33102
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Gabor Somogyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25080) NPE in HiveShim$.toCatalystDecimal(HiveShim.scala:110)

2020-10-09 Thread Anika Kelhanka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211011#comment-17211011
 ] 

Anika Kelhanka edited comment on SPARK-25080 at 10/9/20, 2:24 PM:
--

I am able to reproduce this issue from spark-shell in Spark 2.4 while querying an 
external Hive-on-Parquet table. The scenario is: certain decimal fields in the 
Parquet files hold values with higher precision than the Hive table declares, so 
Parquet has a value that needs to be converted to a target type that does not have 
enough precision. 

scala> val df = spark.sql("select 'dummy' as name, 
100010.7010 as value")

scala> df.write.mode("Overwrite").parquet("/my/hdfs/location/test")

 

hive> create external table db1.test_precision(name string, value 
Decimal(18,6)) STORED As PARQUET LOCATION '/my/hdfs/location/test';

 

scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet","false")

scala> val df_hive = spark.sql("select * from db_gwm_morph_mrd.test_precision")

scala> df_hive.show
 20/10/09 09:33:12 WARN hadoop.ParquetRecordReader: Can not initialize counter 
due to context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 20/10/09 09:33:12 ERROR executor.Executor: Exception in task 0.0 in stage 5.0 
(TID 5)
 java.lang.NullPointerException
 at org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:107)
 at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:415)
 at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:414)
 at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:443)
 at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:434)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
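
A hedged workaround sketch (not a fix for the NPE itself): keep the written decimals within the precision/scale the Hive table declares, for example by casting before the write, so the reader never has to narrow a value that does not fit.

{code:scala}
// Workaround sketch only: cast to the declared Decimal(18,6) before writing, so the
// Parquet data and the Hive column agree on precision. Values that still do not fit
// become null at the cast rather than failing later in HiveShim.toCatalystDecimal.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.sql("select 'dummy' as name, cast(100010.7010 as decimal(38,18)) as value")

df.select(col("name"), col("value").cast(DecimalType(18, 6)).as("value"))
  .write.mode("Overwrite").parquet("/my/hdfs/location/test")   // location from the report
{code}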

 

 

 

 

 


was (Author: anikakelhanka):
I am able to reproduce this issue from spark-shell in Spark 2.4 while querying an 
external Hive-on-Parquet table whose values have higher precision (on the significant 
side) than the Hive table is defined with. Th



scala> val df = spark.sql("select 'dummy' as name, 
100010.7010 as value")

scala> df.write.mode("Overwrite").parquet("/my/hdfs/location/test")

 

hive> create external table db1.test_precision(name string, value 
Decimal(18,6)) STORED As PARQUET LOCATION '/my/hdfs/location/test';

 

scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet","false")

scala> val df_hive = spark.sql("select * from db_gwm_morph_mrd.test_precision")

scala> df_hive.show
20/10/09 09:33:12 WARN hadoop.ParquetRecordReader: Can not initialize counter 
due to context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
20/10/09 09:33:12 ERROR executor.Executor: Exception in task 0.0 in stage 5.0 
(TID 5)
java.lang.NullPointerException
 at org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:107)
 at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:41

[jira] [Commented] (SPARK-33098) Exception when using 'in' to compare a partition column to a literal with the wrong type

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211040#comment-17211040
 ] 

Apache Spark commented on SPARK-33098:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/29988

> Exception when using 'in' to compare a partition column to a literal with the 
> wrong type
> 
>
> Key: SPARK-33098
> URL: https://issues.apache.org/jira/browse/SPARK-33098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Comparing a partition column against a literal with the wrong type works if 
> you use equality ('='). However, if you use 'in', you get:
> {noformat}
> MetaException(message:Filtering is supported only on partition keys of type 
> string)
> {noformat}
> For example:
> {noformat}
> spark-sql> create table test (a int) partitioned by (b int) stored as parquet;
> Time taken: 0.323 seconds
> spark-sql> insert into test values (1, 1), (1, 2), (2, 2);
> 20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
> 20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
> 20/10/08 19:57:14 WARN log: Updated size to 418
> 20/10/08 19:57:14 WARN log: Updated size to 836
> Time taken: 2.124 seconds
> spark-sql> -- this works, of course
> spark-sql> select * from test where b in (2);
> 1 2
> 2 2
> Time taken: 0.13 seconds, Fetched 2 row(s)
> spark-sql> -- this also works (equals with wrong type)
> spark-sql> select * from test where b = '2';
> 1 2
> 2 2
> Time taken: 0.132 seconds, Fetched 2 row(s)
> spark-sql> -- this does not work ('in' with wrong type)
> spark-sql> select * from test where b in ('2');
> 20/10/08 19:58:30 ERROR SparkSQLDriver: Failed in [select * from test where b 
> in ('2')]
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
> -
> -
> -
> Caused by: MetaException(message:Filtering is supported only on partition 
> keys of type string)
> {noformat}
> There are also interesting variations of this using the dataframe API:
> {noformat}
> scala> sql("select cast(b as string) as b from test where b in 
> (2)").show(false)
> +---+
> |b  |
> +---+
> |2  |
> |2  |
> +---+
> scala> sql("select cast(b as string) as b from test").filter("b in 
> (2)").show(false)
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
> -
> -
> Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is 
> supported only on partition keys of type string
> {noformat}
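
Until a fix lands, one workaround sketch (not the patch itself) is to make the IN-list literal's type match the partition column explicitly, so the filter pushed to the metastore stays on the integer key:

{code:scala}
// Workaround sketch using the table from the report; assumes Hive support is enabled.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Matches the working '=' case above: the literal is already an int, so pruning succeeds.
spark.sql("SELECT * FROM test WHERE b IN (2)").show()

// If the value arrives as a string, cast it explicitly instead of relying on coercion.
spark.sql("SELECT * FROM test WHERE b IN (CAST('2' AS INT))").show()
{code}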



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33098) Exception when using 'in' to compare a partition column to a literal with the wrong type

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211043#comment-17211043
 ] 

Apache Spark commented on SPARK-33098:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/29988

> Exception when using 'in' to compare a partition column to a literal with the 
> wrong type
> 
>
> Key: SPARK-33098
> URL: https://issues.apache.org/jira/browse/SPARK-33098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Comparing a partition column against a literal with the wrong type works if 
> you use equality ('='). However, if you use 'in', you get:
> {noformat}
> MetaException(message:Filtering is supported only on partition keys of type 
> string)
> {noformat}
> For example:
> {noformat}
> spark-sql> create table test (a int) partitioned by (b int) stored as parquet;
> Time taken: 0.323 seconds
> spark-sql> insert into test values (1, 1), (1, 2), (2, 2);
> 20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
> 20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
> 20/10/08 19:57:14 WARN log: Updated size to 418
> 20/10/08 19:57:14 WARN log: Updated size to 836
> Time taken: 2.124 seconds
> spark-sql> -- this works, of course
> spark-sql> select * from test where b in (2);
> 1 2
> 2 2
> Time taken: 0.13 seconds, Fetched 2 row(s)
> spark-sql> -- this also works (equals with wrong type)
> spark-sql> select * from test where b = '2';
> 1 2
> 2 2
> Time taken: 0.132 seconds, Fetched 2 row(s)
> spark-sql> -- this does not work ('in' with wrong type)
> spark-sql> select * from test where b in ('2');
> 20/10/08 19:58:30 ERROR SparkSQLDriver: Failed in [select * from test where b 
> in ('2')]
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
> -
> -
> -
> Caused by: MetaException(message:Filtering is supported only on partition 
> keys of type string)
> {noformat}
> There are also interesting variations of this using the dataframe API:
> {noformat}
> scala> sql("select cast(b as string) as b from test where b in 
> (2)").show(false)
> +---+
> |b  |
> +---+
> |2  |
> |2  |
> +---+
> scala> sql("select cast(b as string) as b from test").filter("b in 
> (2)").show(false)
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
> -
> -
> Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is 
> supported only on partition keys of type string
> {noformat}
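
Until the filter pushdown handles the mismatched literal type, the reproduction above already hints at a workaround: keep the IN-list literal in the partition column's own type (int) and cast only in the projection. A minimal sketch against the `test` table from the description (the object and app names are mine, not part of Spark):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of a workaround for the repro above: make the IN-list literal match the
// partition column type (int) so the filter pushed down to the metastore is well-typed.
object InFilterWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-33098-workaround")
      .enableHiveSupport()
      .getOrCreate()

    // Fails with "Filtering is supported only on partition keys of type string":
    // spark.sql("select * from test where b in ('2')").show()

    // Works: the literal already has the partition column's type.
    spark.sql("select * from test where b in (2)").show()

    // Also works: prune on the int column, cast only for presentation.
    spark.sql("select cast(b as string) as b from test where b in (2)").show()

    spark.stop()
  }
}
{code}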



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33098) Exception when using 'in' to compare a partition column to a literal with the wrong type

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33098:


Assignee: (was: Apache Spark)

> Exception when using 'in' to compare a partition column to a literal with the 
> wrong type
> 
>
> Key: SPARK-33098
> URL: https://issues.apache.org/jira/browse/SPARK-33098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Comparing a partition column against a literal with the wrong type works if 
> you use equality ('='). However, if you use 'in', you get:
> {noformat}
> MetaException(message:Filtering is supported only on partition keys of type 
> string)
> {noformat}
> For example:
> {noformat}
> spark-sql> create table test (a int) partitioned by (b int) stored as parquet;
> Time taken: 0.323 seconds
> spark-sql> insert into test values (1, 1), (1, 2), (2, 2);
> 20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
> 20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
> 20/10/08 19:57:14 WARN log: Updated size to 418
> 20/10/08 19:57:14 WARN log: Updated size to 836
> Time taken: 2.124 seconds
> spark-sql> -- this works, of course
> spark-sql> select * from test where b in (2);
> 1 2
> 2 2
> Time taken: 0.13 seconds, Fetched 2 row(s)
> spark-sql> -- this also works (equals with wrong type)
> spark-sql> select * from test where b = '2';
> 1 2
> 2 2
> Time taken: 0.132 seconds, Fetched 2 row(s)
> spark-sql> -- this does not work ('in' with wrong type)
> spark-sql> select * from test where b in ('2');
> 20/10/08 19:58:30 ERROR SparkSQLDriver: Failed in [select * from test where b 
> in ('2')]
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
> -
> -
> -
> Caused by: MetaException(message:Filtering is supported only on partition 
> keys of type string)
> {noformat}
> There are also interesting variations of this using the dataframe API:
> {noformat}
> scala> sql("select cast(b as string) as b from test where b in 
> (2)").show(false)
> +---+
> |b  |
> +---+
> |2  |
> |2  |
> +---+
> scala> sql("select cast(b as string) as b from test").filter("b in 
> (2)").show(false)
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
> -
> -
> Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is 
> supported only on partition keys of type string
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33098) Exception when using 'in' to compare a partition column to a literal with the wrong type

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33098:


Assignee: Apache Spark

> Exception when using 'in' to compare a partition column to a literal with the 
> wrong type
> 
>
> Key: SPARK-33098
> URL: https://issues.apache.org/jira/browse/SPARK-33098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>
> Comparing a partition column against a literal with the wrong type works if 
> you use equality ('='). However, if you use 'in', you get:
> {noformat}
> MetaException(message:Filtering is supported only on partition keys of type 
> string)
> {noformat}
> For example:
> {noformat}
> spark-sql> create table test (a int) partitioned by (b int) stored as parquet;
> Time taken: 0.323 seconds
> spark-sql> insert into test values (1, 1), (1, 2), (2, 2);
> 20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
> 20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
> 20/10/08 19:57:14 WARN log: Updated size to 418
> 20/10/08 19:57:14 WARN log: Updated size to 836
> Time taken: 2.124 seconds
> spark-sql> -- this works, of course
> spark-sql> select * from test where b in (2);
> 1 2
> 2 2
> Time taken: 0.13 seconds, Fetched 2 row(s)
> spark-sql> -- this also works (equals with wrong type)
> spark-sql> select * from test where b = '2';
> 1 2
> 2 2
> Time taken: 0.132 seconds, Fetched 2 row(s)
> spark-sql> -- this does not work ('in' with wrong type)
> spark-sql> select * from test where b in ('2');
> 20/10/08 19:58:30 ERROR SparkSQLDriver: Failed in [select * from test where b 
> in ('2')]
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
> -
> -
> -
> Caused by: MetaException(message:Filtering is supported only on partition 
> keys of type string)
> {noformat}
> There are also interesting variations of this using the dataframe API:
> {noformat}
> scala> sql("select cast(b as string) as b from test where b in 
> (2)").show(false)
> +---+
> |b  |
> +---+
> |2  |
> |2  |
> +---+
> scala> sql("select cast(b as string) as b from test").filter("b in 
> (2)").show(false)
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
> -
> -
> Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is 
> supported only on partition keys of type string
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33094) ORC format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211025#comment-17211025
 ] 

Apache Spark commented on SPARK-33094:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29987

> ORC format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> --
>
> Key: SPARK-33094
> URL: https://issues.apache.org/jira/browse/SPARK-33094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("orc").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.
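
Until a release with the fix is available, one workaround is to put such settings on the session-wide Hadoop configuration rather than in the per-read options. A minimal sketch; the S3A key and the path are placeholders of mine, not taken from the report:

{code:scala}
import org.apache.spark.sql.SparkSession

object OrcHadoopConfWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-33094-workaround").getOrCreate()

    // Hypothetical FileSystem-level setting; any fs.* key behaves the same way.
    val (key, value) = ("fs.s3a.access.key", "<access-key>")

    // Before the fix, a per-read option is not seen by the underlying FileSystem:
    //   spark.read.format("orc").option(key, value).load(path)
    // so set it on the shared Hadoop configuration instead.
    spark.sparkContext.hadoopConfiguration.set(key, value)

    val df = spark.read.format("orc").load("/path/to/data.orc")
    df.show()

    spark.stop()
  }
}
{code}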



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33094) ORC format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211022#comment-17211022
 ] 

Apache Spark commented on SPARK-33094:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29987

> ORC format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> --
>
> Key: SPARK-33094
> URL: https://issues.apache.org/jira/browse/SPARK-33094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("orc").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25080) NPE in HiveShim$.toCatalystDecimal(HiveShim.scala:110)

2020-10-09 Thread Anika Kelhanka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17211011#comment-17211011
 ] 

Anika Kelhanka commented on SPARK-25080:


I am able to reproduce this issue from spark-shell in Spark 2.4 when querying an 
external Hive-on-Parquet table whose values have more digits than the precision 
the Hive table is defined with (on the significant side).



scala> val df = spark.sql("select 'dummy' as name, 
100010.7010 as value")

scala> df.write.mode("Overwrite").parquet("/my/hdfs/location/test")

 

hive> create external table db1.test_precision(name string, value 
Decimal(18,6)) STORED As PARQUET LOCATION '/my/hdfs/location/test';

 

scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet","false")

scala> val df_hive = spark.sql("select * from db1.test_precision")

scala> df_hive.show
20/10/09 09:33:12 WARN hadoop.ParquetRecordReader: Can not initialize counter 
due to context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
20/10/09 09:33:12 ERROR executor.Executor: Exception in task 0.0 in stage 5.0 
(TID 5)
java.lang.NullPointerException
 at org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:107)
 at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:415)
 at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:414)
 at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:443)
 at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:434)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
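
For what it's worth, the NPE above is hit on the Hive SerDe read path, which the reproduction enables by setting spark.sql.hive.convertMetastoreParquet to false; leaving that setting at its default of true routes the same query through Spark's native Parquet reader, which does not go through HiveShim.toCatalystDecimal. A sketch of that side-step, assuming the db1.test_precision table created above (it avoids this particular NPE, not the underlying precision mismatch):

{code:scala}
import org.apache.spark.sql.SparkSession

object DecimalPrecisionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-25080-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Default behaviour: Hive Parquet tables are read with Spark's native
    // Parquet reader, bypassing HiveShim.toCatalystDecimal entirely.
    spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
    spark.table("db1.test_precision").show()

    // The NPE in the comment above appears only on the Hive SerDe path:
    //   spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
    //   spark.table("db1.test_precision").show()   // java.lang.NullPointerException

    spark.stop()
  }
}
{code}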



 

 

 

 

 

> NPE in HiveShim$.toCatalystDecimal(HiveShim.scala:110)
> --
>
> Key: SPARK-25080
> URL: https://issues.apache.org/jira/browse/SPARK-25080
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.3.1
> Environment: AWS EMR
>Reporter: Andrew K Long
>Priority: Minor
>
> NPE while reading hive table.
>  
> ```
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1190 in stage 392.0 failed 4 times, most recent failure: Lost task 
> 1190.3 in stage 392.0 (TID 122055, ip-172-31-32-196.ec2.internal, executor 
> 487): java.lang.NullPointerException
> at org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:110)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:414)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:413)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> a

[jira] [Comment Edited] (SPARK-32924) Web UI sort on duration is wrong

2020-10-09 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210977#comment-17210977
 ] 

Rakesh Raushan edited comment on SPARK-32924 at 10/9/20, 1:49 PM:
--

I think it's due to string sorting. A similar issue was fixed in SPARK-31983.


was (Author: rakson):
I think it's due to string sorting. A similar issue was fixed in 
[SPARK-31983|https://issues.apache.org/jira/browse/SPARK-31983]

> Web UI sort on duration is wrong
> 
>
> Key: SPARK-32924
> URL: https://issues.apache.org/jira/browse/SPARK-32924
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.6
>Reporter: t oo
>Priority: Major
> Attachments: ui_sort.png
>
>
> See attachment: 9 s (seconds) is shown as sorting larger than 8.1 min.
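
If the string-sorting hypothesis in the comment above is right, the effect is easy to reproduce outside the UI: lexicographic order on the rendered durations disagrees with order on the underlying milliseconds. A small self-contained sketch (the formatter is a toy of mine, not the UI's actual code):

{code:scala}
object DurationSortSketch {
  // Toy formatter in the spirit of the UI's human-readable durations.
  private def pretty(ms: Long): String =
    if (ms < 60000) s"${ms / 1000.0} s" else s"${ms / 60000.0} min"

  def main(args: Array[String]): Unit = {
    val durationsMs = Seq(9000L, 486000L) // 9 s and 8.1 min

    // Sorting the rendered strings ranks "9.0 s" above "8.1 min" ...
    println(durationsMs.map(pretty).sorted) // List(8.1 min, 9.0 s)

    // ... while sorting the raw values keeps the real order.
    println(durationsMs.sorted.map(pretty)) // List(9.0 s, 8.1 min)
  }
}
{code}

Sorting on the raw duration and using the pretty string only for display is the usual fix.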



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32924) Web UI sort on duration is wrong

2020-10-09 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210977#comment-17210977
 ] 

Rakesh Raushan commented on SPARK-32924:


I think it's due to string sorting. A similar issue was fixed in 
[SPARK-31983|https://issues.apache.org/jira/browse/SPARK-31983]

> Web UI sort on duration is wrong
> 
>
> Key: SPARK-32924
> URL: https://issues.apache.org/jira/browse/SPARK-32924
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.6
>Reporter: t oo
>Priority: Major
> Attachments: ui_sort.png
>
>
> See attachment: 9 s (seconds) is shown as sorting larger than 8.1 min.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33094) ORC format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33094:
-
Fix Version/s: 3.0.2

> ORC format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> --
>
> Key: SPARK-33094
> URL: https://issues.apache.org/jira/browse/SPARK-33094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("orc").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33101) LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33101:
-
Fix Version/s: 3.0.2
   2.4.8

> LibSVM format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> -
>
> Key: SPARK-33101
> URL: https://issues.apache.org/jira/browse/SPARK-33101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("libsvm").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33101) LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210808#comment-17210808
 ] 

Apache Spark commented on SPARK-33101:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29986

> LibSVM format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> -
>
> Key: SPARK-33101
> URL: https://issues.apache.org/jira/browse/SPARK-33101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("libsvm").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33101) LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210807#comment-17210807
 ] 

Apache Spark commented on SPARK-33101:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29986

> LibSVM format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> -
>
> Key: SPARK-33101
> URL: https://issues.apache.org/jira/browse/SPARK-33101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("libsvm").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33094) ORC format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210799#comment-17210799
 ] 

Apache Spark commented on SPARK-33094:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29985

> ORC format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> --
>
> Key: SPARK-33094
> URL: https://issues.apache.org/jira/browse/SPARK-33094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("orc").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33094) ORC format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210798#comment-17210798
 ] 

Apache Spark commented on SPARK-33094:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29985

> ORC format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> --
>
> Key: SPARK-33094
> URL: https://issues.apache.org/jira/browse/SPARK-33094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("orc").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32896) Add DataStreamWriter.table API

2020-10-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32896.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29767
[https://github.com/apache/spark/pull/29767]

> Add DataStreamWriter.table API
> --
>
> Key: SPARK-32896
> URL: https://issues.apache.org/jira/browse/SPARK-32896
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.1.0
>
>
> For now, there's no way to write to a table (especially a catalog table), even 
> if the table is capable of handling streaming writes.
> We can add a DataStreamWriter.table API to let end users specify a table as the 
> provider and let the streaming query write into it. That only specifies the 
> target table; the overall usage of DataStreamWriter isn't changed.
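
For reference, a minimal usage sketch of the streaming write-to-table flow; I am assuming the method landed as DataStreamWriter.toTable in the merged change, so check the 3.1.0 API docs for the final name and signature:

{code:scala}
import org.apache.spark.sql.SparkSession

object StreamToTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-32896-sketch").getOrCreate()

    val stream = spark.readStream
      .format("rate")                 // built-in test source: timestamp + value
      .option("rowsPerSecond", "5")
      .load()

    // Write into a catalog table instead of a path-based sink.
    // Method name assumed from this ticket's PR; verify against the docs.
    val query = stream.writeStream
      .option("checkpointLocation", "/tmp/spark-32896-checkpoint")
      .toTable("rate_sink")

    query.awaitTermination()
  }
}
{code}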



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33099) Respect executor idle timeout conf in ExecutorPodsAllocator

2020-10-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33099.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29981
[https://github.com/apache/spark/pull/29981]

> Respect executor idle timeout conf in ExecutorPodsAllocator
> ---
>
> Key: SPARK-33099
> URL: https://issues.apache.org/jira/browse/SPARK-33099
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33099) Respect executor idle timeout conf in ExecutorPodsAllocator

2020-10-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33099:
-

Assignee: Dongjoon Hyun

> Respect executor idle timeout conf in ExecutorPodsAllocator
> ---
>
> Key: SPARK-33099
> URL: https://issues.apache.org/jira/browse/SPARK-33099
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33101) LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33101.
---
Resolution: Fixed

Issue resolved by pull request 29984
[https://github.com/apache/spark/pull/29984]

> LibSVM format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> -
>
> Key: SPARK-33101
> URL: https://issues.apache.org/jira/browse/SPARK-33101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("libsvm").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33093) Why do my Spark 3 jobs fail to use external shuffle service on YARN?

2020-10-09 Thread Julien (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210715#comment-17210715
 ] 

Julien commented on SPARK-33093:


That worked, [~yumwang]; thanks!

> Why do my Spark 3 jobs fail to use external shuffle service on YARN?
> 
>
> Key: SPARK-33093
> URL: https://issues.apache.org/jira/browse/SPARK-33093
> Project: Spark
>  Issue Type: Question
>  Components: Deploy, Java API
>Affects Versions: 3.0.0
>Reporter: Julien
>Priority: Minor
>
> We are running a Spark-on-YARN setup, where each client uploads their own 
> Spark JARs for their job, to run in YARN executors. YARN exposes a shuffle 
> service on every NodeManager's 7337 port, and clients enable use of that.
> This has worked for a while, with clients using Spark 2 JARs, but we are 
> seeing issues when clients attempt to use Spark 3 JARs. When shuffling is 
> either disabled, or enabled but no use is made of the shuffle service, things 
> seem to continue working in Spark 3.
> When a Spark 3 job attempts to use the external service, we get a stack-trace 
> that looks like this:
> {noformat}java.lang.IllegalArgumentException: Unknown message type: 10
>   at 
> org.apache.spark.network.shuffle.protocol.BlockTransferMessage$Decoder.fromByteBuffer(BlockTransferMessage.java:67)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:71)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:154)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at ...{noformat}
> Message type 10 was introduced as of SPARK-27651, released in Spark 3.0.0; 
> this error hints at an older version of 
> {{BlockTransferMessage$Decoder.fromByteBuffer}} being used.
> {{ExternalShuffleBlockHandler}} was renamed to {{ExternalBlockHandler}} as of 
> SPARK-28593, also released in Spark 3.0.0; this stack-trace hints at an older 
> JAR being loaded.
> Our current Hadoop setup (Cloudera CDH parcels) is very likely to be 
> polluting the class-path with older JARs. Trying to figure out where the old 
> JARs come from, I added {{-verbose:class}} to the executor options, to log 
> all class loading.
> This is where things get interesting: there is no mention of the old 
> {{ExternalShuffleBlockHandler}} class anywhere, and 
> {{BlockTransferMessage$Decoder}} is reported as loaded from the Spark 3 JARs:
> {noformat}grep -E 
> 'org.apache.spark.network.shuffle.protocol.BlockTransferMessage|org.apache.spark.network.shuffle.ExternalShuffleBlockHandler|org.apache.spark.network.server.TransportRequestHandler|org.apache.spark.network.server.TransportChannelHandler|org.apache.spark.network.shuffle.ExternalBlockHandler'
>  example_shuffle_stdout.txt
> [Loaded org.apache.spark.network.server.TransportRequestHandler from 
> file:/hadoop/2/yarn/nm/filecache/0/2170513/spark-network-common_2.12-3.0.0.jar]
> [Loaded org.apache.spark.network.server.TransportChannelHandler from 
> file:/hadoop/2/yarn/nm/filecache/0/2170513/spark-network-common_2.12-3.0.0.jar]
> [Loaded org.apache.spark.network.shuffle.protocol.BlockTransferMessage from 
> file:/hadoop/1/yarn/nm/filecache/0/2170571/spark-network-shuffle_2.12-3.0.0.jar]
> [Loaded org.apache.spark.network.shuffle.protocol.BlockTransferMessage$Type 
> from 
> file:/hadoop/1/yarn/nm/filecache/0/2170571/spark-network-shuffle_2.12-3.0.0.jar]
> [Loaded 
> org.apache.spark.network.shuffle.protocol.BlockTransferMessage$Decoder from 
> file:/hadoop/1/yarn/nm/filecache/0/2170571/spark-network-shuffle_2.12-3.0.0.jar]
> [Loaded org.apache.spark.network.server.TransportRequestHandler$1 from 
> file:/hadoop/2/yarn/nm/filecache/0/2170513/spark-network-common_2.12-3.0.0.jar]
> [Loaded 
> org.apache.spark.network.server.TransportRequestHandler$$Lambda$666/376989599 
> from org.apache.spark.network.server.TransportRequestHandler]{noformat}
> I do not know how this is possible:
> - is the executor reporting a stack-trace that comes from another process 
> rather than itself?
> - are old classes loaded without being reported by {{-verbose:class}}?
> I'm not sure how to investigate this further, as I failed to locate precisely 
> how the instance of {{RpcHandler}} is injected into the 
> {{TransportRequestHandler}} for my executors.
> I did try setting {{spark.executor.userClas
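
One more way to cross-check which JAR actually served a class at runtime, independently of -verbose:class, is to ask the class for its code source. A throwaway sketch that can be pasted into spark-shell on the driver or shipped into a task; the class names are the ones from the stack trace above:

{code:scala}
// Report where a class was loaded from, complementing -verbose:class.
object WhichJar {
  def locate(className: String): String = {
    val clazz = Class.forName(className)
    Option(clazz.getProtectionDomain.getCodeSource)
      .map(_.getLocation.toString)
      .getOrElse("<no code source (bootstrap class loader?)>")
  }

  def main(args: Array[String]): Unit = {
    Seq(
      "org.apache.spark.network.shuffle.protocol.BlockTransferMessage$Decoder",
      "org.apache.spark.network.server.TransportRequestHandler"
    ).foreach(name => println(s"$name -> ${locate(name)}"))
  }
}
{code}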

[jira] [Commented] (SPARK-33101) LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210692#comment-17210692
 ] 

Apache Spark commented on SPARK-33101:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29984

> LibSVM format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> -
>
> Key: SPARK-33101
> URL: https://issues.apache.org/jira/browse/SPARK-33101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("libsvm").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33101) LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210694#comment-17210694
 ] 

Apache Spark commented on SPARK-33101:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29984

> LibSVM format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> -
>
> Key: SPARK-33101
> URL: https://issues.apache.org/jira/browse/SPARK-33101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("libsvm").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33101) LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33101:


Assignee: Maxim Gekk  (was: Apache Spark)

> LibSVM format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> -
>
> Key: SPARK-33101
> URL: https://issues.apache.org/jira/browse/SPARK-33101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("libsvm").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33101) LibSVM format does not propagate Hadoop config from DS options to underlying HDFS file system

2020-10-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33101:


Assignee: Apache Spark  (was: Maxim Gekk)

> LibSVM format does not propagate Hadoop config from DS options to underlying 
> HDFS file system
> -
>
> Key: SPARK-33101
> URL: https://issues.apache.org/jira/browse/SPARK-33101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>
> When running:
> {code:java}
> spark.read.format("libsvm").options(conf).load(path)
> {code}
> The underlying file system will not receive the `conf` options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13860) TPCDS query 39 returns wrong results compared to TPC official result set

2020-10-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210691#comment-17210691
 ] 

Apache Spark commented on SPARK-13860:
--

User 'leanken' has created a pull request for this issue:
https://github.com/apache/spark/pull/29983

> TPCDS query 39 returns wrong results compared to TPC official result set 
> -
>
> Key: SPARK-13860
> URL: https://issues.apache.org/jira/browse/SPARK-13860
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.1, 2.2.0
>Reporter: JESSE CHEN
>Priority: Major
>  Labels: bulk-closed, tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 39 returns wrong results compared 
> to official result set. This is at 1GB SF (validation run).
> q39a - 3 extra rows in SparkSQL output (e.g. 
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]); q39b - 3 extra rows in 
> SparkSQL output (e.g. [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733])
> Actual results 39a:
> {noformat}
> [1,265,1,324.75,1.2438391781531353,1,265,2,329.0,1.0151581328149208]
> [1,363,1,499.5,1.031941572270649,1,363,2,321.0,1.1411766752007977]
> [1,679,1,373.75,1.0955498064867504,1,679,2,417.5,1.042970994259454]
> [1,695,1,450.75,1.0835888283564505,1,695,2,368.75,1.1356494125569416]
> [1,789,1,357.25,1.03450938027956,1,789,2,410.0,1.0284221852702604]
> [1,815,1,216.5,1.1702270938111008,1,815,2,150.5,1.3057281471249382]
> [1,827,1,271.75,1.1046890134130438,1,827,2,424.75,1.1653198631238286]
> [1,1041,1,382.5,1.284808399803008,1,1041,2,424.75,1.000577271456812]
> [1,1155,1,184.0,NaN,1,1155,2,343.3,1.1700233592269733]
> [1,1569,1,212.0,1.630213519639535,1,1569,2,239.25,1.2641513267800557]
> [1,1623,1,338.25,1.1285483279713715,1,1623,2,261.3,1.2717809002195564]
> [1,2581,1,448.5,1.060429041250449,1,2581,2,476.25,1.0362984739390064]
> [1,2705,1,246.25,1.0120308357959693,1,2705,2,294.7,1.0742134101583702]
> [1,3131,1,393.75,1.0037613982687346,1,3131,2,480.5,1.0669144981482768]
> [1,3291,1,374.5,1.195189833087008,1,3291,2,265.25,1.572972106948466]
> [1,3687,1,279.75,1.4260909081999698,1,3687,2,157.25,1.4534340882531784]
> [1,4955,1,495.25,1.0318296151625301,1,4955,2,322.5,1.1693842343776149]
> [1,5627,1,282.75,1.5657032366359889,1,5627,2,297.5,1.2084286841430678]
> [1,7017,1,175.5,1.0427454215644427,1,7017,2,321.3,1.0183356932936254]
> [1,7317,1,366.3,1.025466403613547,1,7317,2,378.0,1.2172513189920555]
> [1,7569,1,430.5,1.0874396852180854,1,7569,2,360.25,1.047005559314515]
> [1,7999,1,166.25,1.7924231710846223,1,7999,2,375.3,1.008092263550718]
> [1,8319,1,306.75,1.1615378040478215,1,8319,2,276.0,1.1420996385609428]
> [1,8443,1,327.75,1.256718374192724,1,8443,2,332.5,1.0044167259988928]
> [1,8583,1,319.5,1.024108893111539,1,8583,2,310.25,1.2358813775861328]
> [1,8591,1,398.0,1.1478168692042447,1,8591,2,355.75,1.0024472149348966]
> [1,8611,1,300.5,1.5191545184147954,1,8611,2,243.75,1.2342122780960432]
> [1,9081,1,367.0,1.0878932141280895,1,9081,2,435.0,1.0330530776324107]
> [1,9357,1,351.7,1.1902922622025887,1,9357,2,427.0,1.0438583026358363]
> [1,9449,1,406.25,1.0183183104803557,1,9449,2,175.0,1.0544779796296408]
> [1,9713,1,242.5,1.1035044355064203,1,9713,2,393.0,1.208474608738988]
> [1,9809,1,479.0,1.0189602512117633,1,9809,2,317.5,1.0614142074924882]
> [1,9993,1,417.75,1.0099832672435247,1,9993,2,204.5,1.552870745350107]
> [1,10127,1,239.75,1.0561770587198123,1,10127,2,359.25,1.1857980403742183]
> [1,11159,1,407.25,1.0785507154337637,1,11159,2,250.0,1.334757905639321]
> [1,11277,1,211.25,1.2615858275316627,1,11277,2,330.75,1.0808767951625093]
> [1,11937,1,344.5,1.085804026843784,1,11937,2,200.34,1.0638527063883725]
> [1,12373,1,387.75,1.1014904822941258,1,12373,2,306.0,1.0761744390394028]
> [1,12471,1,365.25,1.0607570183728479,1,12471,2,327.25,1.0547560580567852]
> [1,12625,1,279.0,1.3016560542373208,1,12625,2,443.25,1.0604958838068959]
> [1,12751,1,280.75,1.10833057888089,1,12751,2,369.3,1.3416504398884601]
> [1,12779,1,331.0,1.041690207320035,1,12779,2,359.0,1.028978056175258]
> [1,13077,1,367.7,1.345523904195734,1,13077,2,358.7,1.5132429058096555]
> [1,13191,1,260.25,1.063569632291568,1,13191,2,405.0,1.0197999172180061]
> [1,13561,1,335.25,1.2609616961776389,1,13561,2,240.0,1.0513604502245155]
> [1,13935,1,311.75,1.0399289695412326,1,13935,2,275.0,1.0367527180321774]
> [1,14687,1,358.0,1.4369356919381713,1,14687,2,187.0,1.5493631531474956]
> [1,14719,1,209.0,1.0411509639707628,1,14719,2,489.0,1.376616882800804]
> [1,15345,1,148.5,1.5295784035794024,1,15345,2,246.5,1.5087987747231526]
> [1,15427,1,482.75,1.0124238928335
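
All of the extra rows carry NaN in one of the computed columns, so one thing worth checking - an assumption on my part, not a confirmed diagnosis - is Spark SQL's NaN handling: NaN sorts after every other double, and if comparisons follow the same ordering, a predicate like cov > 1.0 also keeps NaN rows that the official answer set may exclude. A quick illustration:

{code:scala}
import org.apache.spark.sql.SparkSession

object NanSemanticsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-13860-nan-check")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(0.9, Double.NaN, 343.3).toDF("cov")

    // Documented behaviour: in ascending order NaN sorts after every other
    // double, i.e. it is treated as larger than any other numeric value.
    df.orderBy("cov").show()

    // If binary comparisons follow the same ordering (my reading of the NaN
    // semantics), the NaN row survives a "greater than" filter as well.
    df.where("cov > 1.0").show()

    spark.stop()
  }
}
{code}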

[jira] [Commented] (SPARK-33098) Exception when using 'in' to compare a partition column to a literal with the wrong type

2020-10-09 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210689#comment-17210689
 ] 

Peter Toth commented on SPARK-33098:


I've started to look into this issue.

> Exception when using 'in' to compare a partition column to a literal with the 
> wrong type
> 
>
> Key: SPARK-33098
> URL: https://issues.apache.org/jira/browse/SPARK-33098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Comparing a partition column against a literal with the wrong type works if 
> you use equality ('='). However, if you use 'in', you get:
> {noformat}
> MetaException(message:Filtering is supported only on partition keys of type 
> string)
> {noformat}
> For example:
> {noformat}
> spark-sql> create table test (a int) partitioned by (b int) stored as parquet;
> Time taken: 0.323 seconds
> spark-sql> insert into test values (1, 1), (1, 2), (2, 2);
> 20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
> 20/10/08 19:57:14 WARN log: Updating partition stats fast for: test
> 20/10/08 19:57:14 WARN log: Updated size to 418
> 20/10/08 19:57:14 WARN log: Updated size to 836
> Time taken: 2.124 seconds
> spark-sql> -- this works, of course
> spark-sql> select * from test where b in (2);
> 1 2
> 2 2
> Time taken: 0.13 seconds, Fetched 2 row(s)
> spark-sql> -- this also works (equals with wrong type)
> spark-sql> select * from test where b = '2';
> 1 2
> 2 2
> Time taken: 0.132 seconds, Fetched 2 row(s)
> spark-sql> -- this does not work ('in' with wrong type)
> spark-sql> select * from test where b in ('2');
> 20/10/08 19:58:30 ERROR SparkSQLDriver: Failed in [select * from test where b 
> in ('2')]
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
> -
> -
> -
> Caused by: MetaException(message:Filtering is supported only on partition 
> keys of type string)
> {noformat}
> There are also interesting variations of this using the dataframe API:
> {noformat}
> scala> sql("select cast(b as string) as b from test where b in 
> (2)").show(false)
> +---+
> |b  |
> +---+
> |2  |
> |2  |
> +---+
> scala> sql("select cast(b as string) as b from test").filter("b in 
> (2)").show(false)
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
> -
> -
> Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is 
> supported only on partition keys of type string
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33100) Support parse the sql statements with c-style comments

2020-10-09 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-33100:

Description: 
Currently, spark-sql does not support parsing SQL statements that contain C-style 
comments.
For example, the statements:
{code:java}
/* SELECT 'test'; */
SELECT 'test';
{code}
would be split into two statements:
The first: "/* SELECT 'test'"
The second: "*/ SELECT 'test'"

Then an exception is thrown because the first statement is illegal.


  was:
Currently, spark-sql does not support parsing SQL statements that contain C-style 
comments.
For example:
For the sql statements:
{code:java}
/* SELECT 'test'; */
SELECT 'test';
{code}
Would be split to two statements:
The first: "/* SELECT 'test'"
The second: "*/ SELECT 'test'"

Then it would throw an exception because the first one is illegal.



> Support parse the sql statements with c-style comments
> --
>
> Key: SPARK-33100
> URL: https://issues.apache.org/jira/browse/SPARK-33100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: feiwang
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, spark-sql does not support parsing SQL statements that contain C-style 
> comments.
> For example, the statements:
> {code:java}
> /* SELECT 'test'; */
> SELECT 'test';
> {code}
> would be split into two statements:
> The first: "/* SELECT 'test'"
> The second: "*/ SELECT 'test'"
> Then an exception is thrown because the first statement is illegal.
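
The fix essentially requires the statement splitter to track /* ... */ spans so that a semicolon inside a comment does not end a statement. A toy sketch of that idea (not the actual spark-sql splitter; it ignores quoted strings and -- line comments for brevity):

{code:scala}
// Toy splitter that ignores ';' inside C-style /* ... */ comments.
object CommentAwareSplit {
  def split(sql: String): Seq[String] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[String]
    val cur = new StringBuilder
    var inComment = false
    var i = 0
    while (i < sql.length) {
      if (!inComment && sql.startsWith("/*", i)) { inComment = true; cur.append("/*"); i += 2 }
      else if (inComment && sql.startsWith("*/", i)) { inComment = false; cur.append("*/"); i += 2 }
      else if (!inComment && sql.charAt(i) == ';') { out += cur.toString; cur.clear(); i += 1 }
      else { cur.append(sql.charAt(i)); i += 1 }
    }
    if (cur.nonEmpty) out += cur.toString
    out.toSeq.map(_.trim).filter(_.nonEmpty)
  }

  def main(args: Array[String]): Unit = {
    // The ticket's example: the ';' inside the comment must not split the input,
    // so this prints exactly one statement.
    split("/* SELECT 'test'; */\nSELECT 'test';").foreach(s => println(s"<<$s>>"))
  }
}
{code}

A real implementation would also need to handle quoted strings, unterminated comments, and -- line comments.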



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


