[jira] [Resolved] (SPARK-26808) Pruned schema should not change nullability
[ https://issues.apache.org/jira/browse/SPARK-26808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh resolved SPARK-26808. - Resolution: Won't Fix > Pruned schema should not change nullability > --- > > Key: SPARK-26808 > URL: https://issues.apache.org/jira/browse/SPARK-26808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Minor > > We prune unnecessary nested fields from the requested schema when reading > Parquet. It seems we don't currently keep the original nullability in the pruned > schema. We should preserve it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
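For readers outside the Parquet code path, the intent of the issue can be illustrated with a plain-Python sketch (a StructType modeled as nested dicts; this is not Spark's actual schema-pruning code, and the helper name is made up):

```python
# Illustrative sketch in plain Python (not Spark's actual Catalyst code):
# prune a struct schema to the requested fields while carrying over each
# field's original nullable flag unchanged, instead of resetting it.
def prune_schema(schema, requested):
    kept = [dict(f) for f in schema["fields"] if f["name"] in requested]
    return {"type": "struct", "fields": kept}

full = {"type": "struct", "fields": [
    {"name": "id", "type": "long", "nullable": False},
    {"name": "name", "type": "string", "nullable": True},
]}

pruned = prune_schema(full, {"id"})
# The surviving field keeps nullable=False rather than being reset.
assert pruned["fields"][0]["nullable"] is False
```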
[jira] [Commented] (SPARK-18161) Default PickleSerializer pickle protocol doesn't handle > 4GB objects
[ https://issues.apache.org/jira/browse/SPARK-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758906#comment-16758906 ] Boris Shminke commented on SPARK-18161: --- [~ssimmons] thanks for starting this work. [~hyukjin.kwon] thanks for guiding me during the review. :) > Default PickleSerializer pickle protocol doesn't handle > 4GB objects > - > > Key: SPARK-18161 > URL: https://issues.apache.org/jira/browse/SPARK-18161 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.0.1 >Reporter: Sloane Simmons >Priority: Major > Fix For: 3.0.0 > > > When broadcasting a fairly large numpy matrix in a Spark 2.0.1 program, there > is an error serializing the object with: > {{OverflowError: cannot serialize a bytes object larger than 4 GiB}} > in the stack trace. > This is because Python's pickle serialization (with protocol <= 3) uses a > 32-bit integer for the object size, and so cannot handle objects larger than > 4 gigabytes. This was changed in Protocol 4 of pickle > (https://www.python.org/dev/peps/pep-3154/#bit-opcodes-for-large-objects) and > is available in Python 3.4+. > I would like to use this protocol for broadcasting and in the default > PickleSerializer where available to make pyspark more robust to broadcasting > large variables.
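The proposal above can be sketched with the stdlib alone (illustrative, not pyspark's actual serializer code):

```python
import pickle

# pickle protocols <= 3 store a bytes object's length in a 32-bit field,
# so objects over 4 GiB overflow, while protocol 4 (Python 3.4+, PEP 3154)
# adds 8-byte-length opcodes. A tiny payload is used here; the protocol
# selection is the point.
payload = b"x" * 1024

for proto in (2, 3, 4):
    assert pickle.loads(pickle.dumps(payload, protocol=proto)) == payload

# Choose protocol 4 where available, as the issue proposes:
best_protocol = min(pickle.HIGHEST_PROTOCOL, 4)
assert best_protocol == 4  # holds on Python 3.4+
```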
[jira] [Assigned] (SPARK-26813) Consolidate java version across language compilers and build tools
[ https://issues.apache.org/jira/browse/SPARK-26813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26813: Assignee: Apache Spark > Consolidate java version across language compilers and build tools > -- > > Key: SPARK-26813 > URL: https://issues.apache.org/jira/browse/SPARK-26813 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Apache Spark >Priority: Minor > > The java version here means versions of javac source, javac target, scalac > target. They could be consolidated as a single version (currently 1.8) > || ||javac||scalac|| > |source|1.8|2.12/2.11| > |target|1.8|1.8| > The current issues are as follows > * Maven defines a single property to specify java version (java.version) > while SBT build defines different properties for javac (javacJVMVersion) and > scalac (scalacJVMVersion). SBT should use a single property as Maven does. > * Furthermore, it's better for SBT to refer to java.version defined by > Maven. This is possible since we've already been using sbt-pom-reader.
[jira] [Assigned] (SPARK-26813) Consolidate java version across language compilers and build tools
[ https://issues.apache.org/jira/browse/SPARK-26813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26813: Assignee: (was: Apache Spark) > Consolidate java version across language compilers and build tools > -- > > Key: SPARK-26813 > URL: https://issues.apache.org/jira/browse/SPARK-26813 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Minor > > The java version here means versions of javac source, javac target, scalac > target. They could be consolidated as a single version (currently 1.8) > || ||javac||scalac|| > |source|1.8|2.12/2.11| > |target|1.8|1.8| > The current issues are as follows > * Maven defines a single property to specify java version (java.version) > while SBT build defines different properties for javac (javacJVMVersion) and > scalac (scalacJVMVersion). SBT should use a single property as Maven does. > * Furthermore, it's better for SBT to refer to java.version defined by > Maven. This is possible since we've already been using sbt-pom-reader.
[jira] [Updated] (SPARK-26813) Consolidate java version across language compilers and build tools
[ https://issues.apache.org/jira/browse/SPARK-26813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-26813: - Description: The java version here means versions of javac source, javac target, scalac target. They could be consolidated as a single version (currently 1.8) || ||javac||scalac|| |source|1.8|2.12/2.11| |target|1.8|1.8| The current issues are as follows * Maven defines a single property to specify java version (java.version) while SBT build defines different properties for javac (javacJVMVersion) and scalac (scalacJVMVersion). SBT should use a single property as Maven does. * Furthermore, it's even better for SBT to refer to java.version defined by Maven. This is possible since we've already been using sbt-pom-reader. was: The java version here means versions of javac source, javac target, scalac target. They could be consolidated as a single version (currently 1.8) || ||javac||scalac|| |source|1.8|2.12/2.11| |target|1.8|1.8| The current issues are as follows * Maven defines a single property to specify java version (java.version) while SBT build defines different properties for javac (javacJVMVersion) and scalac (scalacJVMVersion). SBT should use a single property as Maven does. * Furthermore, it's even better for SBT to refer to java.version defined by Maven. This is possible since we've already been using sbt-pom-reader. > Consolidate java version across language compilers and build tools > -- > > Key: SPARK-26813 > URL: https://issues.apache.org/jira/browse/SPARK-26813 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Minor > > The java version here means versions of javac source, javac target, scalac > target. 
They could be consolidated as a single version (currently 1.8) > || ||javac||scalac|| > |source|1.8|2.12/2.11| > |target|1.8|1.8| > The current issues are as follows > * Maven defines a single property to specify java version (java.version) > while SBT build defines different properties for javac (javacJVMVersion) and > scalac (scalacJVMVersion). SBT should use a single property as Maven does. > * Furthermore, it's even better for SBT to refer to java.version defined by > Maven. This is possible since we've already been using sbt-pom-reader.
[jira] [Updated] (SPARK-26813) Consolidate java version across language compilers and build tools
[ https://issues.apache.org/jira/browse/SPARK-26813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-26813: - Description: The java version here means versions of javac source, javac target, scalac target. They could be consolidated as a single version (currently 1.8) || ||javac||scalac|| |source|1.8|2.12/2.11| |target|1.8|1.8| The current issues are as follows * Maven defines a single property to specify java version (java.version) while SBT build defines different properties for javac (javacJVMVersion) and scalac (scalacJVMVersion). SBT should use a single property as Maven does. * Furthermore, it's better for SBT to refer to java.version defined by Maven. This is possible since we've already been using sbt-pom-reader. was: The java version here means versions of javac source, javac target, scalac target. They could be consolidated as a single version (currently 1.8) || ||javac||scalac|| |source|1.8|2.12/2.11| |target|1.8|1.8| The current issues are as follows * Maven defines a single property to specify java version (java.version) while SBT build defines different properties for javac (javacJVMVersion) and scalac (scalacJVMVersion). SBT should use a single property as Maven does. * Furthermore, it's even better for SBT to refer to java.version defined by Maven. This is possible since we've already been using sbt-pom-reader. > Consolidate java version across language compilers and build tools > -- > > Key: SPARK-26813 > URL: https://issues.apache.org/jira/browse/SPARK-26813 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Minor > > The java version here means versions of javac source, javac target, scalac > target. 
They could be consolidated as a single version (currently 1.8) > || ||javac||scalac|| > |source|1.8|2.12/2.11| > |target|1.8|1.8| > The current issues are as follows > * Maven defines a single property to specify java version (java.version) > while SBT build defines different properties for javac (javacJVMVersion) and > scalac (scalacJVMVersion). SBT should use a single property as Maven does. > * Furthermore, it's better for SBT to refer to java.version defined by > Maven. This is possible since we've already been using sbt-pom-reader.
[jira] [Updated] (SPARK-26813) Consolidate java version across language compilers and build tools
[ https://issues.apache.org/jira/browse/SPARK-26813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-26813: - Description: The java version here means versions of javac source, javac target, scalac target. They could be consolidated as a single version (currently 1.8) || ||javac||scalac|| |source|1.8|2.12/2.11| |target|1.8|1.8| The current issues are as follows * Maven defines a single property to specify java version (java.version) while SBT build defines different properties for javac (javacJVMVersion) and scalac (scalacJVMVersion). SBT should use a single property as Maven does. * Furthermore, it's even better for SBT to refer to java.version defined by Maven. This is possible since we've already been using sbt-pom-reader. was: The java version here means versions of javac source, javac target, scalac target. They could be consolidated as a single version (currently 1.8) || ||javac||scalac|| |source|1.8|2.12/2.11| |target|1.8|1.8| The current issues are as follows * Maven defines a single property to specify java version (java.version) while SBT build defines different properties for javac (javacJVMVersion) and scalac (scalacJVMVersion). SBT should use a single property as Maven does. * For SBT build, both javac options and scalac options related to java version are provided. For Maven build, scala-maven-plugin compiles both Java and Scala code. However, javac options related to java version (-source, -target) are provided while scalac options related to java version (-target:TARGET) are not provided, which means scalac will depend on the default options (jvm-1.8). It's better for Maven build to explicitly provide scalac options as well. * Furthermore, it's even better for SBT to refer to java.version defined by Maven. This is possible since we've already been using sbt-pom-reader. 
> Consolidate java version across language compilers and build tools > -- > > Key: SPARK-26813 > URL: https://issues.apache.org/jira/browse/SPARK-26813 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Minor > > The java version here means versions of javac source, javac target, scalac > target. They could be consolidated as a single version (currently 1.8) > || ||javac||scalac|| > |source|1.8|2.12/2.11| > |target|1.8|1.8| > The current issues are as follows > * Maven defines a single property to specify java version (java.version) > while SBT build defines different properties for javac (javacJVMVersion) and > scalac (scalacJVMVersion). SBT should use a single property as Maven does. > * Furthermore, it's even better for SBT to refer to java.version defined by > Maven. This is possible since we've already been using sbt-pom-reader.
[jira] [Created] (SPARK-26813) Consolidate java version across language compilers and build tools
Chenxiao Mao created SPARK-26813: Summary: Consolidate java version across language compilers and build tools Key: SPARK-26813 URL: https://issues.apache.org/jira/browse/SPARK-26813 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.0 Reporter: Chenxiao Mao The java version here means versions of javac source, javac target, scalac target. They could be consolidated as a single version (currently 1.8) || ||javac||scalac|| |source|1.8|2.12/2.11| |target|1.8|1.8| The current issues are as follows * Maven defines a single property to specify java version (java.version) while SBT build defines different properties for javac (javacJVMVersion) and scalac (scalacJVMVersion). SBT should use a single property as Maven does. * For SBT build, both javac options and scalac options related to java version are provided. For Maven build, scala-maven-plugin compiles both Java and Scala code. However, javac options related to java version (-source, -target) are provided while scalac options related to java version (-target:TARGET) are not provided, which means scalac will depend on the default options (jvm-1.8). It's better for Maven build to explicitly provide scalac options as well. * Furthermore, it's even better for SBT to refer to java.version defined by Maven. This is possible since we've already been using sbt-pom-reader.
[jira] [Commented] (SPARK-26810) Fixing SPARK-25072 broke existing code and fails to show error message
[ https://issues.apache.org/jira/browse/SPARK-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758878#comment-16758878 ] Hyukjin Kwon commented on SPARK-26810: -- Also, the workaround is very easy: just add one {{*}}: {code} from pyspark.sql import Row r = Row(*['a','b']) r('1', '2') {code} Is {{r = Row(['a','b'])}} usage documented somewhere? I think it was a mistake that we ever supported it. > Fixing SPARK-25072 broke existing code and fails to show error message > -- > > Key: SPARK-26810 > URL: https://issues.apache.org/jira/browse/SPARK-26810 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Arttu Voutilainen >Priority: Minor > > Hey, > We upgraded Spark recently, and > https://issues.apache.org/jira/browse/SPARK-25072 caused our pipeline to fail > after the upgrade. Annoyingly, the error message formatting also threw an > exception itself, thus hiding the message we should have seen. > Repro using gettyimages/docker-spark, on 2.4.0: > {code} > from pyspark.sql import Row > r = Row(['a','b']) > r('1', '2') > {code} > {code} > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__ > "but got %s" % (self, len(self), args)) > File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__ > return "<Row(%s)>" % ", ".join(self) > TypeError: sequence item 0: expected str instance, list found > {code} > On 2.3.1, and also showing how this was used: > {code} > from pyspark.sql import Row, types as T > r = Row(['a','b']) > df = spark.createDataFrame([Row(col='doesntmatter')]) > rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')]) > spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), > T.StructField('b', T.StringType())])).collect() > {code} > {code} > [Row(a='a1', b='b2'), Row(a='a1', b='b2')] > {code} > While I do think the code we had was quite horrible, it used to work. 
The > unexpected error came from __repr__ as it assumes that the arguments given to > Row constructor are strings. That sounds like a reasonable assumption, should > the Row constructor validate that it holds true maybe? (I guess that might be > another potentially breaking change though, if someone has as weird code as > this one...)
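The difference the extra {{*}} makes is plain Python argument unpacking, independent of pyspark (the {{fields()}} helper below is hypothetical and merely stands in for the Row constructor, which expects field names as separate arguments):

```python
# Plain-Python illustration of what the extra {{*}} does: Row(['a','b'])
# passes ONE argument (a list), while Row(*['a','b']) unpacks the list
# into two separate field-name arguments.
def fields(*names):
    return names

assert fields(*["a", "b"]) == ("a", "b")    # unpacked: two field names
assert fields(["a", "b"]) == (["a", "b"],)  # one argument: a single list
```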
[jira] [Commented] (SPARK-26810) Fixing SPARK-25072 broke existing code and fails to show error message
[ https://issues.apache.org/jira/browse/SPARK-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758876#comment-16758876 ] Hyukjin Kwon commented on SPARK-26810: -- {code} Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__ "but got %s" % (self, len(self), args)) File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__ return "<Row(%s)>" % ", ".join(self) TypeError: sequence item 0: expected str instance, list found {code} That traceback looks like another issue, I guess SPARK-23299. Are you sure SPARK-25072 is the cause? I don't see the relevant error messages. > Fixing SPARK-25072 broke existing code and fails to show error message > -- > > Key: SPARK-26810 > URL: https://issues.apache.org/jira/browse/SPARK-26810 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Arttu Voutilainen >Priority: Minor > > Hey, > We upgraded Spark recently, and > https://issues.apache.org/jira/browse/SPARK-25072 caused our pipeline to fail > after the upgrade. Annoyingly, the error message formatting also threw an > exception itself, thus hiding the message we should have seen. 
> Repro using gettyimages/docker-spark, on 2.4.0: > {code} > from pyspark.sql import Row > r = Row(['a','b']) > r('1', '2') > {code} > {code} > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__ > "but got %s" % (self, len(self), args)) > File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__ > return "<Row(%s)>" % ", ".join(self) > TypeError: sequence item 0: expected str instance, list found > {code} > On 2.3.1, and also showing how this was used: > {code} > from pyspark.sql import Row, types as T > r = Row(['a','b']) > df = spark.createDataFrame([Row(col='doesntmatter')]) > rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')]) > spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), > T.StructField('b', T.StringType())])).collect() > {code} > {code} > [Row(a='a1', b='b2'), Row(a='a1', b='b2')] > {code} > While I do think the code we had was quite horrible, it used to work. The > unexpected error came from __repr__ as it assumes that the arguments given to > Row constructor are strings. That sounds like a reasonable assumption, should > the Row constructor validate that it holds true maybe? (I guess that might be > another potentially breaking change though, if someone has as weird code as > this one...)
[jira] [Commented] (SPARK-26809) insert overwrite directory + concat function => error
[ https://issues.apache.org/jira/browse/SPARK-26809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758873#comment-16758873 ] Hyukjin Kwon commented on SPARK-26809: -- Are you able to post a self-contained reproducer? It will avoid duplicated effort when other people start to investigate. > insert overwrite directory + concat function => error > - > > Key: SPARK-26809 > URL: https://issues.apache.org/jira/browse/SPARK-26809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: ant_nebula >Priority: Critical > > insert overwrite directory '/tmp/xx' > select concat(col1, col2) > from tableXX > limit 3 > > Caused by: org.apache.hadoop.hive.serde2.SerDeException: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 3 elements > while columns.types has 2 elements! > at > org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145) > at > org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85) > at > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125) > at > org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:119) > at > org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) > at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120) > at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:233) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)
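The SerDeException above is a metadata consistency check: Hive's LazySimpleSerDe requires the serde properties listing column names and column types to have equal length. A plain-Python illustration (the property values below are hypothetical, chosen only to reproduce the report's 3-vs-2 mismatch):

```python
# Hypothetical serde properties mirroring the mismatch in the stack trace:
# three column names but only two column types.
props = {"columns": "c0,c1,c2", "columns.types": "string,string"}

def check_columns(props):
    # Mirrors the consistency check in LazySerDeParameters.extractColumnInfo.
    names = props["columns"].split(",")
    types = props["columns.types"].split(",")
    if len(names) != len(types):
        raise ValueError("columns has %d elements while columns.types "
                         "has %d elements!" % (len(names), len(types)))

try:
    check_columns(props)
    message = None
except ValueError as exc:
    message = str(exc)
```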
[jira] [Resolved] (SPARK-26796) Testcases failing with "org.apache.hadoop.fs.ChecksumException" error
[ https://issues.apache.org/jira/browse/SPARK-26796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26796. -- Resolution: Cannot Reproduce > Testcases failing with "org.apache.hadoop.fs.ChecksumException" error > - > > Key: SPARK-26796 > URL: https://issues.apache.org/jira/browse/SPARK-26796 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.2, 2.4.0 > Environment: Ubuntu 16.04 > Java Version > openjdk version "1.8.0_192" > OpenJDK Runtime Environment (build 1.8.0_192-b12_openj9) > Eclipse OpenJ9 VM (build openj9-0.11.0, JRE 1.8.0 Compressed References > 20181107_80 (JIT enabled, AOT enabled) > OpenJ9 - 090ff9dcd > OMR - ea548a66 > JCL - b5a3affe73 based on jdk8u192-b12) > > Hadoop Version > Hadoop 2.7.1 > Subversion Unknown -r Unknown > Compiled by test on 2019-01-29T09:09Z > Compiled with protoc 2.5.0 > From source with checksum 5e94a235f9a71834e2eb73fb36ee873f > This command was run using > /home/test/hadoop-release-2.7.1/hadoop-dist/target/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar > > > >Reporter: Anuja Jakhade >Priority: Major > > Observing test case failures due to Checksum error > Below is the error log > [ERROR] checkpointAndComputation(test.org.apache.spark.JavaAPISuite) Time > elapsed: 1.232 s <<< ERROR! 
> org.apache.spark.SparkException: > Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor > driver): org.apache.hadoop.fs.ChecksumException: Checksum error: > file:/home/test/spark/core/target/tmp/1548319689411-0/fd0ba388-539c-49aa-bf76-e7d50aa2d1fc/rdd-0/part-0 > at 0 exp: 222499834 got: 1400184476 > at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:323) > at > org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:279) > at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:214) > at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:232) > at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196) > at java.io.DataInputStream.read(DataInputStream.java:149) > at > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2769) > at > java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2785) > at > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3262) > at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:968) > at java.io.ObjectInputStream.<init>(ObjectInputStream.java:390) > at > org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122) > at > org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:300) > at > org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:813) > Driver stacktrace: > at > test.org.apache.spark.JavaAPISuite.checkpointAndComputation(JavaAPISuite.java:1243) > Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error:
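The mechanism behind a ChecksumException can be illustrated with a stdlib CRC (a conceptual sketch only: Hadoop's ChecksumFileSystem actually uses CRC32/CRC32C over fixed-size chunks stored in a side ".crc" file, not this code):

```python
import zlib

# A CRC computed at write time must match the CRC recomputed at read time;
# any change to the bytes in between produces the "exp: ... got: ..."
# mismatch seen in the report.
def write_chunk(data):
    return data, zlib.crc32(data)

def read_chunk(data, stored_crc):
    actual = zlib.crc32(data)
    if actual != stored_crc:
        raise IOError("Checksum error: exp: %d got: %d" % (stored_crc, actual))
    return data

chunk, crc = write_chunk(b"spark checkpoint bytes")
assert read_chunk(chunk, crc) == chunk

# A single flipped bit ('s' -> 'S') changes the recomputed CRC.
try:
    read_chunk(b"Spark checkpoint bytes", crc)
    detected = False
except IOError:
    detected = True
```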
[jira] [Commented] (SPARK-26804) Spark sql carries newline char from last csv column when imported
[ https://issues.apache.org/jira/browse/SPARK-26804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758872#comment-16758872 ] Hyukjin Kwon commented on SPARK-26804: -- Can you show your input file? It would be easier to verify the issue if there's a self-contained reproducer. I am leaving this JIRA resolved until the details are provided. > Spark sql carries newline char from last csv column when imported > - > > Key: SPARK-26804 > URL: https://issues.apache.org/jira/browse/SPARK-26804 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Raj >Priority: Major > > I am trying to generate external sql tables in DataBricks using Spark sql > query. Below is my query. The query reads csv file and creates external table > but it carries the newline char while creating the last column. Is there a > way to resolve this issue? > > %sql > create table if not exists <> > using CSV > options ("header"="true", "inferschema"="true","multiLine"="true", > "escape"='"') > location
[jira] [Resolved] (SPARK-26804) Spark sql carries newline char from last csv column when imported
[ https://issues.apache.org/jira/browse/SPARK-26804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26804. -- Resolution: Incomplete > Spark sql carries newline char from last csv column when imported > - > > Key: SPARK-26804 > URL: https://issues.apache.org/jira/browse/SPARK-26804 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Raj >Priority: Major > > I am trying to generate external sql tables in DataBricks using Spark sql > query. Below is my query. The query reads csv file and creates external table > but it carries the newline char while creating the last column. Is there a > way to resolve this issue? > > %sql > create table if not exists <> > using CSV > options ("header"="true", "inferschema"="true","multiLine"="true", > "escape"='"') > location
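The {{multiLine}} and {{escape}} options in the query above matter because a quoted CSV field may legally contain an embedded newline; a pyspark-free illustration using Python's stdlib csv module (the sample data is hypothetical):

```python
import csv
import io

# A quoted CSV field may legally contain an embedded newline. A reader
# that splits the file on raw line breaks (Spark's default behavior
# without multiLine=true) sees a broken last column; a CSV-aware parse
# keeps the field intact.
raw = 'a,b\n"1","line1\nline2"\n'

rows = list(csv.reader(io.StringIO(raw)))  # CSV-aware parse
naive = raw.strip().split("\n")            # raw line splitting

assert rows == [["a", "b"], ["1", "line1\nline2"]]  # 2 records, field intact
assert len(naive) == 3                              # record split in half
```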
[jira] [Commented] (SPARK-26801) Spark unable to read valid avro types
[ https://issues.apache.org/jira/browse/SPARK-26801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758869#comment-16758869 ] Hyukjin Kwon commented on SPARK-26801: -- Thanks for reporting this. Would you be interested in narrowing down the problem? > Spark unable to read valid avro types > - > > Key: SPARK-26801 > URL: https://issues.apache.org/jira/browse/SPARK-26801 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dhruve Ashar >Priority: Major > > Currently the external avro package reads avro schemas for type records only. > This is probably because of representation of InternalRow in spark sql. As a > result, if the avro file has anything other than a sequence of records it > fails to read it. > We faced this issue earlier while trying to read primitive types. We > encountered this again while trying to read an array of records. Below are > code examples trying to read valid avro data showing the stack traces. > {code:java} > spark.read.format("avro").load("avroTypes/randomInt.avro").show > java.lang.RuntimeException: Avro schema cannot be converted to a Spark SQL > StructType: > "int" > at > org.apache.spark.sql.avro.AvroFileFormat.inferSchema(AvroFileFormat.scala:95) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) > ... 
49 elided > == > scala> spark.read.format("avro").load("avroTypes/randomEnum.avro").show > java.lang.RuntimeException: Avro schema cannot be converted to a Spark SQL > StructType: > { > "type" : "enum", > "name" : "Suit", > "symbols" : [ "SPADES", "HEARTS", "DIAMONDS", "CLUBS" ] > } > at > org.apache.spark.sql.avro.AvroFileFormat.inferSchema(AvroFileFormat.scala:95) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) > ... 49 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
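The error above follows from Spark SQL's row model: a DataFrame row is a struct, so only a top-level Avro record maps onto a StructType, and bare primitives, enums, or arrays at the top level are rejected at schema-inference time. A toy, Spark-free sketch of that constraint (the function and schema encoding here are illustrative, not Spark's actual internals):

```python
def to_struct_fields(avro_schema):
    """Toy model of the Avro-to-SQL conversion: only a top-level 'record'
    schema can become a StructType (one column per record field)."""
    if isinstance(avro_schema, dict) and avro_schema.get("type") == "record":
        return [(f["name"], f["type"]) for f in avro_schema["fields"]]
    # Mirrors the error in the report: primitives, enums, arrays, ... are rejected.
    raise RuntimeError(
        "Avro schema cannot be converted to a Spark SQL StructType: %r" % (avro_schema,))

record = {"type": "record", "name": "r",
          "fields": [{"name": "id", "type": "int"}]}
print(to_struct_fields(record))  # [('id', 'int')]

try:
    to_struct_fields("int")  # a bare primitive at the top level
except RuntimeError as e:
    print(e)
```

Wrapping the value in a single-field record before writing the Avro file is the usual way to sidestep the limitation.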
[jira] [Commented] (SPARK-26796) Testcases failing with "org.apache.hadoop.fs.ChecksumException" error
[ https://issues.apache.org/jira/browse/SPARK-26796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758868#comment-16758868 ] Hyukjin Kwon commented on SPARK-26796: -- I'm unable to reproduce this in my local environment, and the tests seem to work fine in Jenkins. Can you run the tests via Maven or SBT? Let me leave this resolved until other people can reproduce it via Maven or SBT, not via an IDE. > Testcases failing with "org.apache.hadoop.fs.ChecksumException" error > - > > Key: SPARK-26796 > URL: https://issues.apache.org/jira/browse/SPARK-26796 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.2, 2.4.0 > Environment: Ubuntu 16.04 > Java Version > openjdk version "1.8.0_192" > OpenJDK Runtime Environment (build 1.8.0_192-b12_openj9) > Eclipse OpenJ9 VM (build openj9-0.11.0, JRE 1.8.0 Compressed References > 20181107_80 (JIT enabled, AOT enabled) > OpenJ9 - 090ff9dcd > OMR - ea548a66 > JCL - b5a3affe73 based on jdk8u192-b12) > > Hadoop Version > Hadoop 2.7.1 > Subversion Unknown -r Unknown > Compiled by test on 2019-01-29T09:09Z > Compiled with protoc 2.5.0 > From source with checksum 5e94a235f9a71834e2eb73fb36ee873f > This command was run using > /home/test/hadoop-release-2.7.1/hadoop-dist/target/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar > > > >Reporter: Anuja Jakhade >Priority: Major > > Observing test case failures due to a checksum error. > Below is the error log: > [ERROR] checkpointAndComputation(test.org.apache.spark.JavaAPISuite) Time > elapsed: 1.232 s <<< ERROR!
> org.apache.spark.SparkException: > Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor > driver): org.apache.hadoop.fs.ChecksumException: Checksum error: > file:/home/test/spark/core/target/tmp/1548319689411-0/fd0ba388-539c-49aa-bf76-e7d50aa2d1fc/rdd-0/part-0 > at 0 exp: 222499834 got: 1400184476 > at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:323) > at > org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:279) > at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:214) > at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:232) > at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196) > at java.io.DataInputStream.read(DataInputStream.java:149) > at > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2769) > at > java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2785) > at > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3262) > at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:968) > at java.io.ObjectInputStream.(ObjectInputStream.java:390) > at > org.apache.spark.serializer.JavaDeserializationStream$$anon$1.(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaDeserializationStream.(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122) > at > org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:300) > at > org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:813) > Driver stacktrace: > at > test.org.apache.spark.JavaAPISuite.checkpointAndComputation(JavaAPISuite.java:1243) > Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error: > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
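For background on this failure mode: Hadoop's local filesystem verifies data through a checksummed layer that stores one CRC32 per fixed-size chunk (512 bytes by default) in a side `.crc` file and recomputes each CRC at read time; a mismatch surfaces as the ChecksumException above, with the expected and actual values in the message. A minimal stdlib sketch of that chunked scheme (not Hadoop's actual on-disk format):

```python
import zlib

CHUNK = 512  # matches Hadoop's default 512-byte checksum chunk

def chunk_crcs(data):
    """One CRC32 per CHUNK-sized slice, as a checksummed filesystem would store."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data, stored):
    """Recompute CRCs on read and fail loudly on the first mismatch."""
    for n, (got, exp) in enumerate(zip(chunk_crcs(data), stored)):
        if got != exp:
            raise IOError(f"Checksum error at chunk {n}: exp {exp} got {got}")

payload = b"x" * 1300
crcs = chunk_crcs(payload)            # what gets written to the side .crc file
verify(payload, crcs)                 # clean read: no exception
try:
    verify(b"y" + payload[1:], crcs)  # data changed after checksums were stored
except IOError as e:
    print(e)                          # mirrors the "exp: ... got: ..." in the log
```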
[jira] [Commented] (SPARK-26791) Some Scala code doesn't display well and some description about foreachBatch is misleading
[ https://issues.apache.org/jira/browse/SPARK-26791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758867#comment-16758867 ] Hyukjin Kwon commented on SPARK-26791: -- Can you post a PR to improve the doc? > Some Scala code doesn't display well and some description about > foreachBatch is misleading > > > Key: SPARK-26791 > URL: https://issues.apache.org/jira/browse/SPARK-26791 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.0 > Environment: NA >Reporter: chaiyongqiang >Priority: Minor > Attachments: foreachBatch.jpg, multi-watermark.jpg > > > [Introduction about > foreachbatch|http://spark.apache.org/docs/2.4.0/structured-streaming-programming-guide.html#foreachbatch] > [Introduction about > policy-for-handling-multiple-watermarks|http://spark.apache.org/docs/2.4.0/structured-streaming-programming-guide.html#policy-for-handling-multiple-watermarks] > The introductions to foreachBatch and > policy-for-handling-multiple-watermarks don't display well with the Scala code. > Besides, the discussion of foreachBatch mentions an uncache API which doesn't > exist, which may be misleading. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26807) Confusing documentation regarding installation from PyPi
[ https://issues.apache.org/jira/browse/SPARK-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758866#comment-16758866 ] Hyukjin Kwon commented on SPARK-26807: -- Can you post a PR? > Confusing documentation regarding installation from PyPi > > > Key: SPARK-26807 > URL: https://issues.apache.org/jira/browse/SPARK-26807 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Emmanuel Arias >Priority: Minor > > Hello! > I am new to using Spark. Reading the documentation, I think the > Downloading section is a little confusing. > [https://spark.apache.org/docs/latest/#downloading|https://spark.apache.org/docs/latest/#downloading] > writes: "Scala and Java users can include Spark in their projects using its > Maven coordinates and in the future Python users can also install Spark from > PyPI.", which I interpret to mean that Spark is not on PyPI yet. But > [https://spark.apache.org/downloads.html] writes: > "[PySpark|https://pypi.python.org/pypi/pyspark] is now available in pypi. To > install just run {{pip install pyspark}}." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26651: Assignee: Maxim Gekk (was: Apache Spark) > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: ReleaseNote > > Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark to the Proleptic Gregorian calendar, and to use the java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purposes of switching > to the Proleptic Gregorian calendar is to conform to the SQL standard, which assumes > such a calendar. > *Release note:* > Spark 3.0 has switched to the Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps, as well as in extracting > sub-components like years, days, etc. It uses Java 8 API classes from the > java.time packages that are based on [ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might affect the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26651: Assignee: Apache Spark (was: Maxim Gekk) > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > Labels: ReleaseNote > > Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark to the Proleptic Gregorian calendar, and to use the java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purposes of switching > to the Proleptic Gregorian calendar is to conform to the SQL standard, which assumes > such a calendar. > *Release note:* > Spark 3.0 has switched to the Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps, as well as in extracting > sub-components like years, days, etc. It uses Java 8 API classes from the > java.time packages that are based on [ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might affect the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26651. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23722 [https://github.com/apache/spark/pull/23722] > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: ReleaseNote > Fix For: 3.0.0 > > > Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark to the Proleptic Gregorian calendar, and to use the java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purposes of switching > to the Proleptic Gregorian calendar is to conform to the SQL standard, which assumes > such a calendar. > *Release note:* > Spark 3.0 has switched to the Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps, as well as in extracting > sub-components like years, days, etc. It uses Java 8 API classes from the > java.time packages that are based on [ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might affect the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
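Python's datetime module, like java.time's IsoChronology, already uses the Proleptic Gregorian calendar, which makes the behavioral difference called out in the release note easy to see. In the hybrid (Julian + Gregorian) calendar, October 4, 1582 is followed directly by October 15, 1582; in the proleptic calendar the ten skipped dates exist:

```python
from datetime import date

# datetime uses the proleptic Gregorian calendar, like java.time's IsoChronology.
before, after = date(1582, 10, 4), date(1582, 10, 15)
print((after - before).days)  # 11: the "skipped" dates all exist here

# In the hybrid calendar, Oct 15, 1582 came the day after Oct 4, 1582,
# so a date like this one simply never existed:
print(date(1582, 10, 10))  # 1582-10-10, valid in the proleptic calendar
```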
[jira] [Updated] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26651: - Fix Version/s: (was: 3.0.0) > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: ReleaseNote > > Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark to the Proleptic Gregorian calendar, and to use the java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purposes of switching > to the Proleptic Gregorian calendar is to conform to the SQL standard, which assumes > such a calendar. > *Release note:* > Spark 3.0 has switched to the Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps, as well as in extracting > sub-components like years, days, etc. It uses Java 8 API classes from the > java.time packages that are based on [ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might affect the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-26651: -- > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: ReleaseNote > Fix For: 3.0.0 > > > Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark to the Proleptic Gregorian calendar, and to use the java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purposes of switching > to the Proleptic Gregorian calendar is to conform to the SQL standard, which assumes > such a calendar. > *Release note:* > Spark 3.0 has switched to the Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps, as well as in extracting > sub-components like years, days, etc. It uses Java 8 API classes from the > java.time packages that are based on [ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might affect the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18161) Default PickleSerializer pickle protocol doesn't handle > 4GB objects
[ https://issues.apache.org/jira/browse/SPARK-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-18161. -- Resolution: Fixed Fix Version/s: 3.0.0 This is fixed by upgrading cloudpickle at https://github.com/apache/spark/pull/20691 > Default PickleSerializer pickle protocol doesn't handle > 4GB objects > - > > Key: SPARK-18161 > URL: https://issues.apache.org/jira/browse/SPARK-18161 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.0.1 >Reporter: Sloane Simmons >Priority: Major > Fix For: 3.0.0 > > > When broadcasting a fairly large numpy matrix in a Spark 2.0.1 program, there > is an error serializing the object with: > {{OverflowError: cannot serialize a bytes object larger than 4 GiB}} > in the stack trace. > This is because Python's pickle serialization (with protocol <= 3) uses a > 32-bit integer for the object size, and so cannot handle objects larger than > 4 gigabytes. This was changed in Protocol 4 of pickle > (https://www.python.org/dev/peps/pep-3154/#bit-opcodes-for-large-objects) and > is available in Python 3.4+. > I would like to use this protocol for broadcasting and in the default > PickleSerializer where available to make pyspark more robust to broadcasting > large variables. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
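For context, the 4 GiB cap comes from the 32-bit length field used by the bytes opcodes in pickle protocols 3 and below; protocol 4 (PEP 3154, Python 3.4+) adds 8-byte-length opcodes such as BINBYTES8. A small sketch of selecting the protocol the way a pickle-based serializer might (the helper name is illustrative, not PySpark's actual API):

```python
import pickle

# Protocols <= 3 encode a bytes object's length in a 32-bit field, which is why
# broadcasting > 4 GiB fails with:
#   OverflowError: cannot serialize a bytes object larger than 4 GiB
# Protocol 4 adds 8-byte-length opcodes (e.g. BINBYTES8), lifting the cap.
def best_pickle_protocol():
    """Prefer protocol 4 where the interpreter supports it, as the ticket proposes."""
    return min(4, pickle.HIGHEST_PROTOCOL)

payload = {"matrix": b"\x00" * 4096}  # stand-in for a large numpy broadcast variable
blob = pickle.dumps(payload, protocol=best_pickle_protocol())
assert pickle.loads(blob) == payload
print(best_pickle_protocol())  # 4 on any Python >= 3.4
```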
[jira] [Commented] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
[ https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758825#comment-16758825 ] Rajesh Chandramohan commented on SPARK-21733: - It's based on the symptom of the actual issue: when there was a container limit in the YARN cluster, the already-spawned containers got a SIGTERM error: CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM > ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM > - > > Key: SPARK-21733 > URL: https://issues.apache.org/jira/browse/SPARK-21733 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.1 > Environment: Apache Spark 2.1.1 > CDH5.12.0 Yarn >Reporter: Jepson >Priority: Major > Original Estimate: 96h > Remaining Estimate: 96h > > Kafka + Spark Streaming throw this error: > {code:java} > 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored > as bytes in memory (estimated size 1895.0 B, free 1643.2 MB) > 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 8003 took 11 ms > 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as > values in memory (estimated size 2.9 KB, free 1643.2 MB) > 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the > same as ending offset skipping kssh 5 > 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 > (TID 64178). 
1740 bytes result sent to driver > 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002 > 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 64186 > 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 > (TID 64186) > 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast > variable 8004 > 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored > as bytes in memory (estimated size 1895.0 B, free 1643.2 MB) > 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 8004 took 8 ms > 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as > values in memory (estimated size 2.9 KB, free 1643.2 MB) > 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the > same as ending offset skipping kssh 5 > 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 > (TID 64186). 1740 bytes result sent to driver > h3. 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED > SIGNAL TERM > 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called > 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26812) PushProjectionThroughUnion nullability issue
Bogdan Raducanu created SPARK-26812: --- Summary: PushProjectionThroughUnion nullability issue Key: SPARK-26812 URL: https://issues.apache.org/jira/browse/SPARK-26812 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Bogdan Raducanu Union output data types are the output data types of the first child. However, the other union children may have different value nullability. This means that we can't always push down a project onto the children. To reproduce: {code} Seq(Map("foo" -> "bar")).toDF("a").write.saveAsTable("table1") sql("SELECT 1 AS b").write.saveAsTable("table2") sql("CREATE OR REPLACE VIEW test1 AS SELECT map() AS a FROM table2 UNION ALL SELECT a FROM table1") sql("select * from test1").show {code} This fails because the plan is no longer resolved. The plan is broken by the PushProjectionThroughUnion rule, which pushed down a cast to a map with value nullability=true onto a child whose type is a map with value nullability=false. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
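The invariant the rule violates can be modeled without Spark: a union's output column must be nullable if the corresponding column in any child is nullable (and the same widening applies recursively to nested nullability, such as the map value nullability in the report above), so a projection pushed into one child must not assume the first child's narrower nullability. A toy sketch of the correct widening (names are illustrative, not Spark internals):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    name: str
    dtype: str
    nullable: bool

def union_output(children):
    """Union keeps the first child's names and types, but a column must be
    nullable if it is nullable in ANY child: taking only the first child's
    nullability (what the buggy pushdown assumes) is too narrow."""
    first = children[0]
    return [
        Field(f.name, f.dtype, any(child[i].nullable for child in children))
        for i, f in enumerate(first)
    ]

child_a = [Field("a", "map<string,string>", nullable=False)]
child_b = [Field("a", "map<string,string>", nullable=True)]
print(union_output([child_a, child_b])[0].nullable)  # True
```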
[jira] [Resolved] (SPARK-26714) The job whose partition num is zero not shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-26714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26714. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23637 [https://github.com/apache/spark/pull/23637] > The job whose partition num is zero not shown in WebUI > - > > Key: SPARK-26714 > URL: https://issues.apache.org/jira/browse/SPARK-26714 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1, 2.4.0 >Reporter: deshanxiao >Assignee: deshanxiao >Priority: Minor > Fix For: 3.0.0 > > > When the job's partition count is zero, it will still get a job ID but is not shown in the > UI. I think that's strange. > Example: > mkdir /home/test/testdir > sc.textFile("/home/test/testdir") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26714) The job whose partition num is zero not shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-26714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26714: - Assignee: deshanxiao > The job whose partition num is zero not shown in WebUI > - > > Key: SPARK-26714 > URL: https://issues.apache.org/jira/browse/SPARK-26714 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1, 2.4.0 >Reporter: deshanxiao >Assignee: deshanxiao >Priority: Minor > > When the job's partition count is zero, it will still get a job ID but is not shown in the > UI. I think that's strange. > Example: > mkdir /home/test/testdir > sc.textFile("/home/test/testdir") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26771) Make .unpersist(), .destroy() consistently non-blocking by default
[ https://issues.apache.org/jira/browse/SPARK-26771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26771. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23685 [https://github.com/apache/spark/pull/23685] > Make .unpersist(), .destroy() consistently non-blocking by default > -- > > Key: SPARK-26771 > URL: https://issues.apache.org/jira/browse/SPARK-26771 > Project: Spark > Issue Type: Improvement > Components: GraphX, Spark Core >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > See https://issues.apache.org/jira/browse/SPARK-26728 and > https://github.com/apache/spark/pull/23650 . > RDD and DataFrame expose an .unpersist() method with optional "blocking" > argument. So does Broadcast.destroy(). This argument is false by default > except for the Scala RDD (not Pyspark) implementation and its GraphX > subclasses. Most usages of these methods request non-blocking behavior > already, and indeed, it's not typical to want to wait for the resources to be > freed, except in tests asserting behavior about these methods (where blocking > is typically requested). > This proposes to make the default false across these methods, and adjust > callers to only request non-default blocking behavior where important, such > as in a few key tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26754) Add hasTrainingSummary to replace duplicate code in PySpark
[ https://issues.apache.org/jira/browse/SPARK-26754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26754: - Assignee: Huaxin Gao > Add hasTrainingSummary to replace duplicate code in PySpark > --- > > Key: SPARK-26754 > URL: https://issues.apache.org/jira/browse/SPARK-26754 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > Python version of https://issues.apache.org/jira/browse/SPARK-20351. > Add HasTrainingSummary to avoid code duplicate related to training summary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
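For readers unfamiliar with the pattern: the deduplication is a mixin that centralizes the summary accessors each model class had been re-implementing. A toy, pure-Python sketch in that spirit (class and attribute names are illustrative, not PySpark's actual API):

```python
class HasTrainingSummary:
    """Shared accessors for a model's training summary, so every model class
    stops re-implementing the same two properties."""
    _summary = None

    @property
    def hasSummary(self):
        return self._summary is not None

    @property
    def summary(self):
        if self._summary is None:
            raise RuntimeError("No training summary available for this model")
        return self._summary

class ToyModel(HasTrainingSummary):
    """Stand-in for a fitted estimator's model class."""
    def fit(self):
        self._summary = {"accuracy": 0.9}  # placeholder metrics
        return self

m = ToyModel()
print(m.hasSummary)        # False before fit()
print(m.fit().hasSummary)  # True afterwards
```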
[jira] [Resolved] (SPARK-26754) Add hasTrainingSummary to replace duplicate code in PySpark
[ https://issues.apache.org/jira/browse/SPARK-26754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26754. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23676 [https://github.com/apache/spark/pull/23676] > Add hasTrainingSummary to replace duplicate code in PySpark > --- > > Key: SPARK-26754 > URL: https://issues.apache.org/jira/browse/SPARK-26754 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Python version of https://issues.apache.org/jira/browse/SPARK-20351. > Add HasTrainingSummary to avoid code duplicate related to training summary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv
[ https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758677#comment-16758677 ] vishnuram selvaraj commented on SPARK-26786: Thanks [~hyukjin.kwon]. I have raised a git issue (https://github.com/uniVocity/univocity-parsers/issues/308) in the univocity project as well. I will post any updates I get from there here. > Handle to treat escaped newline characters('\r','\n') in spark csv > -- > > Key: SPARK-26786 > URL: https://issues.apache.org/jira/browse/SPARK-26786 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark, SQL >Affects Versions: 2.3.0 >Reporter: vishnuram selvaraj >Priority: Major > > There are some systems, like AWS Redshift, which write CSV files by escaping > newline characters ('\r','\n') in addition to escaping the quote characters, > if they come as part of the data. > Redshift documentation > link ([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html]), and > below is their description of the escaping requirements from that link: > ESCAPE > For CHAR and VARCHAR columns in delimited unload files, an escape character > ({{\}}) is placed before every occurrence of the following characters: > * Linefeed: {{\n}} > * Carriage return: {{\r}} > * The delimiter character specified for the unloaded data. > * The escape character: {{\}} > * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are > specified in the UNLOAD command). > > *Problem statement:* > But the Spark CSV reader doesn't have an option to treat/remove the escape > characters in front of the newline characters in the data. > It would really help if we could add a feature to handle the escaped newline > characters through another parameter like (escapeNewline = 'true/false'). > *Example:* > Below are the details of my test data set up in a file. 
> * The first record in that file has an escaped Windows newline character (\r\n) > * The third record in that file has an escaped Unix newline character (\n) > * The fourth record in that file has the escaped quote character (") > The file looks like this in a vi editor: > > {code:java} > "1","this is \^M\ > line1"^M > "2","this is line2"^M > "3","this is \ > line3"^M > "4","this is \" line4"^M > "5","this is line5"^M{code} > > When I read the file with Python's csv module using an escape character, it is able to remove > the added escape characters, as you can see below: > > {code:java} > >>> with open('/tmp/test3.csv','r') as readCsv: > ... readFile = > csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False) > ... for row in readFile: > ... print(row) > ... > ['1', 'this is \r\n line1'] > ['2', 'this is line2'] > ['3', 'this is \n line3'] > ['4', 'this is " line4'] > ['5', 'this is line5'] > {code} > But if I read the same file with the Spark CSV reader, the escape characters > in front of the newline characters are not removed. But the escape before the > (") is removed. > {code:java} > >>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false') > >>> redDf.show() > +---+--+ > |_c0| _c1| > +---+--+ > | 1|this is \ > line1| > | 2| this is line2| > | 3| this is \ > line3| > | 4| this is " line4| > | 5| this is line5| > +---+--+ > {code} > *Expected result:* > {code:java} > +---+--+ > |_c0| _c1| > +---+--+ > | 1|this is > line1| > | 2| this is line2| > | 3| this is > line3| > | 4| this is " line4| > | 5| this is line5| > +---+--+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
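Until such an option exists, one hedged workaround is to normalize the file before handing it to any CSV parser: strip the escape character only where it precedes a newline, leaving the escaped quote and escape characters for the parser to interpret. A stdlib-only sketch under the Redshift escaping rules described above (the inline sample data stands in for the reporter's file):

```python
import csv
import io
import re

# Sample data in the Redshift style described above: backslash escapes
# newlines and quotes inside quoted fields.
raw = ('"1","this is \\\r\n line1"\r\n'
       '"2","this is line2"\r\n'
       '"3","this is \\\n line3"\r\n'
       '"4","this is \\" line4"\r\n'
       '"5","this is line5"\r\n')

# Drop the escape character only where it precedes a newline; leave \" alone
# so the CSV parser can still interpret it.
normalized = re.sub(r"\\(?=\r\n|\r|\n)", "", raw)

rows = list(csv.reader(io.StringIO(normalized), quotechar='"',
                       escapechar='\\', doublequote=False))
print(rows[0])  # ['1', 'this is \r\n line1']: the newline survives, unescaped
```

The same normalization could be done as a streaming pre-pass over large files before Spark reads them, at the cost of an extra copy of the data.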
[jira] [Commented] (SPARK-24541) TCP based shuffle
[ https://issues.apache.org/jira/browse/SPARK-24541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758673#comment-16758673 ] Jungtaek Lim commented on SPARK-24541: -- Same understanding here: I think there is little chance we would want to deal with a lower level than Netty, but we may also want to send an amount of data close to the size of the data structure. Btw, I don't know which mechanism Spark leverages to send (pull) shuffle data; whichever it is, it should be fine to leverage it here as well, since it has presumably already been vetted for performance, security, etc. > TCP based shuffle > - > > Key: SPARK-24541 > URL: https://issues.apache.org/jira/browse/SPARK-24541 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-26651: --- Labels: ReleaseNote (was: ) > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: ReleaseNote > > Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark on Proleptic Gregorian calendar, and use java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purpose of switching > on Proleptic Gregorian calendar is to conform to SQL standard which supposes > such calendar. > Release notes: > Spark 3.0 has switched on Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps as well as in extracting > sub-components like years, days and etc. It uses Java 8 API classes from the > java.time packages that based on [ISO chronology > |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations by using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might impact on the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-26651: --- Description: Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in date/timestamp parsing, functions and expressions. The ticket aims to switch Spark on Proleptic Gregorian calendar, and use java.time classes introduced in Java 8 for timestamp/date manipulations. One of the purpose of switching on Proleptic Gregorian calendar is to conform to SQL standard which supposes such calendar. *Release note:* Spark 3.0 has switched on Proleptic Gregorian calendar in parsing, formatting, and converting dates and timestamps as well as in extracting sub-components like years, days and etc. It uses Java 8 API classes from the java.time packages that based on [ISO chronology |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. Previous versions of Spark performed those operations by using [the hybrid calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] (Julian + Gregorian). The changes might impact on the results for dates and timestamps before October 15, 1582 (Gregorian). was: Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in date/timestamp parsing, functions and expressions. The ticket aims to switch Spark on Proleptic Gregorian calendar, and use java.time classes introduced in Java 8 for timestamp/date manipulations. One of the purpose of switching on Proleptic Gregorian calendar is to conform to SQL standard which supposes such calendar. Release notes: Spark 3.0 has switched on Proleptic Gregorian calendar in parsing, formatting, and converting dates and timestamps as well as in extracting sub-components like years, days and etc. It uses Java 8 API classes from the java.time packages that based on [ISO chronology |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. 
Previous versions of Spark performed those operations by using [the hybrid calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] (Julian + Gregorian). The changes might impact on the results for dates and timestamps before October 15, 1582 (Gregorian). > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: ReleaseNote > > Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark on Proleptic Gregorian calendar, and use java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purpose of switching > on Proleptic Gregorian calendar is to conform to SQL standard which supposes > such calendar. > *Release note:* > Spark 3.0 has switched on Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps as well as in extracting > sub-components like years, days and etc. It uses Java 8 API classes from the > java.time packages that based on [ISO chronology > |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations by using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might impact on the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-26651: --- Description: Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in date/timestamp parsing, functions and expressions. The ticket aims to switch Spark on Proleptic Gregorian calendar, and use java.time classes introduced in Java 8 for timestamp/date manipulations. One of the purpose of switching on Proleptic Gregorian calendar is to conform to SQL standard which supposes such calendar. Release notes: Spark 3.0 has switched on Proleptic Gregorian calendar in parsing, formatting, and converting dates and timestamps as well as in extracting sub-components like years, days and etc. It uses Java 8 API classes from the java.time packages that based on [ISO chronology |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. Previous versions of Spark performed those operations by using [the hybrid calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] (Julian + Gregorian). The changes might impact on the results for dates and timestamps before October 15, 1582 (Gregorian). was:Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in date/timestamp parsing, functions and expressions. The ticket aims to switch Spark on Proleptic Gregorian calendar, and use java.time classes introduced in Java 8 for timestamp/date manipulations. One of the purpose of switching on Proleptic Gregorian calendar is to conform to SQL standard which supposes such calendar. 
> Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark on Proleptic Gregorian calendar, and use java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purpose of switching > on Proleptic Gregorian calendar is to conform to SQL standard which supposes > such calendar. > Release notes: > Spark 3.0 has switched on Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps as well as in extracting > sub-components like years, days and etc. It uses Java 8 API classes from the > java.time packages that based on [ISO chronology > |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations by using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might impact on the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
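For reference, the calendar difference the ticket describes can be illustrated with Python's `datetime`, which already implements the Proleptic Gregorian calendar that Spark 3.0 switches to:

```python
from datetime import date, timedelta

# Python's datetime uses the Proleptic Gregorian calendar, so there is no
# Gregorian cutover gap in October 1582:
d = date(1582, 10, 4) + timedelta(days=1)
print(d)  # 1582-10-05

# Under the hybrid Julian + Gregorian calendar of java.util.GregorianCalendar
# (used by Spark 2.4 and earlier), October 4, 1582 is followed directly by
# October 15, 1582 -- which is why results for old dates and timestamps can
# differ between the two calendars.
```

This is exactly the behavior change called out in the release note for dates before October 15, 1582.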
[jira] [Created] (SPARK-26811) Add DataSourceV2 capabilities to check support for batch append, overwrite, truncate during analysis.
Ryan Blue created SPARK-26811: - Summary: Add DataSourceV2 capabilities to check support for batch append, overwrite, truncate during analysis. Key: SPARK-26811 URL: https://issues.apache.org/jira/browse/SPARK-26811 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Ryan Blue -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26651: Assignee: Maxim Gekk (was: Apache Spark) > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark on Proleptic Gregorian calendar, and use java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purpose of switching > on Proleptic Gregorian calendar is to conform to SQL standard which supposes > such calendar. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26651: Assignee: Apache Spark (was: Maxim Gekk) > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark on Proleptic Gregorian calendar, and use java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purpose of switching > on Proleptic Gregorian calendar is to conform to SQL standard which supposes > such calendar. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26806: - Affects Version/s: 2.3.3 > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: liancheng >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.3.3, 2.4.1, 3.0.0, 2.2.4 > > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code} > This issue was reported by [~liancheng] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26810) Fixing SPARK-25072 broke existing code and fails to show error message
Arttu Voutilainen created SPARK-26810: - Summary: Fixing SPARK-25072 broke existing code and fails to show error message Key: SPARK-26810 URL: https://issues.apache.org/jira/browse/SPARK-26810 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.0 Reporter: Arttu Voutilainen Hey, We upgraded Spark recently, and https://issues.apache.org/jira/browse/SPARK-25072 caused our pipeline to fail after the upgrade. Annoyingly, the error message formatting also threw an exception itself, thus hiding the message we should have seen. Repro using gettyimages/docker-spark, on 2.4.0: {code} from pyspark.sql import Row r = Row(['a','b']) r('1', '2') {code} {code} Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__ "but got %s" % (self, len(self), args)) File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__ return "<Row(%s)>" % ", ".join(self) TypeError: sequence item 0: expected str instance, list found {code} On 2.3.1, and also showing how this was used: {code} from pyspark.sql import Row, types as T r = Row(['a','b']) df = spark.createDataFrame([Row(col='doesntmatter')]) rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')]) spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), T.StructField('b', T.StringType())])).collect() {code} {code} [Row(a='a1', b='b2'), Row(a='a1', b='b2')] {code} While I do think the code we had was quite horrible, it used to work. The unexpected error came from __repr__ as it assumes that the arguments given to the Row constructor are strings. That sounds like a reasonable assumption; maybe the Row constructor should validate that it holds? (I guess that might be another potentially breaking change though, if someone has as weird code as this one...) 
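The masking part of the bug can be reproduced without pyspark at all; the sketch below (plain Python, with a hypothetical `stored` list standing in for what the Row ends up holding) shows why the `", ".join(self)` inside `__repr__` raises its own TypeError and hides the original error message:

```python
# Row(['a','b']) stores a single list element rather than two field names,
# so any str.join over the stored values fails on the list item.
stored = [['a', 'b']]
try:
    ", ".join(stored)
except TypeError as e:
    print(e)  # sequence item 0: expected str instance, list found
```

That is the exact TypeError from the traceback, raised while formatting the *real* error about the wrong number of arguments.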
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26806: - Fix Version/s: (was: 2.3.3) 2.3.4 > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: liancheng >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.3.4, 2.4.1, 3.0.0, 2.2.4 > > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code} > This issue was reported by [~liancheng] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26806: - Affects Version/s: 2.2.2 2.2.3 > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: liancheng >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.3.3, 2.4.1, 3.0.0, 2.2.4 > > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code} > This issue was reported by [~liancheng] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-26806. -- Resolution: Fixed Fix Version/s: 3.0.0 2.4.1 2.3.3 2.2.4 > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: liancheng >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.2.4, 2.3.3, 2.4.1, 3.0.0 > > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code} > This issue was reported by [~liancheng] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
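A hedged Python sketch (hypothetical names; Spark's actual implementation is Scala) of how a weighted-average merge produces NaN for zero.merge(zero) and then propagates it through every later merge:

```python
import math

# On the JVM, 0.0 / 0.0 evaluates to NaN instead of raising; math.nan
# stands in for that here.
def merge(a, b):
    count = a["count"] + b["count"]
    if count == 0:
        avg = math.nan  # JVM: (0.0 * 0 + 0.0 * 0) / 0 == NaN
    else:
        avg = (a["avg"] * a["count"] + b["avg"] * b["count"]) / count
    return {"avg": avg, "count": count}

zero = {"avg": 0.0, "count": 0}
poisoned = merge(zero, zero)                        # avg becomes NaN
later = merge(poisoned, {"avg": 5.0, "count": 2})   # NaN * 0 is still NaN
print(math.isnan(later["avg"]))  # True
# Finally, Scala's Double.NaN.toLong is 0, which renders as the bogus
# "1970-01-01T00:00:00.000Z" timestamps in the report above.
```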
[jira] [Commented] (SPARK-24961) sort operation causes out of memory
[ https://issues.apache.org/jira/browse/SPARK-24961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758557#comment-16758557 ] Mono Shiro commented on SPARK-24961: Spark Version 2.3.2. I have a very similar issue when simply reading a file that is bigger than the available memory on my machine. Changing the StorageLevel to DISK_ONLY also blows up despite having ample space. [Please see the question on stackoverflow|https://stackoverflow.com/questions/54469243/spark-storagelevel-in-local-mode-not-working/54470393#54470393] It's important that local mode work for this sort of thing. > sort operation causes out of memory > > > Key: SPARK-24961 > URL: https://issues.apache.org/jira/browse/SPARK-24961 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.1 > Environment: Java 1.8u144+ > Windows 10 > Spark 2.3.1 in local mode > -Xms4g -Xmx4g > optional: -XX:+UseParallelOldGC >Reporter: Markus Breuer >Priority: Major > > A sort operation on a large rdd - which does not fit in memory - causes an out of > memory exception. I made the effect reproducible by a sample; the sample > creates large objects of about 2 MB size. When saving the result the oom occurs. I > tried several StorageLevels, but if memory is included (MEMORY_AND_DISK, > MEMORY_AND_DISK_SER, none) the application runs out of memory. Only DISK_ONLY > seems to work. > When replacing sort() with sortWithinPartitions() no StorageLevel is required > and the application succeeds. 
> {code:java} > package de.bytefusion.examples; > import breeze.storage.Storage; > import de.bytefusion.utils.Options; > import org.apache.hadoop.io.MapFile; > import org.apache.hadoop.io.SequenceFile; > import org.apache.hadoop.io.Text; > import org.apache.hadoop.mapred.SequenceFileOutputFormat; > import org.apache.spark.api.java.JavaRDD; > import org.apache.spark.api.java.JavaSparkContext; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.StructType; > import org.apache.spark.storage.StorageLevel; > import scala.Tuple2; > import static org.apache.spark.sql.functions.*; > import java.util.ArrayList; > import java.util.List; > import java.util.UUID; > import java.util.stream.Collectors; > import java.util.stream.IntStream; > public class Example3 { > public static void main(String... args) { > // create spark session > SparkSession spark = SparkSession.builder() > .appName("example1") > .master("local[4]") > .config("spark.driver.maxResultSize","1g") > .config("spark.driver.memory","512m") > .config("spark.executor.memory","512m") > .config("spark.local.dir","d:/temp/spark-tmp") > .getOrCreate(); > JavaSparkContext sc = > JavaSparkContext.fromSparkContext(spark.sparkContext()); > // base to generate huge data > List list = new ArrayList<>(); > for (int val = 1; val < 1; val++) { > int valueOf = Integer.valueOf(val); > list.add(valueOf); > } > // create simple rdd of int > JavaRDD rdd = sc.parallelize(list,200); > // use map to create large object per row > JavaRDD rowRDD = > rdd > .map(value -> > RowFactory.create(String.valueOf(value), > createLongText(UUID.randomUUID().toString(), 2 * 1024 * 1024))) > // no persist => out of memory exception on write() > // persist MEMORY_AND_DISK => out of memory exception > on write() > // persist MEMORY_AND_DISK_SER => out of memory > 
exception on write() > // persist(StorageLevel.DISK_ONLY()) > ; > StructType type = new StructType(); > type = type > .add("c1", DataTypes.StringType) > .add( "c2", DataTypes.StringType ); > Dataset df = spark.createDataFrame(rowRDD, type); > // works > df.show(); > df = df > .sort(col("c1").asc() ) > ; > df.explain(); > // takes a lot of time but works > df.show(); > // OutOfMemoryError: java heap space > df > .write() > .mode("overwrite") > .csv("d:/temp/my.csv"); > // OutOfMemoryError: java heap space > df > .toJavaRDD() > .mapToPair(row -> new Tuple2(new Text(row.getString(0)), new > Text( row.getString(1 > .saveAsHadoopFile("d:\\temp\\foo", Text.class,
[jira] [Commented] (SPARK-24541) TCP based shuffle
[ https://issues.apache.org/jira/browse/SPARK-24541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758488#comment-16758488 ] Jose Torres commented on SPARK-24541: - I'm not gonna lie, I didn't put a tremendous amount of thought into the title of the Jira ticket. There's a strong argument that using Netty is indeed the right decision here. (Although we have to keep scalability in mind; we'll eventually need to do some kind of multiplexing to support even moderately sized N to N shuffles, so we should probably stay compatible with that.) I'd guess that the RPC framework does carry a performance penalty from things such as extra headers, but I'd argue the major disadvantage is that it's not the right abstraction layer. RPCs normally live exclusively in the control plane. > TCP based shuffle > - > > Key: SPARK-24541 > URL: https://issues.apache.org/jira/browse/SPARK-24541 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24541) TCP based shuffle
[ https://issues.apache.org/jira/browse/SPARK-24541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758482#comment-16758482 ] Imran Rashid commented on SPARK-24541: -- well, rpc is over tcp, so I'm still not really sure what this means. Is the point sending raw data directly over sockets? I'd be interested in knowing what the purpose is. I guess to avoid the overhead associated w/ the extra headers etc from the rpc framework? And if this is really going to try to use raw sockets, not through netty, then you'd have to reimplement encryption, manage your own buffers, etc. > TCP based shuffle > - > > Key: SPARK-24541 > URL: https://issues.apache.org/jira/browse/SPARK-24541 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758351#comment-16758351 ] Gabor Somogyi commented on SPARK-23685: --- [~sindiri] We've tried to reproduce the issue without success; do you have example code? > Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive > Offsets (i.e. Log Compaction) > - > > Key: SPARK-23685 > URL: https://issues.apache.org/jira/browse/SPARK-23685 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: sirisha >Priority: Major > > When Kafka does log compaction, offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always > be just an increment of 1. If not, it throws the below exception: > > "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). > Some data may have been lost because they are not available in Kafka any > more; either the data was aged out by Kafka or the topic may have been > deleted before all the data in the topic was processed. If you don't want > your streaming query to fail on such cases, set the source option > "failOnDataLoss" to "false". " > > FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
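A minimal Python sketch (hypothetical function, not the actual KafkaSourceRDD/CachedKafkaConsumer code) of the consecutive-offset assumption that log compaction violates:

```python
# The fetch loop effectively expects each record's offset to be exactly
# previous + 1; a compacted partition has gaps, so the check fires.
def fetch(offsets, start, end):
    expected = start
    for off in offsets:
        if off != expected:
            raise RuntimeError(
                f"Cannot fetch records in [{start}, {end}): got offset {off}, "
                f"expected {expected}")
        expected = off + 1

compacted = [5589, 5590, 5592]  # offset 5591 was removed by compaction
try:
    fetch(compacted, 5589, 5693)
except RuntimeError as e:
    print(e)
```

The resulting error mirrors the "Cannot fetch records in [5589, 5693)" message quoted above, even though no data was actually lost.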
[jira] [Commented] (SPARK-26783) Kafka parameter documentation doesn't match with the reality (upper/lowercase)
[ https://issues.apache.org/jira/browse/SPARK-26783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758348#comment-16758348 ] Gabor Somogyi commented on SPARK-26783: --- [~zsxwing] [~kabhwan] The more I play with this, the more I think there could be different issues involved (not sure whether they affect each other). 1. "failOnDataLoss": I'll ask the reporter on SPARK-23685 because I've not yet been able to repro. Let's see whether the code or the doc has to be updated. 2. Generic data source implementation issue. Namely, the API doesn't guarantee lowercase params, but the user code depends on that. For example [this|https://github.com/apache/spark/blob/aea5f506463c19fac97547ba7a28f9dd491e3a6a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L66], but there are other places. Not sure it has anything to do with the first, but it could potentially cause such issues. > Kafka parameter documentation doesn't match with the reality (upper/lowercase) > -- > > Key: SPARK-26783 > URL: https://issues.apache.org/jira/browse/SPARK-26783 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > A good example for this is "failOnDataLoss" which is reported in SPARK-23685. > I've just checked and there are several other parameters which suffer from > the same issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
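A small Python sketch (the normalization step is hypothetical, not Spark's actual code) of how an upper/lowercase mismatch between documentation and implementation can silently drop a user-supplied option:

```python
# The user passes the documented camelCase name:
user_options = {"failOnDataLoss": "false"}

# Hypothetical normalization, as a data source implementation might apply:
normalized = {k.lower(): v for k, v in user_options.items()}

# A verbatim lookup against the normalized map misses the user's setting
# and silently falls back to the default:
print(normalized.get("failOnDataLoss", "true"))  # true

# A case-insensitive lookup finds it:
print(normalized.get("failOnDataLoss".lower()))  # false
```

Either the docs must state the exact casing the code expects, or the lookup must be made case-insensitive end to end.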
[jira] [Updated] (SPARK-26734) StackOverflowError on WAL serialization caused by large receivedBlockQueue
[ https://issues.apache.org/jira/browse/SPARK-26734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Somogyi updated SPARK-26734:
----------------------------------
    Component/s: DStreams

> StackOverflowError on WAL serialization caused by large receivedBlockQueue
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26734
>                 URL: https://issues.apache.org/jira/browse/SPARK-26734
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, DStreams
>    Affects Versions: 2.3.1, 2.3.2, 2.4.0
>         Environment: spark 2.4.0 streaming job
> java 1.8
> scala 2.11.12
>            Reporter: Ross M. Lodge
>            Priority: Major
>
> We encountered an intermittent StackOverflowError with a stack trace similar to:
>
> {noformat}
> Exception in thread "JobGenerator" java.lang.StackOverflowError
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509){noformat}
> The name of the thread has been seen to be either "JobGenerator" or
> "streaming-start", depending on when in the lifecycle of the job the problem
> occurs. It appears to only occur in streaming jobs with checkpointing and
> WAL enabled; this has prevented us from upgrading to v2.4.0.
>
> Via debugging, we tracked this down to allocateBlocksToBatch in
> ReceivedBlockTracker:
> {code:java}
> /**
>  * Allocate all unallocated blocks to the given batch.
>  * This event will get written to the write ahead log (if enabled).
>  */
> def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
>   if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
>     val streamIdToBlocks = streamIds.map { streamId =>
>       (streamId, getReceivedBlockQueue(streamId).clone())
>     }.toMap
>     val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
>     if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
>       streamIds.foreach(getReceivedBlockQueue(_).clear())
>       timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
>       lastAllocatedBatchTime = batchTime
>     } else {
>       logInfo(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery")
>     }
>   } else {
>     // This situation occurs when:
>     // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
>     //    possibly processed batch job or half-processed batch job need to be processed again,
>     //    so the batchTime will be equal to lastAllocatedBatchTime.
>     // 2. Slow checkpointing makes recovered batch time older than WAL recovered
>     //    lastAllocatedBatchTime.
>     // This situation will only occurs in recovery time.
>     logInfo(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery")
>   }
> }
> {code}
> Prior to 2.3.1, this code did
> {code:java}
> getReceivedBlockQueue(streamId).dequeueAll(x => true){code}
> but it was changed as part of SPARK-23991 to
> {code:java}
> getReceivedBlockQueue(streamId).clone(){code}
> We've not been able to reproduce this in a test of the actual above method,
> but we've been able to produce a test that reproduces it by putting a lot of
> values into the queue:
>
> {code:java}
> class SerializationFailureTest extends FunSpec {
>   private val logger = LoggerFactory.getLogger(getClass)
>
>   private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]
>
>   describe("Queue") {
>     it("should be serializable") {
>       runTest(1062)
>     }
>     it("should not be serializable") {
>       runTest(1063)
>     }
>     it("should DEFINITELY not be serializable") {
>       runTest(199952)
>     }
>   }
>
>   private def runTest(mx: Int): Array[Byte] = {
>     try {
>       val random = new scala.util.Random()
>       val queue = new ReceivedBlockQueue()
>       for (_ <- 0 until mx) {
>         queue += ReceivedBlockInfo(
>           streamId = 0,
>           numRecords = Some(random.nextInt(5)),
>           metadataOption = None,
>           blockStoreResult = WriteAheadLogBasedStoreResult(
>             blockId = StreamBlockId(0, random.nextInt()),
>             numRecords = Some(random.nextInt(5)),
>             walRecordHandle = FileBasedWriteAheadLogSegment(
>               path = s"""hdfs://foo.bar.com:8080/spark/streaming/BAZ/7/receivedData/0/log-${random.nextInt()}-${random.nextInt()}""",
>               offset = random.nextLong(),
>               length = random.nextInt()
>             )
>           )
>         )
>       }
>       val record = BatchAllocationEvent(
>         Time(154832040L), AllocatedBlocks(
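The failure mode in the report, default Java serialization recursing once per element of a linked structure, can be reproduced without Spark. A hypothetical minimal sketch (the `Node` class and the chain lengths are illustrative, not Spark code):

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

// A hand-rolled singly linked list with no custom writeObject: default
// serialization follows the "next" reference recursively, consuming one
// group of stack frames per node, so a long enough chain overflows the
// stack. scala.collection.mutable.Queue (pre-2.13) is backed by linked
// cells in the same way, which matches the reported stack trace.
class Node implements Serializable {
    int value;
    Node next;
}

class WalOverflowSketch {
    static Node build(int n) {
        Node head = null;
        for (int i = 0; i < n; i++) {
            Node node = new Node();
            node.value = i;
            node.next = head;
            head = node;
        }
        return head;
    }

    static boolean serializes(Node head) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(OutputStream.nullOutputStream())) {
            out.writeObject(head);
            return true;
        } catch (IOException e) {
            return false;
        } catch (StackOverflowError e) {
            // The exact threshold depends on -Xss; the failure mode does not.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(build(100)));     // short chain: fine
        System.out.println(serializes(build(500_000))); // long chain: overflows
    }
}
```

This also suggests why `dequeueAll(x => true)` masked the problem: the WAL record then held a flat sequence rather than the cloned linked queue. Collections that must survive serialization at scale typically need an array-backed representation or a custom iterative `writeObject`.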
[jira] [Created] (SPARK-26809) insert overwrite directory + concat function => error
ant_nebula created SPARK-26809:
----------------------------------

             Summary: insert overwrite directory + concat function => error
                 Key: SPARK-26809
                 URL: https://issues.apache.org/jira/browse/SPARK-26809
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: ant_nebula

insert overwrite directory '/tmp/xx' select concat(col1, col2) from tableXX limit 3

Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 3 elements while columns.types has 2 elements!
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:119)
at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:233)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[jira] [Assigned] (SPARK-26797) Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one
[ https://issues.apache.org/jira/browse/SPARK-26797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26797:
------------------------------------
    Assignee: (was: Apache Spark)

> Start using the new logical types API of Parquet 1.11.0 instead of the
> deprecated one
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26797
>                 URL: https://issues.apache.org/jira/browse/SPARK-26797
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Zoltan Ivanfi
>            Priority: Major
>
> The 1.11.0 release of parquet-mr will deprecate its logical type API in
> favour of a newly introduced one. The new API also introduces new subtypes
> for different timestamp semantics, support for which should be added to Spark
> in order to read those types correctly.
> At this point only a release candidate of parquet-mr 1.11.0 is available, but
> that already allows implementing and reviewing this change.
[jira] [Assigned] (SPARK-26797) Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one
[ https://issues.apache.org/jira/browse/SPARK-26797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26797:
------------------------------------
    Assignee: Apache Spark

> Start using the new logical types API of Parquet 1.11.0 instead of the
> deprecated one
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26797
>                 URL: https://issues.apache.org/jira/browse/SPARK-26797
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Zoltan Ivanfi
>            Assignee: Apache Spark
>            Priority: Major
>
> The 1.11.0 release of parquet-mr will deprecate its logical type API in
> favour of a newly introduced one. The new API also introduces new subtypes
> for different timestamp semantics, support for which should be added to Spark
> in order to read those types correctly.
> At this point only a release candidate of parquet-mr 1.11.0 is available, but
> that already allows implementing and reviewing this change.
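For context on the "new subtypes for different timestamp semantics" mentioned in the issue, the parquet-format logical types specification distinguishes instant (UTC-adjusted) from local timestamps. An illustrative sketch in Parquet's message syntax; the field names are hypothetical:

```text
message spark_schema {
  // The deprecated API could only express: required int64 ts (TIMESTAMP_MICROS);
  // The new LogicalTypeAnnotation API carries an isAdjustedToUTC flag:
  required int64 ts_instant (TIMESTAMP(MICROS,true));   // instant semantics, adjusted to UTC
  required int64 ts_local   (TIMESTAMP(MICROS,false));  // local semantics, not adjusted
}
```

A reader that ignores the flag and treats both columns alike will silently shift local timestamps, which is why the issue argues Spark should understand the new annotations before consuming files written with them.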
[jira] [Commented] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down
[ https://issues.apache.org/jira/browse/SPARK-23155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758102#comment-16758102 ]

Gera Shegalov commented on SPARK-23155:
---------------------------------------

[~kabhwan], [~vanzin] I would still be interested in being able to use the new mechanism with the old logs. [https://github.com/apache/spark/pull/23720] is a quick draft to demo how we could achieve this flexibly with named capture groups.

> YARN-aggregated executor/driver logs appear unavailable when NM is down
> -------------------------------------------------------------------------
>
>                 Key: SPARK-23155
>                 URL: https://issues.apache.org/jira/browse/SPARK-23155
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy
>    Affects Versions: 2.2.1
>            Reporter: Gera Shegalov
>            Assignee: Jungtaek Lim
>            Priority: Major
>             Fix For: 3.0.0
>
> Unlike the MapReduce JobHistory Server, the Spark history server does not
> rewrite container log URLs to point to the aggregated yarn.log.server.url
> location; it relies on the NodeManager web UI to trigger a redirect. This
> fails when the NM is down. Note that the NM may be down permanently, either
> after decommissioning in traditional environments or in a cloud environment
> such as AWS EMR, where worker nodes are taken away by autoscaling or the
> whole cluster is used to run a single job.
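The named-capture-group approach mentioned in the comment can be sketched independently of the linked PR. A hypothetical Java example, not the actual draft (the URL shapes, host names, and target layout are assumptions for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: a configured regex with named groups pulls the pieces out of an
// old-style NodeManager container log URL, and the pieces are re-assembled
// into a log-server URL that stays valid after the NM goes away.
class LogUrlRewrite {
    // Matches e.g. http://node1:8042/node/containerlogs/container_.../user
    private static final Pattern NM_URL = Pattern.compile(
        "https?://(?<host>[^:/]+):(?<port>\\d+)/node/containerlogs/"
            + "(?<container>[^/]+)/(?<user>[^/]+).*");

    static String rewrite(String nmUrl, String logServerUrl) {
        Matcher m = NM_URL.matcher(nmUrl);
        if (!m.matches()) {
            return nmUrl; // unknown shape: leave the URL untouched
        }
        return String.format("%s/%s:%s/%s/%s/%s",
            logServerUrl, m.group("host"), m.group("port"),
            m.group("container"), m.group("container"), m.group("user"));
    }

    public static void main(String[] args) {
        System.out.println(rewrite(
            "http://node1.example.com:8042/node/containerlogs/container_1_0001_01_000002/spark",
            "http://history.example.com:19888/jobhistory/logs"));
    }
}
```

Because the groups are named rather than positional, the same rewrite code works for any cluster-specific URL pattern the operator configures, which is the flexibility the comment is after.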
[jira] [Commented] (SPARK-26792) Apply custom log URL to Spark UI
[ https://issues.apache.org/jira/browse/SPARK-26792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758095#comment-16758095 ]

Gera Shegalov commented on SPARK-26792:
---------------------------------------

[~kabhwan] thanks for doing this work. I verified that I can configure the SHS so it satisfies our use case. Changing the default in Spark is a nice-to-have but not a high priority from my perspective.

> Apply custom log URL to Spark UI
> --------------------------------
>
>                 Key: SPARK-26792
>                 URL: https://issues.apache.org/jira/browse/SPARK-26792
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> SPARK-23155 enables the SHS to set up custom log URLs for incomplete and
> completed apps.
> While getting reviews on SPARK-23155, I received two comments suggesting that
> applying custom log URLs to the UI would help. Quoting these comments here:
> https://github.com/apache/spark/pull/23260#issuecomment-456827963
> {quote}
> Sorry I haven't had time to look through all the code so this might be a
> separate jira, but one thing I thought of here is it would be really nice not
> to have specifically stderr/stdout. Users can specify any log4j.properties,
> and some tools like Oozie by default end up using the Hadoop log4j rather
> than the Spark log4j, so the files aren't necessarily the same. Also users
> can put other log files in there, so it would be nice to have links to those
> from the UI. It seems simpler if we just had a link to the directory and it
> read the files within there. Other things in Hadoop do it this way, but I'm
> not sure if that works well for other resource managers, any thoughts on
> that? As long as this doesn't prevent the above I can file a separate jira
> for it.
> {quote}
> https://github.com/apache/spark/pull/23260#issuecomment-456904716
> {quote}
> Hi Tom, +1: singling out stdout and stderr is definitely an annoyance. We
> typically configure Spark jobs to write the GC log and dump heap on OOM
> using , and/or we use the rolling file appender to deal with
> large logs during debugging. So linking the YARN container log overview
> page would make much more sense for us. We work around this with a custom
> submit process that logs all important URLs in the submit-side log.
> {quote}