[ https://issues.apache.org/jira/browse/SPARK-27594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913665#comment-16913665 ]
Owen O'Malley commented on SPARK-27594:
---------------------------------------

This is being caused by an ORC bug that was backported into the Hortonworks version of ORC.

> spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27594
>                 URL: https://issues.apache.org/jira/browse/SPARK-27594
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Jan-Willem van der Sijp
>            Priority: Major
>
> Using {{spark.sql.orc.impl=native}} and {{spark.sql.orc.enableVectorizedReader=true}} causes TIMESTAMP columns of Hive tables stored as ORC to be read incorrectly. Specifically, the milliseconds of the timestamp are doubled.
>
> Input/output of a Zeppelin session demonstrating the issue:
>
> {code:python}
> %pyspark
> from pprint import pprint
>
> spark.conf.set("spark.sql.orc.impl", "native")
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
>
> pprint(spark.sparkContext.getConf().getAll())
> --------------------
> [('sql.stacktrace', 'false'),
>  ('spark.eventLog.enabled', 'true'),
>  ('spark.app.id', 'application_1556200632329_0005'),
>  ('importImplicit', 'true'),
>  ('printREPLOutput', 'true'),
>  ('spark.history.ui.port', '18081'),
>  ('spark.driver.extraLibraryPath', '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
>  ('spark.driver.extraJavaOptions', ' -Dfile.encoding=UTF-8 -Dlog4j.configuration=file:///usr/hdp/current/zeppelin-server/conf/log4j.properties -Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark2-spark-zeppelin-sandbox-hdp.hortonworks.com.log'),
>  ('concurrentSQL', 'false'),
>  ('spark.driver.port', '40195'),
>  ('spark.executor.extraLibraryPath', '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
>  ('useHiveContext', 'true'),
>  ('spark.jars', 'file:/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
>  ('spark.history.provider', 'org.apache.spark.deploy.history.FsHistoryProvider'),
>  ('spark.yarn.historyServer.address', 'sandbox-hdp.hortonworks.com:18081'),
>  ('spark.submit.deployMode', 'client'),
>  ('spark.ui.filters', 'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
>  ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS', 'sandbox-hdp.hortonworks.com'),
>  ('spark.eventLog.dir', 'hdfs:///spark2-history/'),
>  ('spark.repl.class.uri', 'spark://sandbox-hdp.hortonworks.com:40195/classes'),
>  ('spark.driver.host', 'sandbox-hdp.hortonworks.com'),
>  ('master', 'yarn'),
>  ('spark.yarn.dist.archives', '/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr'),
>  ('spark.scheduler.mode', 'FAIR'),
>  ('spark.yarn.queue', 'default'),
>  ('spark.history.kerberos.keytab', '/etc/security/keytabs/spark.headless.keytab'),
>  ('spark.executor.id', 'driver'),
>  ('spark.history.fs.logDirectory', 'hdfs:///spark2-history/'),
>  ('spark.history.kerberos.enabled', 'false'),
>  ('spark.master', 'yarn'),
>  ('spark.sql.catalogImplementation', 'hive'),
>  ('spark.history.kerberos.principal', 'none'),
>  ('spark.driver.extraClassPath', ':/usr/hdp/current/zeppelin-server/interpreter/spark/*:/usr/hdp/current/zeppelin-server/lib/interpreter/*::/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
>  ('spark.driver.appUIAddress', 'http://sandbox-hdp.hortonworks.com:4040'),
>  ('spark.repl.class.outputDir', '/tmp/spark-555b2143-0efa-45c1-aecc-53810f89aa5f'),
>  ('spark.yarn.isPython', 'true'),
>  ('spark.app.name', 'Zeppelin'),
>  ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES', 'http://sandbox-hdp.hortonworks.com:8088/proxy/application_1556200632329_0005'),
>  ('maxResult', '1000'),
>  ('spark.executorEnv.PYTHONPATH', '/usr/hdp/current/spark2-client//python/lib/py4j-0.10.6-src.zip:/usr/hdp/current/spark2-client//python/:/usr/hdp/current/spark2-client//python:/usr/hdp/current/spark2-client//python/lib/py4j-0.8.2.1-src.zip<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.6-src.zip'),
>  ('spark.ui.proxyBase', '/proxy/application_1556200632329_0005')]
> {code}
>
> {code:python}
> %pyspark
> spark.sql("""
> DROP TABLE IF EXISTS default.hivetest
> """)
>
> spark.sql("""
> CREATE TABLE default.hivetest (
>     day DATE,
>     time TIMESTAMP,
>     timestring STRING
> )
> USING ORC
> """)
> {code}
>
> {code:python}
> %pyspark
> df1 = spark.createDataFrame(
>     [
>         ("2019-01-01", "2019-01-01 12:15:31.123", "2019-01-01 12:15:31.123")
>     ],
>     schema=("date", "timestamp", "string")
> )
> df2 = spark.createDataFrame(
>     [
>         ("2019-01-02", "2019-01-02 13:15:32.234", "2019-01-02 13:15:32.234")
>     ],
>     schema=("date", "timestamp", "string")
> )
> {code}
>
> {code:python}
> %pyspark
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
> df1.write.insertInto("default.hivetest")
>
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
> df1.write.insertInto("default.hivetest")
> {code}
>
> {code:python}
> %pyspark
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
> spark.read.table("default.hivetest").show(2, False)
> """
> +----------+-----------------------+-----------------------+
> |day       |time                   |timestring             |
> +----------+-----------------------+-----------------------+
> |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
> |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
> +----------+-----------------------+-----------------------+
> """
> {code}
>
> {code:python}
> %pyspark
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
> spark.read.table("default.hivetest").show(2, False)
> """
> +----------+-----------------------+-----------------------+
> |day       |time                   |timestring             |
> +----------+-----------------------+-----------------------+
> |2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
> |2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
> +----------+-----------------------+-----------------------+
> """
> {code}
>
> The Scala interpreter shows the same behaviour:
>
> {code:scala}
> import spark.sql
> import spark.implicits._
>
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
> sql("SELECT * FROM default.hivetest").show(2, false)
> """
> import spark.sql
> import spark.implicits._
> +----------+-----------------------+-----------------------+
> |day       |time                   |timestring             |
> +----------+-----------------------+-----------------------+
> |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
> |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
> +----------+-----------------------+-----------------------+
> """
> {code}
>
> Querying the table with Hive also produces the correct data:
>
> {code:sql}
> select * from default.hivetest;
>
> day       |time                   |timestring             |
> ----------|-----------------------|-----------------------|
> 2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
> 2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
> {code}
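Because the reporter stores the original timestamp both as a TIMESTAMP and as a STRING ({{timestring}}), the corruption can be surfaced programmatically by comparing the decoded value against the string under the native vectorized reader. A minimal sketch, assuming the {{default.hivetest}} table from the report above exists:

{code:python}
from pyspark.sql import functions as F

# The configuration the report identifies as faulty.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

# Rows where the ORC-decoded timestamp disagrees with the string that was
# written alongside it; on an affected build these are the corrupted rows.
mismatches = (
    spark.read.table("default.hivetest")
    .where(F.date_format("time", "yyyy-MM-dd HH:mm:ss.SSS") != F.col("timestring"))
)
mismatches.show(truncate=False)
{code}

On an unaffected build this query should return no rows.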
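Given the comment above that the fault lies in an ORC backport rather than in Spark's SQL layer, a practical mitigation until a fixed ORC library is on the classpath is to avoid the native vectorized read path. A hedged sketch using only the two configuration keys already shown in the report:

{code:python}
# Workaround sketch: take a non-vectorized ORC read path.
# Either disable the vectorized reader for ORC ...
spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")

# ... or switch the ORC implementation back to the Hive reader entirely.
spark.conf.set("spark.sql.orc.impl", "hive")

# Reads now take the code path the report shows returning correct
# milliseconds (12:15:31.123 rather than 12:15:31.246).
spark.read.table("default.hivetest").show(2, False)
{code}

Either setting trades read throughput for correctness; the proper fix is an ORC build without the backported bug.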