[ https://issues.apache.org/jira/browse/SPARK-27594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913665#comment-16913665 ]
Owen O'Malley commented on SPARK-27594:
---------------------------------------

This is being caused by an ORC bug that was backported into the Hortonworks version of ORC.

> spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27594
>                 URL: https://issues.apache.org/jira/browse/SPARK-27594
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Jan-Willem van der Sijp
>            Priority: Major
>
> Using {{spark.sql.orc.impl=native}} and {{spark.sql.orc.enableVectorizedReader=true}} causes TIMESTAMP columns of Hive tables stored as ORC to be read incorrectly. Specifically, the milliseconds of the timestamp are doubled.
>
> Input/output of a Zeppelin session demonstrating the issue:
>
> {code:python}
> %pyspark
> from pprint import pprint
>
> spark.conf.set("spark.sql.orc.impl", "native")
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
>
> pprint(spark.sparkContext.getConf().getAll())
> --------------------
> [('sql.stacktrace', 'false'),
>  ('spark.eventLog.enabled', 'true'),
>  ('spark.app.id', 'application_1556200632329_0005'),
>  ('importImplicit', 'true'),
>  ('printREPLOutput', 'true'),
>  ('spark.history.ui.port', '18081'),
>  ('spark.driver.extraLibraryPath', '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
>  ('spark.driver.extraJavaOptions', ' -Dfile.encoding=UTF-8 -Dlog4j.configuration=file:///usr/hdp/current/zeppelin-server/conf/log4j.properties -Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark2-spark-zeppelin-sandbox-hdp.hortonworks.com.log'),
>  ('concurrentSQL', 'false'),
>  ('spark.driver.port', '40195'),
>  ('spark.executor.extraLibraryPath', '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
>  ('useHiveContext', 'true'),
>  ('spark.jars', 'file:/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
>  ('spark.history.provider', 'org.apache.spark.deploy.history.FsHistoryProvider'),
>  ('spark.yarn.historyServer.address', 'sandbox-hdp.hortonworks.com:18081'),
>  ('spark.submit.deployMode', 'client'),
>  ('spark.ui.filters', 'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
>  ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS', 'sandbox-hdp.hortonworks.com'),
>  ('spark.eventLog.dir', 'hdfs:///spark2-history/'),
>  ('spark.repl.class.uri', 'spark://sandbox-hdp.hortonworks.com:40195/classes'),
>  ('spark.driver.host', 'sandbox-hdp.hortonworks.com'),
>  ('master', 'yarn'),
>  ('spark.yarn.dist.archives', '/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr'),
>  ('spark.scheduler.mode', 'FAIR'),
>  ('spark.yarn.queue', 'default'),
>  ('spark.history.kerberos.keytab', '/etc/security/keytabs/spark.headless.keytab'),
>  ('spark.executor.id', 'driver'),
>  ('spark.history.fs.logDirectory', 'hdfs:///spark2-history/'),
>  ('spark.history.kerberos.enabled', 'false'),
>  ('spark.master', 'yarn'),
>  ('spark.sql.catalogImplementation', 'hive'),
>  ('spark.history.kerberos.principal', 'none'),
>  ('spark.driver.extraClassPath', ':/usr/hdp/current/zeppelin-server/interpreter/spark/*:/usr/hdp/current/zeppelin-server/lib/interpreter/*::/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
>  ('spark.driver.appUIAddress', 'http://sandbox-hdp.hortonworks.com:4040'),
>  ('spark.repl.class.outputDir', '/tmp/spark-555b2143-0efa-45c1-aecc-53810f89aa5f'),
>  ('spark.yarn.isPython', 'true'),
>  ('spark.app.name', 'Zeppelin'),
>  ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES', 'http://sandbox-hdp.hortonworks.com:8088/proxy/application_1556200632329_0005'),
>  ('maxResult', '1000'),
>  ('spark.executorEnv.PYTHONPATH', '/usr/hdp/current/spark2-client//python/lib/py4j-0.10.6-src.zip:/usr/hdp/current/spark2-client//python/:/usr/hdp/current/spark2-client//python:/usr/hdp/current/spark2-client//python/lib/py4j-0.8.2.1-src.zip<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.6-src.zip'),
>  ('spark.ui.proxyBase', '/proxy/application_1556200632329_0005')]
> {code}
>
> {code:python}
> %pyspark
> spark.sql("""
> DROP TABLE IF EXISTS default.hivetest
> """)
>
> spark.sql("""
> CREATE TABLE default.hivetest (
>     day DATE,
>     time TIMESTAMP,
>     timestring STRING
> )
> USING ORC
> """)
> {code}
>
> {code:python}
> %pyspark
> df1 = spark.createDataFrame(
>     [
>         ("2019-01-01", "2019-01-01 12:15:31.123", "2019-01-01 12:15:31.123")
>     ],
>     schema=("date", "timestamp", "string")
> )
> df2 = spark.createDataFrame(
>     [
>         ("2019-01-02", "2019-01-02 13:15:32.234", "2019-01-02 13:15:32.234")
>     ],
>     schema=("date", "timestamp", "string")
> )
> {code}
>
> {code:python}
> %pyspark
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
> df1.write.insertInto("default.hivetest")
>
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
> df1.write.insertInto("default.hivetest")
> {code}
>
> {code:python}
> %pyspark
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
> spark.read.table("default.hivetest").show(2, False)
> """
> +----------+-----------------------+-----------------------+
> |day       |time                   |timestring             |
> +----------+-----------------------+-----------------------+
> |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
> |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
> +----------+-----------------------+-----------------------+
> """
> {code}
>
> {code:python}
> %pyspark
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
> spark.read.table("default.hivetest").show(2, False)
> """
> +----------+-----------------------+-----------------------+
> |day       |time                   |timestring             |
> +----------+-----------------------+-----------------------+
> |2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
> |2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
> +----------+-----------------------+-----------------------+
> """
> {code}
>
> The Scala interpreter shows the same behaviour:
>
> {code:scala}
> import spark.sql
> import spark.implicits._
>
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
> sql("SELECT * FROM default.hivetest").show(2, false)
> """
> import spark.sql
> import spark.implicits._
> +----------+-----------------------+-----------------------+
> |day       |time                   |timestring             |
> +----------+-----------------------+-----------------------+
> |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
> |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
> +----------+-----------------------+-----------------------+
> """
> {code}
>
> Querying the table with Hive also produces the correct data:
>
> {code:sql}
> select * from default.hivetest;
>
> day       |time                   |timestring             |
> ----------|-----------------------|-----------------------|
> 2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
> 2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
> {code}
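Because the reporter stores the original timestamp both as a TIMESTAMP and as a STRING ({{timestring}}), the corruption can be surfaced programmatically by comparing the decoded value against the string under the native vectorized reader. A minimal sketch, assuming the {{default.hivetest}} table from the report above exists:

{code:python}
from pyspark.sql import functions as F

# The configuration the report identifies as faulty.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

# Rows where the ORC-decoded timestamp disagrees with the string that was
# written alongside it; on an affected build these are the corrupted rows.
mismatches = (
    spark.read.table("default.hivetest")
    .where(F.date_format("time", "yyyy-MM-dd HH:mm:ss.SSS") != F.col("timestring"))
)
mismatches.show(truncate=False)
{code}

On an unaffected build this query should return no rows.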
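Given the comment above that the fault lies in an ORC backport rather than in Spark's SQL layer, a practical mitigation until a fixed ORC library is on the classpath is to avoid the native vectorized read path. A hedged sketch using only the two configuration keys already shown in the report:

{code:python}
# Workaround sketch: take a non-vectorized ORC read path.
# Either disable the vectorized reader for ORC ...
spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")

# ... or switch the ORC implementation back to the Hive reader entirely.
spark.conf.set("spark.sql.orc.impl", "hive")

# Reads now take the code path the report shows returning correct
# milliseconds (12:15:31.123 rather than 12:15:31.246).
spark.read.table("default.hivetest").show(2, False)
{code}

Either setting trades read throughput for correctness; the proper fix is an ORC build without the backported bug.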