[jira] [Commented] (SPARK-25557) ORC predicate pushdown for nested fields

2020-05-12 Thread Owen O'Malley (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105640#comment-17105640
 ] 

Owen O'Malley commented on SPARK-25557:
---

Are there missing pieces on the ORC side, [~dongjoon]?
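For context, here is a hedged sketch of what the hookup looks like on the ORC
side: building a SearchArgument over a nested leaf column and attaching it to
the row reader. The file path, schema, and the dotted name "person.age" are
illustrative, and whether dotted nested names are accepted in search arguments
depends on the ORC version, which is presumably part of the question.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class NestedSargSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Illustrative file; assume a schema like
    // struct<person:struct<name:string,age:bigint>>.
    Reader reader = OrcFile.createReader(new Path("/tmp/people.orc"),
        OrcFile.readerOptions(conf));
    // Predicate on a nested leaf: person.age < 21.
    SearchArgument sarg = SearchArgumentFactory.newBuilder()
        .startAnd()
          .lessThan("person.age", PredicateLeaf.Type.LONG, 21L)
        .end()
        .build();
    Reader.Options options = reader.options()
        .searchArgument(sarg, new String[]{"person.age"});
    // With the sarg attached, ORC can skip stripes and row groups whose
    // statistics for person.age cannot satisfy the predicate.
    RecordReader rows = reader.rows(options);
    // ... iterate batches here ...
    rows.close();
  }
}
{code}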

> ORC predicate pushdown for nested fields
> 
>
> Key: SPARK-25557
> URL: https://issues.apache.org/jira/browse/SPARK-25557
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Priority: Major
>







[jira] [Commented] (SPARK-27594) spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly

2019-08-22 Thread Owen O'Malley (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913665#comment-16913665
 ] 

Owen O'Malley commented on SPARK-27594:
---

This is caused by an ORC bug that was backported into the Hortonworks version 
of ORC.

> spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be 
> read incorrectly
> 
>
> Key: SPARK-27594
> URL: https://issues.apache.org/jira/browse/SPARK-27594
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jan-Willem van der Sijp
>Priority: Major
>
> Using {{spark.sql.orc.impl=native}} and 
> {{spark.sql.orc.enableVectorizedReader=true}} causes TIMESTAMP columns in Hive 
> tables stored as ORC to be read incorrectly. Specifically, the milliseconds of 
> the timestamp are doubled.
> Input/output of a Zeppelin session to demonstrate:
> {code:python}
> %pyspark
> from pprint import pprint
> spark.conf.set("spark.sql.orc.impl", "native")
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
> pprint(spark.sparkContext.getConf().getAll())
> 
> [('sql.stacktrace', 'false'),
>  ('spark.eventLog.enabled', 'true'),
>  ('spark.app.id', 'application_1556200632329_0005'),
>  ('importImplicit', 'true'),
>  ('printREPLOutput', 'true'),
>  ('spark.history.ui.port', '18081'),
>  ('spark.driver.extraLibraryPath',
>   
> '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
>  ('spark.driver.extraJavaOptions',
>   ' -Dfile.encoding=UTF-8 '
>   
> '-Dlog4j.configuration=file:///usr/hdp/current/zeppelin-server/conf/log4j.properties
>  '
>   
> '-Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark2-spark-zeppelin-sandbox-hdp.hortonworks.com.log'),
>  ('concurrentSQL', 'false'),
>  ('spark.driver.port', '40195'),
>  ('spark.executor.extraLibraryPath',
>   
> '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
>  ('useHiveContext', 'true'),
>  ('spark.jars',
>   
> 'file:/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
>  ('spark.history.provider',
>   'org.apache.spark.deploy.history.FsHistoryProvider'),
>  ('spark.yarn.historyServer.address', 'sandbox-hdp.hortonworks.com:18081'),
>  ('spark.submit.deployMode', 'client'),
>  ('spark.ui.filters',
>   'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
>  
> ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS',
>   'sandbox-hdp.hortonworks.com'),
>  ('spark.eventLog.dir', 'hdfs:///spark2-history/'),
>  ('spark.repl.class.uri', 
> 'spark://sandbox-hdp.hortonworks.com:40195/classes'),
>  ('spark.driver.host', 'sandbox-hdp.hortonworks.com'),
>  ('master', 'yarn'),
>  ('spark.yarn.dist.archives',
>   '/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr'),
>  ('spark.scheduler.mode', 'FAIR'),
>  ('spark.yarn.queue', 'default'),
>  ('spark.history.kerberos.keytab',
>   '/etc/security/keytabs/spark.headless.keytab'),
>  ('spark.executor.id', 'driver'),
>  ('spark.history.fs.logDirectory', 'hdfs:///spark2-history/'),
>  ('spark.history.kerberos.enabled', 'false'),
>  ('spark.master', 'yarn'),
>  ('spark.sql.catalogImplementation', 'hive'),
>  ('spark.history.kerberos.principal', 'none'),
>  ('spark.driver.extraClassPath',
>   
> ':/usr/hdp/current/zeppelin-server/interpreter/spark/*:/usr/hdp/current/zeppelin-server/lib/interpreter/*::/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
>  ('spark.driver.appUIAddress', 'http://sandbox-hdp.hortonworks.com:4040'),
>  ('spark.repl.class.outputDir',
>   '/tmp/spark-555b2143-0efa-45c1-aecc-53810f89aa5f'),
>  ('spark.yarn.isPython', 'true'),
>  ('spark.app.name', 'Zeppelin'),
>  
> ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES',
>   
> 'http://sandbox-hdp.hortonworks.com:8088/proxy/application_1556200632329_0005'),
>  ('maxResult', '1000'),
>  ('spark.executorEnv.PYTHONPATH',
>   
> '/usr/hdp/current/spark2-client//python/lib/py4j-0.10.6-src.zip:/usr/hdp/current/spark2-client//python/:/usr/hdp/current/spark2-client//python:/usr/hdp/current/spark2-client//python/lib/py4j-0.8.2.1-src.zip{{PWD}}/pyspark.zip{{PWD}}/py4j-0.10.6-src.zip'),
>  ('spark.ui.proxyBase', '/proxy/application_1556200632329_0005')]
> {code}
> {code:python}
> %pyspark
> spark.sql("""
> DROP TABLE IF EXISTS default.hivetest
> """)
> spark.sql("""
> CREATE TABLE default.hivetest (
> day DATE,
> time TIMESTAMP,
> timestring STRING
> )
> USING ORC
> """)
> {code}
> {code:python}
> %pyspark
> df1 = spark.createDataFrame(
> [
> ("2019-01-01", "2019-01-01 

[jira] [Created] (SPARK-28208) When upgrading to ORC 1.5.6, the reader needs to be closed.

2019-06-28 Thread Owen O'Malley (JIRA)
Owen O'Malley created SPARK-28208:
-

 Summary: When upgrading to ORC 1.5.6, the reader needs to be 
closed.
 Key: SPARK-28208
 URL: https://issues.apache.org/jira/browse/SPARK-28208
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Owen O'Malley


As part of the ORC 1.5.6 release, we optimized the common pattern of:
{code:java}
Reader reader = OrcFile.createReader(...);
RecordReader rows = reader.rows(...);{code}

which used to open one file handle in the Reader and a second one in the 
RecordReader. Users saw this as a regression when moving from the old Spark ORC 
reader, which went via Hive, to the new native reader, because it opened twice 
as many files on the NameNode.

In ORC 1.5.6, we changed the ORC library so that it keeps the file handle in 
the Reader until it is either closed or a RecordReader is created from it. This 
has cut the number of file open requests on the NameNode by half in typical 
Spark applications. (Hive's ORC code avoided this problem by putting the file 
footer into the input splits, but that has other problems.)

To get the new optimization without leaking file handles, Spark needs to close 
the readers that aren't used to create RecordReaders.
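
A hedged sketch of the two patterns, assuming ORC 1.5.6+ where the Reader holds 
the open file handle until it is closed or a RecordReader takes it over (the 
path is illustrative):

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class OrcReaderCloseSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path("/tmp/example.orc");   // illustrative path

    // Case 1: the Reader is only used for its footer (schema, statistics).
    // Since 1.5.6 it holds the open file handle, so it must be closed
    // explicitly or the handle is leaked.
    Reader footerOnly = OrcFile.createReader(path, OrcFile.readerOptions(conf));
    System.out.println(footerOnly.getSchema());
    footerOnly.close();

    // Case 2: a RecordReader is created; it takes over the file handle,
    // so closing the RecordReader is what releases it.
    Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(conf));
    RecordReader rows = reader.rows();
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();
    while (rows.nextBatch(batch)) {
      // process batch.size rows
    }
    rows.close();
  }
}
{code}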






[jira] [Created] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2019-05-31 Thread Owen O'Malley (JIRA)
Owen O'Malley created SPARK-27913:
-

 Summary: Spark SQL's native ORC reader implements its own schema 
evolution
 Key: SPARK-27913
 URL: https://issues.apache.org/jira/browse/SPARK-27913
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.3
Reporter: Owen O'Malley


ORC's reader handles a wide range of schema evolution, but the Spark SQL native 
ORC bindings do not provide the desired schema to the ORC reader. This causes a 
regression when moving spark.sql.orc.impl from 'hive' to 'native'.
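
For illustration, a hedged sketch of what handing the desired read schema to 
ORC looks like, so ORC's own schema evolution does the column mapping (the file 
path and schemas are illustrative):

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

public class OrcSchemaEvolutionSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("/tmp/example.orc"),
        OrcFile.readerOptions(conf));
    // The schema the query wants, e.g. with a column added after the file was
    // written; ORC maps the file schema onto it and fills missing columns
    // with nulls.
    TypeDescription readSchema =
        TypeDescription.fromString("struct<id:bigint,name:string,added_col:int>");
    Reader.Options options = reader.options().schema(readSchema);
    RecordReader rows = reader.rows(options);
    VectorizedRowBatch batch = readSchema.createRowBatch();
    while (rows.nextBatch(batch)) {
      // columns arrive already mapped to readSchema
    }
    rows.close();
  }
}
{code}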






[jira] [Updated] (SPARK-20202) Remove references to org.spark-project.hive

2017-04-03 Thread Owen O'Malley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated SPARK-20202:
--
Priority: Blocker  (was: Critical)

It is against Apache policy to release binaries that aren't part of your 
project.

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>Priority: Blocker
>
> Spark can't continue to depend on its fork of Hive and must move to 
> standard Hive versions.






[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2017-04-03 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953315#comment-15953315
 ] 

Owen O'Malley commented on SPARK-20202:
---

I should also say here that the Hive community is willing to help. We are in 
the process of rolling releases, so if Spark needs a change, we can work 
together to get this done.

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>Priority: Critical
>
> Spark can't continue to depend on its fork of Hive and must move to 
> standard Hive versions.






[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2017-04-03 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953298#comment-15953298
 ] 

Owen O'Malley commented on SPARK-20202:
---

As an Apache member, the Spark project can't release binary artifacts that 
aren't made from its Apache code base. So either the Spark project needs to 
use Hive's release artifacts, or it needs to formally fork Hive, move the fork 
into its git repository at Apache, and rename it away from org.apache.hive to 
org.apache.spark. The current path is not allowed.

Hive is in the middle of rolling releases and thus this is a good time to make 
requests. The old uber jar (hive-exec) is already released separately with the 
classifier "core." It looks like we are using the same protobuf (2.5.0) and 
kryo (3.0.3) versions.
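
For concreteness, a hedged sketch of depending on that separately released 
artifact instead of the org.spark-project.hive fork; the version below is 
illustrative, not a recommendation:

{code:xml}
<!-- Illustrative only: depend on Apache Hive's hive-exec artifact with the
     "core" classifier rather than the org.spark-project.hive fork. Use
     whatever Hive version Spark actually targets. -->
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>2.3.0</version>
  <classifier>core</classifier>
</dependency>
{code}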

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>Priority: Critical
>
> Spark can't continue to depend on its fork of Hive and must move to 
> standard Hive versions.






[jira] [Comment Edited] (SPARK-20202) Remove references to org.spark-project.hive

2017-04-03 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953298#comment-15953298
 ] 

Owen O'Malley edited comment on SPARK-20202 at 4/3/17 11:16 AM:


As an Apache member, the Spark project can't release binary artifacts that 
aren't made from its Apache code base. So either the Spark project needs to 
use Hive's release artifacts, or it needs to formally fork Hive, move the fork 
into its git repository at Apache, and rename it away from org.apache.hive 
to org.apache.spark. The current path is not allowed.

Hive is in the middle of rolling releases and thus this is a good time to make 
requests. The old uber jar (hive-exec) is already released separately with the 
classifier "core." It looks like we are using the same protobuf (2.5.0) and 
kryo (3.0.3) versions.


was (Author: owen.omalley):
As an Apache member, the Spark project can't release binary artifacts that 
aren't made from its Apache code base. So either, the Spark project needs to 
use Hive's release artifacts or it formally fork Hive and move the fork into 
its git repository at Apache and rename it away from org.apache.hive to 
org.apache.spark. The current path is not allowed.

Hive is in the middle of rolling releases and thus this is a good time to make 
requests. The old uber jar (hive-exec) is already released separately with the 
classifier "core." It looks like we are using the same protobuf (2.5.0) and 
kryo (3.0.3) versions.

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>Priority: Critical
>
> Spark can't continue to depend on its fork of Hive and must move to 
> standard Hive versions.






[jira] [Created] (SPARK-20202) Remove references to org.spark-project.hive

2017-04-03 Thread Owen O'Malley (JIRA)
Owen O'Malley created SPARK-20202:
-

 Summary: Remove references to org.spark-project.hive
 Key: SPARK-20202
 URL: https://issues.apache.org/jira/browse/SPARK-20202
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Affects Versions: 1.6.4, 2.0.3, 2.1.1
Reporter: Owen O'Malley
Priority: Blocker
 Fix For: 1.6.4, 2.0.3, 2.1.1


Spark can't continue to depend on its fork of Hive and must move to standard 
Hive versions.






[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe

2017-03-08 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901769#comment-15901769
 ] 

Owen O'Malley commented on SPARK-15474:
---

Ok, Hive's use is fine because it gets the schema from the metastore; the file 
schema only matters for schema evolution, which isn't relevant if there are no 
rows.

In fact, it gets worse in newer versions of Hive, where OrcOutputFormat will 
write 0-byte files and OrcInputFormat will ignore 0-byte files when reading. 
(The reason behind needing the files at all is an interesting bit of Hive 
history, but not relevant here.)

The real fix is that Spark needs to use the OrcFile.createWriter(...) API to 
write the files rather than Hive's OrcOutputFormat. The OrcFile API lets the 
caller set the schema directly.
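
A hedged sketch of that approach (path and schema are illustrative): the schema 
is set on the writer up front, so even a file with zero rows carries it in the 
footer and can be read back.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class EmptyOrcFileSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    TypeDescription schema = TypeDescription.fromString("struct<id:bigint>");
    // The schema is set directly on the writer, not inferred from rows.
    Writer writer = OrcFile.createWriter(new Path("/tmp/empty.orc"),
        OrcFile.writerOptions(conf).setSchema(schema));
    // No rows are added; the footer still records the schema, so a subsequent
    // read can infer it even though the file contains no data.
    writer.close();
  }
}
{code}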

>  ORC data source fails to write and read back empty dataframe
> -
>
> Key: SPARK-15474
> URL: https://issues.apache.org/jira/browse/SPARK-15474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently the ORC data source fails to write and read back empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case with the data below
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, no writer is initialised or created (no calls to 
> {{WriterContainer.writeRows()}}).
> For Parquet and JSON this works, but for ORC it does not.






[jira] [Commented] (SPARK-1693) Dependent on multiple versions of servlet-api jars lead to throw an SecurityException when Spark built for hadoop 2.3.0 , 2.4.0

2016-01-08 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089936#comment-15089936
 ] 

Owen O'Malley commented on SPARK-1693:
--

Can you explain what the problem is and how to fix it? We are hitting the same 
problem in the Hive-on-Spark work.

> Dependent on multiple versions of servlet-api jars lead to throw an 
> SecurityException when Spark built for hadoop 2.3.0 , 2.4.0 
> 
>
> Key: SPARK-1693
> URL: https://issues.apache.org/jira/browse/SPARK-1693
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Blocker
> Fix For: 1.0.0
>
> Attachments: log.txt
>
>
> {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 > 
> log.txt{code}
> The log: 
> {code}
> UnpersistSuite:
> - unpersist RDD *** FAILED ***
>   java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s 
> signer information does not match signer information of other classes in the 
> same package
>   at java.lang.ClassLoader.checkCerts(ClassLoader.java:952)
>   at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:794)
>   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> {code}


