[jira] [Commented] (SPARK-25557) ORC predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-25557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105640#comment-17105640 ]

Owen O'Malley commented on SPARK-25557:
---------------------------------------

Are there missing pieces on the ORC side, [~dongjoon]?

> ORC predicate pushdown for nested fields
> ----------------------------------------
>
>                 Key: SPARK-25557
>                 URL: https://issues.apache.org/jira/browse/SPARK-25557
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: DB Tsai
>            Priority: Major
>

--
This message was sent by Atlassian Jira (v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
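To make the feature concrete, here is a toy Python sketch (not Spark or ORC code; all names are hypothetical) of what predicate pushdown on a nested field buys: a reader that understands the predicate can skip whole row groups whose column statistics for the nested column rule out any match, without materializing the rows.

```python
# Toy illustration of predicate pushdown on a nested field.
# Each "row group" carries min/max statistics per (dotted) column path;
# the scan skips groups whose statistics cannot satisfy the predicate.

def get_nested(row, dotted_path):
    """Resolve a dotted path like 'person.age' against nested dicts."""
    value = row
    for part in dotted_path.split("."):
        value = value[part]
    return value

def scan(row_groups, path, lower_bound):
    """Yield rows where `path` > lower_bound, skipping whole groups by stats."""
    for stats, rows in row_groups:
        if stats[path]["max"] <= lower_bound:
            continue  # pushdown: skip the group without reading any row
        yield from (r for r in rows if get_nested(r, path) > lower_bound)

row_groups = [
    # (column statistics, rows) per group
    ({"person.age": {"max": 17}}, [{"person": {"age": 12}}, {"person": {"age": 17}}]),
    ({"person.age": {"max": 45}}, [{"person": {"age": 30}}, {"person": {"age": 9}}]),
]
matches = list(scan(row_groups, "person.age", 17))
```

The first group is eliminated from its statistics alone; only the second group's rows are actually filtered.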
[jira] [Commented] (SPARK-27594) spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly
[ https://issues.apache.org/jira/browse/SPARK-27594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913665#comment-16913665 ]

Owen O'Malley commented on SPARK-27594:
---------------------------------------

This is being caused by an ORC bug that was backported into Hortonworks' version of ORC.

> spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27594
>                 URL: https://issues.apache.org/jira/browse/SPARK-27594
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Jan-Willem van der Sijp
>            Priority: Major
>
> Using {{spark.sql.orc.impl=native}} and {{spark.sql.orc.enableVectorizedReader=true}} causes reading of TIMESTAMP columns in Hive stored as ORC to be interpreted incorrectly. Specifically, the milliseconds of the timestamp will be doubled.
> Input/output of a Zeppelin session to demonstrate:
> {code:python}
> %pyspark
> from pprint import pprint
> spark.conf.set("spark.sql.orc.impl", "native")
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
> pprint(spark.sparkContext.getConf().getAll())
>
> [('sql.stacktrace', 'false'),
>  ('spark.eventLog.enabled', 'true'),
>  ('spark.app.id', 'application_1556200632329_0005'),
>  ('importImplicit', 'true'),
>  ('printREPLOutput', 'true'),
>  ('spark.history.ui.port', '18081'),
>  ('spark.driver.extraLibraryPath',
>   '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
>  ('spark.driver.extraJavaOptions',
>   ' -Dfile.encoding=UTF-8 '
>   '-Dlog4j.configuration=file:///usr/hdp/current/zeppelin-server/conf/log4j.properties '
>   '-Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark2-spark-zeppelin-sandbox-hdp.hortonworks.com.log'),
>  ('concurrentSQL', 'false'),
>  ('spark.driver.port', '40195'),
>  ('spark.executor.extraLibraryPath',
>   '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
>  ('useHiveContext', 'true'),
>  ('spark.jars',
>   'file:/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
>  ('spark.history.provider', 'org.apache.spark.deploy.history.FsHistoryProvider'),
>  ('spark.yarn.historyServer.address', 'sandbox-hdp.hortonworks.com:18081'),
>  ('spark.submit.deployMode', 'client'),
>  ('spark.ui.filters', 'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
>  ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS',
>   'sandbox-hdp.hortonworks.com'),
>  ('spark.eventLog.dir', 'hdfs:///spark2-history/'),
>  ('spark.repl.class.uri', 'spark://sandbox-hdp.hortonworks.com:40195/classes'),
>  ('spark.driver.host', 'sandbox-hdp.hortonworks.com'),
>  ('master', 'yarn'),
>  ('spark.yarn.dist.archives', '/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr'),
>  ('spark.scheduler.mode', 'FAIR'),
>  ('spark.yarn.queue', 'default'),
>  ('spark.history.kerberos.keytab', '/etc/security/keytabs/spark.headless.keytab'),
>  ('spark.executor.id', 'driver'),
>  ('spark.history.fs.logDirectory', 'hdfs:///spark2-history/'),
>  ('spark.history.kerberos.enabled', 'false'),
>  ('spark.master', 'yarn'),
>  ('spark.sql.catalogImplementation', 'hive'),
>  ('spark.history.kerberos.principal', 'none'),
>  ('spark.driver.extraClassPath',
>   ':/usr/hdp/current/zeppelin-server/interpreter/spark/*:/usr/hdp/current/zeppelin-server/lib/interpreter/*::/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
>  ('spark.driver.appUIAddress', 'http://sandbox-hdp.hortonworks.com:4040'),
>  ('spark.repl.class.outputDir', '/tmp/spark-555b2143-0efa-45c1-aecc-53810f89aa5f'),
>  ('spark.yarn.isPython', 'true'),
>  ('spark.app.name', 'Zeppelin'),
>  ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES',
>   'http://sandbox-hdp.hortonworks.com:8088/proxy/application_1556200632329_0005'),
>  ('maxResult', '1000'),
>  ('spark.executorEnv.PYTHONPATH',
>   '/usr/hdp/current/spark2-client//python/lib/py4j-0.10.6-src.zip:/usr/hdp/current/spark2-client//python/:/usr/hdp/current/spark2-client//python:/usr/hdp/current/spark2-client//python/lib/py4j-0.8.2.1-src.zip{{PWD}}/pyspark.zip{{PWD}}/py4j-0.10.6-src.zip'),
>  ('spark.ui.proxyBase', '/proxy/application_1556200632329_0005')]
> {code}
> {code:python}
> %pyspark
> spark.sql("""
> DROP TABLE IF EXISTS default.hivetest
> """)
> spark.sql("""
> CREATE TABLE default.hivetest (
>     day DATE,
>     time TIMESTAMP,
>     timestring STRING
> )
> USING ORC
> """)
> {code}
> {code:python}
> %pyspark
> df1 = spark.createDataFrame(
>     [
>         ("2019-01-01", "2019-01-01
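The "milliseconds doubled" symptom above fits a well-known class of timestamp bug. ORC stores a timestamp as whole seconds plus a separate nanosecond field; if the seconds value was derived from a millisecond-precision epoch value and the nanos are then added on top, the fractional second is counted twice. A toy Python sketch (hypothetical function names, not the ORC code) of that failure mode:

```python
# Toy illustration of double-counting the fractional second.

def to_epoch_millis_correct(seconds, nanos):
    # `seconds` holds only whole seconds; `nanos` holds the fraction once
    return seconds * 1000 + nanos // 1_000_000

def to_epoch_millis_buggy(millis_source, nanos):
    # the "seconds" value already contained the milliseconds, and the
    # fractional part is then added again from the nanos field
    return millis_source + nanos // 1_000_000

# 12:34:56.789 -> 45296 whole seconds, 789 ms of fraction
correct = to_epoch_millis_correct(45296, 789_000_000)  # 45296789 ms
buggy = to_epoch_millis_buggy(45296789, 789_000_000)   # 45297578 ms: .789 became 1.578
```

The buggy path differs from the correct one by exactly the milliseconds of the fraction, which is what "milliseconds doubled" looks like to a reader.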
[jira] [Created] (SPARK-28208) When upgrading to ORC 1.5.6, the reader needs to be closed.
Owen O'Malley created SPARK-28208:
-------------------------------------

             Summary: When upgrading to ORC 1.5.6, the reader needs to be closed.
                 Key: SPARK-28208
                 URL: https://issues.apache.org/jira/browse/SPARK-28208
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Owen O'Malley


As part of the ORC 1.5.6 release, we optimized the common pattern of:
{code:java}
Reader reader = OrcFile.createReader(...);
RecordReader rows = reader.rows(...);
{code}
which used to open one file handle in the Reader and a second one in the RecordReader. Users saw this as a regression when moving from the old Spark ORC reader (via Hive) to the new native reader, because it opened twice as many files on the NameNode.

In ORC 1.5.6, we changed the ORC library so that it keeps the file handle in the Reader until it is either closed or a RecordReader is created from it. This has cut the number of file-open requests on the NameNode in half in typical Spark applications. (Hive's ORC code avoided this problem by putting the file footer into the input splits, but that has other problems.)

To get the new optimization without leaking file handles, Spark needs to close the readers that aren't used to create RecordReaders.
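The ownership rule described above can be sketched in a few lines of toy Python (hypothetical class names, not the ORC Java API): the Reader owns a file handle until either it is closed or a RecordReader is created from it, at which point the RecordReader takes the handle over. A caller that never creates a RecordReader must therefore close the Reader, or the handle leaks.

```python
# Toy model of the ORC 1.5.6 file-handle ownership rule.

OPEN_HANDLES = 0  # count of live handles, to make leaks visible

class FileHandle:
    def __init__(self):
        global OPEN_HANDLES
        OPEN_HANDLES += 1
    def close(self):
        global OPEN_HANDLES
        OPEN_HANDLES -= 1

class RecordReader:
    def __init__(self, handle):
        self.handle = handle          # ownership transferred from the Reader
    def close(self):
        self.handle.close()

class Reader:
    def __init__(self):
        self.handle = FileHandle()    # opened eagerly, e.g. to read the footer
    def rows(self):
        handle, self.handle = self.handle, None  # hand the handle over
        return RecordReader(handle)
    def close(self):
        if self.handle is not None:   # only if no RecordReader took it
            self.handle.close()
            self.handle = None

# Footer-only use (e.g. schema inference): the Reader must be closed explicitly.
r = Reader()
r.close()

# Scan use: one handle total; closing the RecordReader releases it.
r = Reader()
rr = r.rows()
r.close()   # no-op: the RecordReader owns the handle now
rr.close()
```

With this rule a scan costs one open instead of two, but the footer-only path is exactly the one Spark was missing a close on.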
[jira] [Created] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution
Owen O'Malley created SPARK-27913:
-------------------------------------

             Summary: Spark SQL's native ORC reader implements its own schema evolution
                 Key: SPARK-27913
                 URL: https://issues.apache.org/jira/browse/SPARK-27913
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.3
            Reporter: Owen O'Malley


ORC's reader handles a wide range of schema evolution, but the Spark SQL native ORC bindings do not provide the desired schema to the ORC reader. This causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'.
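To illustrate what "providing the desired schema to the reader" means, here is a toy Python sketch (hypothetical names, not the ORC API): the caller hands the reader the schema it wants to read, and the reader maps file columns onto it by name, filling columns that were added after the file was written with nulls.

```python
# Toy model of reader-side schema evolution by column name.

def read_with_schema(file_schema, rows, read_schema):
    """file_schema/read_schema are ordered column-name lists; rows are tuples.

    Returns rows reshaped to read_schema, with None for columns the file
    does not contain (i.e. columns added to the table after this file).
    """
    index = {name: i for i, name in enumerate(file_schema)}
    out = []
    for row in rows:
        out.append(tuple(
            row[index[col]] if col in index else None  # added column -> null
            for col in read_schema
        ))
    return out

# A file written with (id, name); the table has since evolved to (id, name, email).
evolved = read_with_schema(
    ["id", "name"],
    [(1, "ada"), (2, "grace")],
    ["id", "name", "email"],
)
```

If the bindings never pass the desired schema down, the reader can only return the file's own shape, and every consumer has to re-implement this mapping itself.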
[jira] [Updated] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated SPARK-20202:
----------------------------------
    Priority: Blocker  (was: Critical)

It is against Apache policy to release binaries that aren't part of your project.

> Remove references to org.spark-project.hive
> -------------------------------------------
>
>                 Key: SPARK-20202
>                 URL: https://issues.apache.org/jira/browse/SPARK-20202
>             Project: Spark
>          Issue Type: Bug
>          Components: Build, SQL
>    Affects Versions: 1.6.4, 2.0.3, 2.1.1
>            Reporter: Owen O'Malley
>            Priority: Blocker
>
> Spark can't continue to depend on its fork of Hive and must move to standard Hive versions.
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953315#comment-15953315 ]

Owen O'Malley commented on SPARK-20202:
---------------------------------------

I should also say here that the Hive community is willing to help. We are in the process of rolling releases, so if Spark needs a change, we can work together to get this done.

> Remove references to org.spark-project.hive
> -------------------------------------------
>
>                 Key: SPARK-20202
>                 URL: https://issues.apache.org/jira/browse/SPARK-20202
>             Project: Spark
>          Issue Type: Bug
>          Components: Build, SQL
>    Affects Versions: 1.6.4, 2.0.3, 2.1.1
>            Reporter: Owen O'Malley
>            Priority: Critical
>
> Spark can't continue to depend on its fork of Hive and must move to standard Hive versions.
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953298#comment-15953298 ]

Owen O'Malley commented on SPARK-20202:
---------------------------------------

As an Apache member, the Spark project can't release binary artifacts that aren't made from its Apache code base. So either the Spark project needs to use Hive's release artifacts, or it needs to formally fork Hive, move the fork into its git repository at Apache, and rename it away from org.apache.hive to org.apache.spark. The current path is not allowed.

Hive is in the middle of rolling releases and thus this is a good time to make requests. The old uber jar (hive-exec) is already released separately with the classifier "core." It looks like we are using the same protobuf (2.5.0) and kryo (3.0.3) versions.

> Remove references to org.spark-project.hive
> -------------------------------------------
>
>                 Key: SPARK-20202
>                 URL: https://issues.apache.org/jira/browse/SPARK-20202
>             Project: Spark
>          Issue Type: Bug
>          Components: Build, SQL
>    Affects Versions: 1.6.4, 2.0.3, 2.1.1
>            Reporter: Owen O'Malley
>            Priority: Critical
>
> Spark can't continue to depend on its fork of Hive and must move to standard Hive versions.
[jira] [Comment Edited] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953298#comment-15953298 ]

Owen O'Malley edited comment on SPARK-20202 at 4/3/17 11:16 AM:
----------------------------------------------------------------

As an Apache member, the Spark project can't release binary artifacts that aren't made from its Apache code base. So either the Spark project needs to use Hive's release artifacts, or it needs to formally fork Hive, move the fork into its git repository at Apache, and rename it away from org.apache.hive to org.apache.spark. The current path is not allowed.

Hive is in the middle of rolling releases and thus this is a good time to make requests. The old uber jar (hive-exec) is already released separately with the classifier "core." It looks like we are using the same protobuf (2.5.0) and kryo (3.0.3) versions.

was (Author: owen.omalley):
As an Apache member, the Spark project can't release binary artifacts that aren't made from its Apache code base. So either, the Spark project needs to use Hive's release artifacts or it formally fork Hive and move the fork into its git repository at Apache and rename it away from org.apache.hive to org.apache.spark. The current path is not allowed.

Hive is in the middle of rolling releases and thus this is a good time to make requests. The old uber jar (hive-exec) is already released separately with the classifier "core." It looks like we are using the same protobuf (2.5.0) and kryo (3.0.3) versions.

> Remove references to org.spark-project.hive
> -------------------------------------------
>
>                 Key: SPARK-20202
>                 URL: https://issues.apache.org/jira/browse/SPARK-20202
>             Project: Spark
>          Issue Type: Bug
>          Components: Build, SQL
>    Affects Versions: 1.6.4, 2.0.3, 2.1.1
>            Reporter: Owen O'Malley
>            Priority: Critical
>
> Spark can't continue to depend on its fork of Hive and must move to standard Hive versions.
[jira] [Created] (SPARK-20202) Remove references to org.spark-project.hive
Owen O'Malley created SPARK-20202:
-------------------------------------

             Summary: Remove references to org.spark-project.hive
                 Key: SPARK-20202
                 URL: https://issues.apache.org/jira/browse/SPARK-20202
             Project: Spark
          Issue Type: Bug
          Components: Build, SQL
    Affects Versions: 1.6.4, 2.0.3, 2.1.1
            Reporter: Owen O'Malley
            Priority: Blocker
             Fix For: 1.6.4, 2.0.3, 2.1.1


Spark can't continue to depend on its fork of Hive and must move to standard Hive versions.
[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe
[ https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15901769#comment-15901769 ]

Owen O'Malley commented on SPARK-15474:
---------------------------------------

Ok, Hive's use is fine because it gets the schema from the metastore, and the file schema only matters for schema evolution, which isn't relevant if there are no rows. In fact, it gets worse in newer versions of Hive, where the OrcOutputFormat will write 0 byte files and the OrcInputFormat will ignore 0 byte files when reading. (The reason behind needing the files at all is an interesting bit of Hive history, but not relevant here.)

The real fix is that Spark needs to use the OrcFile.createWriter(...) API to write the files rather than Hive's OrcOutputFormat. The OrcFile API lets the caller set the schema directly.

> ORC data source fails to write and read back empty dataframe
> -------------------------------------------------------------
>
>                 Key: SPARK-15474
>                 URL: https://issues.apache.org/jira/browse/SPARK-15474
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> Currently the ORC data source fails to write and read empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Unable to infer schema for ORC at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da. It must be specified manually;
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case from the data below:
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In that case, no writer is initialised or created (there are no calls of {{WriterContainer.writeRows()}}).
> For Parquet and JSON this works, but for ORC it does not.
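The point of the suggested fix — letting the writer set the schema directly — is that a zero-row file still carries its schema, so reading it back never requires inferring a schema from data. A toy Python sketch (hypothetical format, not ORC) of that property:

```python
# Toy model: a file format whose writer records the schema in the file
# itself, so an empty file remains fully readable.

import io
import json

def write_file(buf, schema, rows):
    # the schema travels with the file even when rows is empty
    json.dump({"schema": schema, "rows": rows}, buf)

def read_file(buf):
    data = json.load(buf)
    return data["schema"], data["rows"]

buf = io.StringIO()
write_file(buf, ["id", "name"], [])   # empty dataframe, schema still written
buf.seek(0)
schema, rows = read_file(buf)
```

A writer that only learns the schema from the first row it sees (the failure mode above) has nothing to record for an empty dataset, which is why "Unable to infer schema" surfaces on read.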
[jira] [Commented] (SPARK-1693) Dependent on multiple versions of servlet-api jars lead to throw an SecurityException when Spark built for hadoop 2.3.0 , 2.4.0
[ https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089936#comment-15089936 ]

Owen O'Malley commented on SPARK-1693:
--------------------------------------

Can you explain what the problem is and how to fix it? We are hitting the same problem in the Hive-on-Spark work.

> Dependent on multiple versions of servlet-api jars lead to throw an SecurityException when Spark built for hadoop 2.3.0 , 2.4.0
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-1693
>                 URL: https://issues.apache.org/jira/browse/SPARK-1693
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Guoqiang Li
>            Assignee: Guoqiang Li
>            Priority: Blocker
>             Fix For: 1.0.0
>
>         Attachments: log.txt
>
> {code}
> mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 > log.txt
> {code}
> The log:
> {code}
> UnpersistSuite:
> - unpersist RDD *** FAILED ***
>   java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
>   at java.lang.ClassLoader.checkCerts(ClassLoader.java:952)
>   at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:794)
>   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> {code}