Do you not use the catalog?
On Thu, Aug 10, 2017 at 3:22 PM, Andrew Ash <and...@andrewash.com> wrote:

> I would support moving ORC from sql/hive -> sql/core because it brings me
> one step closer to eliminating Hive from my Spark distribution by removing
> -Phive at build time.
>
> On Thu, Aug 10, 2017 at 9:48 AM, Dong Joon Hyun <dh...@hortonworks.com>
> wrote:
>
>> Thank you again for coming and reviewing this PR.
>>
>> So far, we have discussed the following.
>>
>> 1. `Why are we adding this to core? Why not just the hive module?` (@rxin)
>>
>>    - The `sql/core` module gives more benefit than `sql/hive`.
>>
>>    - The Apache ORC library (`no-hive` version) is a general and reasonably
>>      small library designed for non-hive apps.
>>
>> 2. `Can we add smaller amount of new code to use this, too?` (@kiszk)
>>
>>    - The previous #17980, #17924, and #17943 are the complete examples
>>      containing this PR.
>>
>>    - This PR focuses on the dependency only.
>>
>> 3. `Why don't we then create a separate orc module? Just copy a few of
>>    the files over?` (@rxin)
>>
>>    - The Apache ORC library is the same as most other data sources (CSV,
>>      JDBC, JSON, PARQUET, TEXT), which live inside `sql/core`.
>>
>>    - It's better to use it as a library instead of copying ORC files because
>>      the Apache ORC shaded jar has many files. We had better depend on the
>>      Apache ORC community's effort until an unavoidable reason for copying
>>      occurs.
>>
>> 4. `I do worry in the future whether ORC would bring in a lot more jars`
>>    (@rxin)
>>
>>    - The ORC core library's dependency tree is aggressively kept as small
>>      as possible. I've gone through and excluded unnecessary jars from our
>>      dependencies. I also kick back pull requests that add unnecessary new
>>      dependencies. (@omalley)
>>
>> 5. `In the long term, Spark should move to using only the vectorized
>>    reader in ORC's core` (@omalley)
>>
>>    - Of course.
>>
>> I’ve been waiting for new comments and discussion since last week.
>>
>> Apparently, there are no further comments except the last comment (5) from
>> Owen this week.
>>
>> Please give your opinion if you think we need some change on the current
>> PR (as-is).
>>
>> FYI, there is one LGTM on the PR (as-is) and no -1 so far.
>>
>> Thank you again for supporting the new ORC improvement in Apache Spark.
>>
>> Bests,
>>
>> Dongjoon.
>>
>>
>> *From:* Dong Joon Hyun <dh...@hortonworks.com>
>> *Date:* Friday, August 4, 2017 at 8:05 AM
>> *To:* "dev@spark.apache.org" <dev@spark.apache.org>
>> *Cc:* Apache Spark PMC <priv...@spark.apache.org>
>> *Subject:* Use Apache ORC in Apache Spark 2.3
>>
>> Hi, All.
>>
>> Apache Spark has always been a fast and general engine, and it has
>> supported Apache ORC inside the `sql/hive` module with a Hive dependency
>> since Spark 1.4.X (SPARK-2883).
>>
>> However, there are many open issues about `Feature parity for ORC with
>> Parquet (SPARK-20901)` as of today.
>>
>> With the new Apache ORC 1.4 (released 8th May), Apache Spark is able to get
>> the following benefits.
>>
>> - Usability:
>>
>>     * Users can use `ORC` data sources without the hive module (-Phive),
>>       like the `Parquet` format.
>>
>> - Stability & Maintainability:
>>
>>     * ORC 1.4 already has many fixes.
>>
>>     * In the future, Spark can upgrade the ORC library independently from
>>       Hive (similar to the Parquet library, too).
>>
>>     * Eventually, reduce the dependency on old Hive 1.2.1.
>>
>> - Speed:
>>
>>     * Last but not least, Spark can use both Spark `ColumnarBatch`
>>       and ORC `RowBatch` together, which means full vectorization support.
>>
>> First of all, I'd love to improve Apache Spark in the following steps in
>> the time frame of Spark 2.3.
>>
>> - SPARK-21422: Depend on Apache ORC 1.4.0
>>
>> - SPARK-20682: Add a new faster ORC data source based on Apache ORC
>>
>> - SPARK-20728: Make ORCFileFormat configurable between sql/hive and
>>   sql/core
>>
>> - SPARK-16060: Vectorized Orc Reader
>>
>> I’ve been making the above PRs since 9th May, the day after the Apache ORC
>> 1.4 release, but the PRs seem to need more attention from the PMC since
>> this is an important change.
>>
>> Since the discussion on the Apache Spark 2.3 cadence has already started
>> this week, I thought it’s the best time to ask you about this.
>>
>> Could any of you help me proceed with the ORC improvement in the Apache
>> Spark community?
>>
>> Please visit the minimal PR and JIRA issue as a starter.
>>
>> - https://github.com/apache/spark/pull/18640
>> - https://issues.apache.org/jira/browse/SPARK-21422
>>
>> Thank you in advance.
>>
>> Bests,
>>
>> Dongjoon Hyun.