Dongjoon Hyun created SPARK-25635:
-------------------------------------

             Summary: Support selective direct encoding in native ORC write
                 Key: SPARK-25635
                 URL: https://issues.apache.org/jira/browse/SPARK-25635
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Dongjoon Hyun


Before ORC 1.5.3, `orc.dictionary.key.threshold` and 
`hive.exec.orc.dictionary.key.size.threshold` are applied to all columns. This 
is a big hurdle to enabling dictionary encoding.

From ORC 1.5.3, `orc.column.encoding.direct` is added to enforce direct 
encoding selectively in a column-wise manner. This issue aims to add that 
feature by upgrading ORC from 1.5.2 to 1.5.3.
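
As a rough sketch of how this could look from the user side, the snippet below builds the two writer options and passes them to a PySpark DataFrame writer. It assumes that Spark forwards per-write options to the native ORC writer once it runs on ORC 1.5.3, and that `orc.column.encoding.direct` takes a comma-separated list of column names. The helper names, the column names, and the threshold value of 1.0 are hypothetical and only for illustration.

```python
def orc_direct_encoding_options(direct_columns, dictionary_threshold=1.0):
    """Build ORC writer options (hypothetical helper, not part of Spark or ORC).

    The threshold applies to every column; the listed columns are
    excluded from dictionary encoding and written with direct encoding.
    """
    if not direct_columns:
        raise ValueError("direct_columns must name at least one column")
    return {
        # Illustrative value: allow dictionary encoding regardless of
        # how many distinct keys a column has.
        "orc.dictionary.key.threshold": str(dictionary_threshold),
        # Assumed format: comma-separated column names.
        "orc.column.encoding.direct": ",".join(direct_columns),
    }


def write_orc_with_direct_columns(df, path, direct_columns):
    """Write `df` (a PySpark DataFrame) as ORC using the options above."""
    options = orc_direct_encoding_options(direct_columns)
    df.write.options(**options).orc(path)
```

For example, `write_orc_with_direct_columns(df, "/tmp/orc_out", ["uniq_id"])` would keep dictionary encoding available for low-cardinality columns while writing the hypothetical high-cardinality `uniq_id` column directly.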

The following are the patches in ORC 1.5.3; this feature (ORC-397) is the only 
one directly related to Spark.
{code}
ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts 
multi-byte data (gopalv)
ORC-403: [C++] Add checks to avoid invalid offsets in InputStream
ORC-405. Remove calcite as a dependency from the benchmarks.
ORC-375: Fix libhdfs on gcc7 by adding #include <functional> two places.
ORC-383: Parallel builds fails with ConcurrentModificationException
ORC-382: Apache rat exclusions + add rat check to travis
ORC-401: Fix incorrect quoting in specification.
ORC-385. Change RecordReader to extend Closeable.
ORC-384: [C++] fix memory leak when loading non-ORC files
ORC-391: [c++] parseType does not accept underscore in the field name
ORC-397. Allow selective disabling of dictionary encoding. Original patch was 
by Mithun Radhakrishnan.
ORC-389: Add ability to not decode Acid metadata columns
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

