[ https://issues.apache.org/jira/browse/ORC-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16446397#comment-16446397 ]
Dain Sundstrom commented on ORC-350:
------------------------------------
I think we should add a truncated flag to the stats; that way the writer can
simply truncate the value when it is large. The predicate pushdown (PPD) system
can still use the prefix value to filter min/max ranges.
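
As a rough sketch of the idea (not ORC's actual implementation), the Python
below shows how a reader could still prune with truncated bounds. The names
TruncatedStats, build_stats, might_contain and the 8-byte MAX_STAT_LEN are all
hypothetical, and values are assumed to compare as unsigned bytes.

```python
from dataclasses import dataclass

MAX_STAT_LEN = 8  # hypothetical limit; the real value would be a writer setting


@dataclass
class TruncatedStats:
    minimum: bytes
    maximum: bytes
    min_truncated: bool
    max_truncated: bool


def build_stats(values, limit=MAX_STAT_LEN):
    """Compute min/max, chopping each to `limit` bytes and recording the chop."""
    lo, hi = min(values), max(values)
    return TruncatedStats(lo[:limit], hi[:limit], len(lo) > limit, len(hi) > limit)


def might_contain(stats, probe):
    """Return False only when no stored value can equal `probe`."""
    # A truncated minimum is a prefix of the true minimum, so it sorts at or
    # below it and remains a valid lower bound with or without the flag.
    if probe < stats.minimum:
        return False
    if probe > stats.maximum:
        if not stats.max_truncated:
            return False
        # The true maximum starts with the stored prefix. A probe that sorts
        # above the prefix without sharing it sorts above every value that
        # starts with the prefix, so it is above the true maximum as well.
        if not probe.startswith(stats.maximum):
            return False
    return True
```

The minimum needs no special handling; the flag only matters on the maximum
side, where a probe that shares the stored prefix has to be treated as a
possible match.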
Also, we should add min/max support for varbinary. In SQL, varbinary doesn't
mean random binary data; it means bytes without a specified character encoding.
It is common for people to use varbinary to avoid expensive, unnecessary
character encoding, but the data is still usable with min/max (and dictionary
encoding).
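
To illustrate why min/max and dictionaries remain meaningful for encoding-free
bytes, here is a small Python sketch. It assumes values are ordered by unsigned
byte-wise comparison (which is how Python compares bytes objects); the helper
names binary_min_max and dictionary_encode are made up for this example.

```python
def binary_min_max(values):
    """Return (min, max) over non-null byte strings, or None if all are null."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    # bytes objects compare lexicographically as unsigned bytes,
    # so b"\x80" sorts after b"\x7f\xff".
    return min(non_null), max(non_null)


def dictionary_encode(values):
    """Map each byte string to its index in a sorted dictionary of distinct values."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]
```

Neither helper decodes the bytes, so no character encoding is ever needed to
build the statistics or the dictionary.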
> Optionally disable/specify indexes for columns
> ----------------------------------------------
>
> Key: ORC-350
> URL: https://issues.apache.org/jira/browse/ORC-350
> Project: ORC
> Issue Type: Sub-task
> Reporter: Prasanth Jayachandran
> Priority: Major
>
> There are many cases where an entire XML or large JSON document is stored in
> a string column. If we autogenerate indexes on those columns, we often run
> into issues with protobuf stream explosion. The only workaround for now is to
> change the column type from string to binary. It would be good to have an
> option to disable indexes on specific columns.
> Regardless, I think we should have max limits on string column statistics. If
> that limit is exceeded, PPD should handle it accordingly (by returning
> YES_NO_NULL).