[ https://issues.apache.org/jira/browse/DRILL-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17474660#comment-17474660 ]
Charles Givre commented on DRILL-8096:
--------------------------------------

[~pj.fanning] this would be very helpful to add as well. We'd certainly welcome a PR for that!

> format-excel reader: support different Shared String implementations
> --------------------------------------------------------------------
>
>                 Key: DRILL-8096
>                 URL: https://issues.apache.org/jira/browse/DRILL-8096
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Data Types
>            Reporter: PJ Fanning
>            Priority: Major
>
> One of the biggest users of memory and processing time when reading Excel
> files is handling the Shared Strings Table.
> excel-streaming-reader v3.3.0 supports 3 implementations.
> I would suggest that Drill should use the ReadOnlySharedStringTable as the
> default.
> Drill currently uses the full-featured Apache POI SharedStringTable by
> default (which requires more memory and parsing effort).
> There is also a TempFileSharedStringTable which uses a temp file to keep the
> data out of heap memory. This is still pretty fast because it is implemented
> using an H2 database MVMap.
> If allowing users to configure which implementation they want sounds
> useful, I can do a PR.
>

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
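To make the proposal in the issue concrete, here is a minimal sketch of how a user-facing setting could map onto the three Shared Strings implementations, with the read-only one as the default. Everything in it is an assumption for illustration: the class `SharedStringsConfig`, the enum `Mode`, and the option values `read_only`, `full`, and `temp_file` are hypothetical names, not Drill's or excel-streaming-reader's actual API. The step that would pass the chosen mode to the reader is deliberately left out.

```java
import java.util.Locale;

// Hypothetical sketch: none of these names come from Drill or excel-streaming-reader.
class SharedStringsConfig {

    /** The three implementations described in the issue, under illustrative names. */
    enum Mode {
        READ_ONLY,  // ReadOnlySharedStringTable: the suggested default
        FULL,       // full-featured POI SharedStringTable: more memory and parsing effort
        TEMP_FILE   // TempFileSharedStringTable: keeps the data out of heap memory
    }

    /** Default proposed in the issue. */
    static final Mode DEFAULT_MODE = Mode.READ_ONLY;

    /**
     * Parses a user-supplied option value (hypothetical format).
     * A null or blank value falls back to the default; unknown values are rejected.
     */
    static Mode parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return DEFAULT_MODE;
        }
        // Accept "temp-file", "Temp_File", " TEMP_FILE " and similar spellings.
        String normalized = value.trim().toUpperCase(Locale.ROOT).replace('-', '_');
        try {
            return Mode.valueOf(normalized);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                "Unknown shared strings mode '" + value
                    + "' (expected read_only, full, or temp_file)", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(null));
        System.out.println(parse("temp-file"));
        System.out.println(parse("full"));
    }
}
```

Falling back to the read-only mode when the option is unset mirrors the default suggested in the issue. Rejecting unknown values, rather than silently ignoring them, is a design choice of this sketch and not something the issue specifies.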