[ 
https://issues.apache.org/jira/browse/DRILL-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17565411#comment-17565411
 ] 

PJ Fanning commented on DRILL-8096:
-----------------------------------

This is not implemented. The excel-streaming-reader library that Drill uses 
now uses ReadOnlySharedStringTable, so that part of this issue is already 
addressed, but letting users choose the implementation when using Drill is 
not yet supported. The feature is potentially useful, but it may be better to 
wait until users start reporting memory footprint issues before adding extra 
Drill features.
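
For illustration only, here is a minimal sketch of how a user-supplied option 
string could be mapped onto the three implementations named in the issue, 
falling back to the read-only table when nothing is set. The class, enum, and 
option values are all hypothetical; neither Drill nor excel-streaming-reader is 
claimed to expose them under these names.

```java
import java.util.Locale;

// Hypothetical names throughout: this models the idea only and does not
// call into Drill or excel-streaming-reader.
public class SharedStringsOption {

    /** The three strategies described in the issue. */
    enum SharedStringsChoice {
        POI_DEFAULT, // full-featured Apache POI SharedStringTable
        READ_ONLY,   // ReadOnlySharedStringTable (less memory and parsing effort)
        TEMP_FILE    // TempFileSharedStringTable (keeps data out of heap memory)
    }

    /** Maps an option string to a choice; null or blank falls back to READ_ONLY. */
    static SharedStringsChoice fromOption(String value) {
        if (value == null || value.trim().isEmpty()) {
            return SharedStringsChoice.READ_ONLY;
        }
        // Accept forms such as "temp-file" or "Temp_File".
        String normalized = value.trim().toUpperCase(Locale.ROOT).replace('-', '_');
        try {
            return SharedStringsChoice.valueOf(normalized);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                "Unknown shared strings implementation: " + value, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fromOption(null));
        System.out.println(fromOption("temp-file"));
    }
}
```

Rejecting unknown values outright, rather than silently falling back, would 
make a mistyped setting visible instead of hiding it behind the default.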

> format-excel reader: support different Shared String implementations
> --------------------------------------------------------------------
>
>                 Key: DRILL-8096
>                 URL: https://issues.apache.org/jira/browse/DRILL-8096
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Data Types
>            Reporter: PJ Fanning
>            Priority: Major
>
> One of the biggest users of memory and processing time when reading Excel 
> files is handling the Shared Strings Table.
> excel-streaming-reader v3.3.0 supports 3 implementations.
> I would suggest that Drill should use the ReadOnlySharedStringTable as the 
> default.
> Drill currently uses the full-featured Apache POI SharedStringTable by 
> default (which requires more memory and parsing effort).
> There is also a TempFileSharedStringTable which uses a temp file to keep the 
> data out of heap memory. This is still pretty fast because it is implemented 
> using an H2 database MVMap.
> If letting users configure which implementation they want sounds useful, I 
> can do a PR.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
