[ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068670#comment-17068670 ]

ASF GitHub Bot commented on DRILL-7641:
---------------------------------------

cgivre commented on pull request #2024: DRILL-7641: Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024#discussion_r399246782
 
 

 ##########
 File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java
 ##########
 @@ -267,14 +269,21 @@ private XSSFSheet getSheet() {
 
   /**
    * Returns the column count.  There are a few gotchas here in that we have to know the header row and count the physical number of cells
 -   * in that row.  Since the user can define the header row,
 +   * in that row.  This function also has to move the rowIterator object to the first row of data.
    * @return The number of actual columns
    */
   private int getColumnCount() {
+    // Initialize
+    currentRow = rowIterator.next();
     int rowNumber = readerConfig.headerRow > 0 ? sheet.getFirstRowNum() : 0;
-    XSSFRow sheetRow = sheet.getRow(rowNumber);
 
-    return sheetRow != null ? sheetRow.getPhysicalNumberOfCells() : 0;
 +    // If the headerRow is greater than zero, advance the iterator to the first row of data
 +    // This is unfortunately necessary since the streaming reader eliminated the getRow() method.
+    for(int i = 1; i < rowNumber; i++) {
 
 Review comment:
   Fixed
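
For readers following the thread: the hunk above works around the fact that a streaming row iterator is forward-only, so reaching the header row means consuming every row before it. The following is a minimal sketch of that idea and is not the code from the PR. Rows are modeled as plain List<String> values rather than POI Row objects, and the class name HeaderRowSketch, the nextDataRow() helper, the 0-based headerRow argument, and the choice to count non-null cells as "physical" cells are all assumptions made for illustration.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Objects;

/** Hypothetical stand-in for a reader that only has a forward-only row iterator. */
final class HeaderRowSketch {
  private final Iterator<List<String>> rowIterator;
  private List<String> currentRow;

  HeaderRowSketch(Iterator<List<String>> rowIterator) {
    this.rowIterator = rowIterator;
  }

  /**
   * Advances the iterator to the header row (0-based, an assumption of this
   * sketch) and returns the number of non-null cells in it. Returns 0 when the
   * sheet has fewer rows than headerRow + 1.
   */
  int getColumnCount(int headerRow) {
    if (headerRow < 0) {
      throw new IllegalArgumentException("headerRow must be >= 0");
    }
    // There is no getRow(n) on a forward-only iterator, so step through
    // rows 0..headerRow one at a time.
    for (int i = 0; i <= headerRow; i++) {
      if (!rowIterator.hasNext()) {
        currentRow = null;
        return 0;
      }
      currentRow = rowIterator.next();
    }
    // Null entries model cells that are physically absent from the row.
    return (int) currentRow.stream().filter(Objects::nonNull).count();
  }

  /** Returns the next row after the header, or null when the sheet is exhausted. */
  List<String> nextDataRow() {
    return rowIterator.hasNext() ? rowIterator.next() : null;
  }
}
```

With the rows [Quarterly report, null, null], [id, name, price], [1, widget, 9.99] and headerRow = 1, getColumnCount returns 3 and the following nextDataRow() call yields the "1, widget, 9.99" row, because the iterator has already been moved past the header.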
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Convert Excel Reader to Use Streaming Reader
> --------------------------------------------
>
>                 Key: DRILL-7641
>                 URL: https://issues.apache.org/jira/browse/DRILL-7641
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Text & CSV
>    Affects Versions: 1.17.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>             Fix For: 1.18.0
>
>
> The current implementation of the Excel reader uses the Apache POI reader, 
> which uses excessive amounts of memory. As a result, attempting to read large 
> Excel files causes out-of-memory errors. 
> This PR converts the format plugin to use a streaming reader that is still 
> based on the POI library. The documentation for the streaming reader can be 
> found at [1].
> All unit tests pass, and I tested the plugin with some large Excel files on my 
> computer.
> [1]: [https://github.com/pjfanning/excel-streaming-reader]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
