[jira] [Updated] (DRILL-7423) Create More Efficient Way to Read Excel Cells

2019-10-28 Thread Charles Givre (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Givre updated DRILL-7423:
-
Description: 
The Excel format plugin reads cells but there are ways to make the reading 
process more efficient.  Since the schema of an Excel file is not known in 
advance, Drill must read the first row of data in order to extract the schema.  

It is actually a bit more complex.  To read the schema, Drill must first read 
the header rows and convert them all into Strings.  This gets us the header 
names if present.
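
As a rough illustration of the header step described above, here is a minimal sketch. It is not Drill's actual code: the header row is modelled as a plain Object[] rather than Apache POI cells, and the class name HeaderSketch, the method headerNames, and the "field_N" placeholder for blank cells are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Drill's actual code. A header row is modelled as a
// plain Object[] instead of spreadsheet cells to keep the sketch self-contained.
class HeaderSketch {

  // Converts every header cell to a String column name. Blank or missing
  // cells get a generated placeholder name (an assumption for illustration).
  static List<String> headerNames(Object[] headerRow) {
    List<String> names = new ArrayList<>();
    for (int i = 0; i < headerRow.length; i++) {
      Object cell = headerRow[i];
      String name = (cell == null) ? "" : String.valueOf(cell).trim();
      names.add(name.isEmpty() ? "field_" + i : name);
    }
    return names;
  }

  public static void main(String[] args) {
    // A numeric header (2019) and a missing header (null) are both handled.
    System.out.println(headerNames(new Object[] {"id", 2019, null, " price "}));
  }
}
```

Converting everything to a String first means a numeric or date header cell still yields a usable column name, at the cost of one extra pass over the header row.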

Drill cannot create writers until it actually reads the first row of data where 
it will determine the data types.  This creates an inefficiency in that when 
Drill is writing the columns, it has to do a hash lookup for each column.  
Since the columns are in a fixed order, it may be possible to store the writers 
in an array and gain some efficiency there.
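
The difference between the two lookup strategies can be sketched as follows. This is only an illustration under stated assumptions: CellWriter and ListWriter are hypothetical stand-ins, not Drill's column writer classes, and the method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Drill's actual writer API.
class WriterLookupSketch {

  // Stand-in for a per-column writer.
  interface CellWriter {
    void write(Object value);
  }

  // Simple writer that records what it was given, so results can be compared.
  static final class ListWriter implements CellWriter {
    final List<Object> values = new ArrayList<>();

    @Override
    public void write(Object value) {
      values.add(value);
    }
  }

  // Approach described in the issue: one hash lookup per column, per row.
  static void writeRowByName(Map<String, CellWriter> writers, List<String> names, Object[] row) {
    for (int i = 0; i < row.length; i++) {
      writers.get(names.get(i)).write(row[i]);
    }
  }

  // Built once, after the first data row has fixed the column order.
  static CellWriter[] toArray(Map<String, CellWriter> writers, List<String> names) {
    CellWriter[] byIndex = new CellWriter[names.size()];
    for (int i = 0; i < byIndex.length; i++) {
      byIndex[i] = writers.get(names.get(i));
    }
    return byIndex;
  }

  // Proposed approach: writers addressed by column position, no hashing per cell.
  static void writeRowByIndex(CellWriter[] writers, Object[] row) {
    for (int i = 0; i < row.length; i++) {
      writers[i].write(row[i]);
    }
  }

  public static void main(String[] args) {
    List<String> names = List.of("id", "price");
    Map<String, CellWriter> byName = new HashMap<>();
    ListWriter id = new ListWriter();
    ListWriter price = new ListWriter();
    byName.put("id", id);
    byName.put("price", price);

    writeRowByName(byName, names, new Object[] {1, 9.5});
    writeRowByIndex(toArray(byName, names), new Object[] {2, 3.25});
    System.out.println(id.values + " " + price.values);
  }
}
```

The array is populated a single time, so the hash lookups are paid once per column instead of once per cell. Whether the saving is measurable in practice would need profiling.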

Also at present, if the columns are heterogeneous, Drill requires the user to 
use allTextMode to query the data.  It would be nice if Drill could query the 
data without having to set that.
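
One possible way to approach this, shown purely as a sketch and not as what the issue specifies, is to resolve a type per column and fall back to text only for the columns whose cells disagree. The class ColumnTypeSketch, the enum ColType, and the rule "mixed types become TEXT" are all assumptions for illustration; cells are modelled as plain Java objects.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of per-column type resolution, not Drill's actual logic.
class ColumnTypeSketch {

  enum ColType { NUMERIC, BOOLEAN, TEXT }

  // Classifies a single cell value.
  static ColType typeOf(Object cell) {
    if (cell instanceof Number) {
      return ColType.NUMERIC;
    }
    if (cell instanceof Boolean) {
      return ColType.BOOLEAN;
    }
    return ColType.TEXT;
  }

  // Returns the single type shared by all non-null cells in the column,
  // or TEXT when the types disagree or the column has no values at all.
  static ColType resolve(List<Object> columnCells) {
    ColType resolved = null;
    for (Object cell : columnCells) {
      if (cell == null) {
        continue; // empty cells do not influence the column type
      }
      ColType current = typeOf(cell);
      if (resolved == null) {
        resolved = current;
      } else if (resolved != current) {
        return ColType.TEXT;
      }
    }
    return resolved == null ? ColType.TEXT : resolved;
  }

  public static void main(String[] args) {
    System.out.println(resolve(Arrays.<Object>asList(1, 2.5, null)));
    System.out.println(resolve(Arrays.<Object>asList(1, "n/a", 3)));
  }
}
```

With a rule like this, only the mixed columns would be read as text, rather than every column in the file as allTextMode does. It does require looking at more than the first data row before the column type is fixed.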

  was:
The Excel format plugin reads cells but there are ways to make the reading 
process more efficient.  Since the schema of an Excel file is not known in 
advance, Drill must read the first row of data in order to extract the schema.  

It is actually a bit more complex.  To read the schema, Drill must first read 
the header rows and convert them all into Strings.  This gets us the header 
names if present.

Drill cannot create writers until it actually reads the first row of data where 
it will determine the data types.  This creates an inefficiency in that when 
Drill is writing the columns, it has to do a hash lookup for each column.  
Since the columns are in a fixed order, it may be possible to store the writers 
in a 


> Create More Efficient Way to Read Excel Cells
> -
>
> Key: DRILL-7423
> URL: https://issues.apache.org/jira/browse/DRILL-7423
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.18.0
> Reporter: Charles Givre
> Priority: Major
>
> The Excel format plugin reads cells but there are ways to make the reading 
> process more efficient.  Since the schema of an Excel file is not known in 
> advance, Drill must read the first row of data in order to extract the 
> schema.  
> It is actually a bit more complex.  To read the schema, Drill must first read 
> the header rows and convert them all into Strings.  This gets us the header 
> names if present.
> Drill cannot create writers until it actually reads the first row of data 
> where it will determine the data types.  This creates an inefficiency in that 
> when Drill is writing the columns, it has to do a hash lookup for each 
> column.  Since the columns are in a fixed order, it may be possible to store 
> the writers in an array and gain some efficiency there.
> Also at present, if the columns are heterogeneous, Drill requires the user to 
> use allTextMode to query the data.  It would be nice if Drill could query the 
> data without having to set that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7423) Create More Efficient Way to Read Excel Cells

2019-10-28 Thread Charles Givre (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Givre updated DRILL-7423:
-
Description: 
The Excel format plugin reads cells but there are ways to make the reading 
process more efficient.  Since the schema of an Excel file is not known in 
advance, Drill must read the first row of data in order to extract the schema.  

It is actually a bit more complex.  To read the schema, Drill must first read 
the header rows and convert them all into Strings.  This gets us the header 
names if present.

Drill cannot create writers until it actually reads the first row of data where 
it will determine the data types.  This creates an inefficiency in that when 
Drill is writing the columns, it has to do a hash lookup for each column.  
Since the columns are in a fixed order, it may be possible to store the writers 
in a 

  was:The Excel format plugin reads cells but there are ways to make the 
reading process more efficient.  


> Create More Efficient Way to Read Excel Cells
> -
>
> Key: DRILL-7423
> URL: https://issues.apache.org/jira/browse/DRILL-7423
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.18.0
> Reporter: Charles Givre
> Priority: Major
>
> The Excel format plugin reads cells but there are ways to make the reading 
> process more efficient.  Since the schema of an Excel file is not known in 
> advance, Drill must read the first row of data in order to extract the 
> schema.  
> It is actually a bit more complex.  To read the schema, Drill must first read 
> the header rows and convert them all into Strings.  This gets us the header 
> names if present.
> Drill cannot create writers until it actually reads the first row of data 
> where it will determine the data types.  This creates an inefficiency in that 
> when Drill is writing the columns, it has to do a hash lookup for each 
> column.  Since the columns are in a fixed order, it may be possible to store 
> the writers in a 


