[ https://issues.apache.org/jira/browse/DRILL-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Charles Givre updated DRILL-7423:
---------------------------------
Description:
The Excel format plugin reads cells, but the reading process could be made
more efficient. Because the schema of an Excel file is not known in advance,
Drill must read the first row of data to extract the schema.
In practice it is a bit more complex. To read the schema, Drill must first read
the header rows and convert every cell to a String. This yields the column
names, if a header is present.
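As a rough illustration of the header step, here is a minimal sketch in plain
Java. It does not use the plugin's actual code or the Apache POI API: the
class HeaderSketch, the method headerNames, and the "field_N" fallback name for
blank header cells are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class HeaderSketch {
  // Hypothetical helper: convert raw header cell values (strings, numbers,
  // or null for blank cells) into column names.
  static List<String> headerNames(List<Object> headerCells) {
    List<String> names = new ArrayList<>();
    for (int i = 0; i < headerCells.size(); i++) {
      Object cell = headerCells.get(i);
      // Every header cell is treated as a String, whatever its cell type.
      String name = (cell == null) ? "" : String.valueOf(cell).trim();
      if (name.isEmpty()) {
        // Assumed fallback for a missing header name.
        name = "field_" + i;
      }
      names.add(name);
    }
    return names;
  }
}
```

Note that a numeric header cell keeps its numeric formatting in this sketch, so
a cell holding 2019.0 becomes the column name "2019.0".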
Drill cannot create the column writers until it reads the first row of data,
because that is where it determines the data types. This creates an
inefficiency: each time Drill writes a column value, it has to do a hash lookup
to find that column's writer. Since the columns are in a fixed order, it may be
possible to store the writers in an array and gain some efficiency there.
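The difference between the two approaches can be sketched as follows. This is
only an illustration under stated assumptions: CellWriter and ListWriter are
hypothetical stand-ins for Drill's column writers, not the real writer API, and
the row is modeled as a plain Object array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class WriterArraySketch {
  // Hypothetical stand-in for a Drill column writer.
  interface CellWriter {
    void write(Object value);
  }

  // Trivial writer that records the values it receives.
  static final class ListWriter implements CellWriter {
    final List<Object> values = new ArrayList<>();

    @Override
    public void write(Object value) {
      values.add(value);
    }
  }

  // Lookup by name: one hash lookup per cell, for every row.
  static void writeRowByName(Map<String, CellWriter> writers,
                             String[] columnNames, Object[] row) {
    for (int i = 0; i < row.length; i++) {
      writers.get(columnNames[i]).write(row[i]);
    }
  }

  // Lookup by position: the writers are stored once, in column order,
  // so each cell costs a single array access.
  static void writeRowByIndex(CellWriter[] writers, Object[] row) {
    for (int i = 0; i < row.length; i++) {
      writers[i].write(row[i]);
    }
  }
}
```

The array would be built once, after the first data row has fixed the column
types, and then reused for every following row.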
Also, at present, if the columns are heterogeneous, Drill requires the user to
enable allTextMode to query the data. It would be nice if Drill could query the
data without that setting.
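One possible direction, offered only as a sketch and not as the plugin's
design, is to fall back to text per column instead of for the whole file: a
column whose sampled values are all numeric stays numeric, and only a column
with mixed values is read as text. The class ColumnTypeSketch, the enum
ColType, and the idea of sampling several rows are assumptions made for this
example.

```java
import java.util.List;

final class ColumnTypeSketch {
  // Hypothetical type tags; the names mirror Drill's FLOAT8 and VARCHAR.
  enum ColType { FLOAT8, VARCHAR }

  // Pick a type for one column from a sample of its cell values.
  // Blank cells (null) do not influence the decision, so a column with
  // no non-blank values defaults to FLOAT8 in this sketch.
  static ColType infer(List<Object> sample) {
    for (Object value : sample) {
      if (value != null && !(value instanceof Number)) {
        // Any non-numeric value makes this column text, and only this one.
        return ColType.VARCHAR;
      }
    }
    return ColType.FLOAT8;
  }
}
```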
> Create More Efficient Way to Read Excel Cells
> ---------------------------------------------
>
> Key: DRILL-7423
> URL: https://issues.apache.org/jira/browse/DRILL-7423
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.18.0
> Reporter: Charles Givre
> Priority: Major