Paul Rogers created DRILL-5484:
----------------------------------
Summary: easy.text.compliant.RepeatedVarCharOutput creates
unnecessary 64K byte field
Key: DRILL-5484
URL: https://issues.apache.org/jira/browse/DRILL-5484
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.10.0
Reporter: Paul Rogers
Priority: Minor
The "Easy" text readers include a "compliant" reader for reading things like
CSV. That mechanism includes a class, {{RepeatedVarCharOutput}}, which gathers
field data into a single array, "columns".
Part of the work is to implement projection by reading only the needed
columns. This is done with a {{fields}} array. Since the constructor that sets
up the array does not know the number of fields, it guesses that there will be
the maximum: 64K.
{code}
public static final int MAXIMUM_NUMBER_COLUMNS = 64 * 1024;
...
boolean[] fields = new boolean[MAXIMUM_NUMBER_COLUMNS];
{code}
This is, of course, a quick & dirty solution, but it is a heavy price to pay
for what amounts to a single bit indicating that we want to read all fields.
It is not clear that the performance advantage of a flag check is worth the
cost of allocating many 64K heap blocks: we need one per file per reader.
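
One possible direction, as a minimal sketch only: keep an explicit "select
all" flag and size the per-column mask to the columns actually projected. The
class {{ProjectionMask}} and its methods are hypothetical names made up for
this illustration; they are not part of Drill, and the real reader may need a
different shape. {{java.util.BitSet}} is used here in place of the
{{boolean[]}}, which is an assumption of the sketch rather than a measured
choice.

```java
import java.util.BitSet;

/**
 * Hypothetical stand-in for the fixed-size boolean[] projection mask.
 * Not Drill code: names and structure are assumptions for illustration.
 */
final class ProjectionMask {
    // True when every column is wanted; no per-column storage is needed then.
    private final boolean selectAll;
    // Consulted only when selectAll is false; grows to the highest set index.
    private final BitSet selected;
    // Highest projected column index, or -1 if no column was listed.
    private final int maxSelected;

    private ProjectionMask(boolean selectAll, BitSet selected) {
        this.selectAll = selectAll;
        this.selected = selected;
        this.maxSelected = selected.length() - 1;
    }

    /** Mask for "read all fields": a single flag, no array. */
    static ProjectionMask all() {
        return new ProjectionMask(true, new BitSet(0));
    }

    /** Mask for an explicit list of zero-based column indexes. */
    static ProjectionMask of(int... columnIndexes) {
        BitSet bits = new BitSet();
        for (int index : columnIndexes) {
            if (index < 0) {
                throw new IllegalArgumentException("negative column index: " + index);
            }
            bits.set(index);
        }
        return new ProjectionMask(false, bits);
    }

    /** Should the value of this column be gathered? */
    boolean isProjected(int columnIndex) {
        return selectAll || selected.get(columnIndex);
    }

    /** True once no later column on the row can be projected. */
    boolean isPastLastProjected(int columnIndex) {
        return !selectAll && columnIndex > maxSelected;
    }
}
```

With this shape, {{ProjectionMask.all()}} costs one small object per reader, and
{{ProjectionMask.of(1, 3)}} holds only enough bits to cover index 3. Whether the
extra branch in {{isProjected}} is measurably slower than a plain array lookup
would have to be checked against the real reader.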