Hi Charles,
Better APIs are always a good thing!
The EVF ManagedReader interface exposes a lowest-common-denominator API: open,
next (batch), and close.
We can create extensions that provide more structure, such as your
EasyEVFReader. For example, open() might: 1) fiddle with the DrillFileSystem to
open the file and seek to the block start location, 2) do something with the
schema, 3) set up required shims. next() could then do your steps 2-6, for example:
next() {
  setupBatch();
  while (!loader.isFull()) {
    if (!readRow()) { break; }
    loadRow();
  }
  finalizeBatch();
}
Many readers now use reader-specific "shims" to map from input columns/types to
EVF column writers. In this case, the above loadRow() can be defined to iterate
over the shims.
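
To make the loop and the shim idea concrete, here is a rough, self-contained Java sketch. None of these types are real Drill or EVF classes: Loader, Shim, BatchReader, and column() are hypothetical stand-ins invented for illustration, the input is assumed to be comma-separated lines, and batches are plain lists of maps rather than value vectors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these types are Drill or EVF classes.
class BatchLoopSketch {

  /** Stand-in for a batch loader: collects rows until the batch is full. */
  static class Loader {
    private final int batchSize;
    private final List<Map<String, String>> rows = new ArrayList<>();
    private Map<String, String> current;

    Loader(int batchSize) { this.batchSize = batchSize; }

    boolean isFull() { return rows.size() >= batchSize; }
    void startRow() { current = new LinkedHashMap<>(); }
    void setColumn(String name, String value) { current.put(name, value); }
    void saveRow() { rows.add(current); }

    /** Hands back the rows gathered so far and resets for the next batch. */
    List<Map<String, String>> harvest() {
      List<Map<String, String>> batch = new ArrayList<>(rows);
      rows.clear();
      return batch;
    }
  }

  /** A "shim" copies one input field into one output column. */
  interface Shim {
    void load(String[] fields, Loader loader);
  }

  /** Builds a shim that maps the field at the given index to a named column. */
  static Shim column(String name, int index) {
    return (fields, loader) ->
        loader.setColumn(name, index < fields.length ? fields[index] : null);
  }

  static class BatchReader {
    private final Iterator<String> lines;
    private final List<Shim> shims;
    private final Loader loader;
    private String[] fields;

    BatchReader(List<String> input, List<Shim> shims, int batchSize) {
      this.lines = input.iterator();
      this.shims = shims;
      this.loader = new Loader(batchSize);
    }

    /** Reads and splits one input line; returns false at end of input. */
    private boolean readRow() {
      if (!lines.hasNext()) { return false; }
      fields = lines.next().split(",");
      return true;
    }

    /** Iterates over the shims to copy the current row into the loader. */
    private void loadRow() {
      loader.startRow();
      for (Shim shim : shims) { shim.load(fields, loader); }
      loader.saveRow();
    }

    /** Returns the next batch, or an empty list once the input is exhausted. */
    List<Map<String, String>> next() {
      while (!loader.isFull()) {
        if (!readRow()) { break; }
        loadRow();
      }
      return loader.harvest();
    }
  }

  /** Reads three rows in batches of two and returns every non-empty batch. */
  static List<List<Map<String, String>>> demo() {
    BatchReader reader = new BatchReader(
        Arrays.asList("a,1", "b,2", "c,3"),
        Arrays.asList(column("name", 0), column("value", 1)),
        2);
    List<List<Map<String, String>>> batches = new ArrayList<>();
    List<Map<String, String>> batch;
    while (!(batch = reader.next()).isEmpty()) { batches.add(batch); }
    return batches;
  }
}
```

In this sketch the only reader-specific parts are readRow() and the list of shims; the loop itself never changes, which is the part a shared base class could plausibly own.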
The trick has always been that each reader is a bit different: different
objects are used, slightly different logic. It seemed simpler to let each
reader create its own code structure rather than create an elaborate general
structure that folks must learn.
I'm looking forward to seeing what you created.
Thanks,
- Paul
On Sunday, January 26, 2020, 8:41:56 AM PST, Charles Givre
<[email protected]> wrote:
Hello all,
I wanted to share something that I’m working on and ask for feedback. I
started working on converting the LTSV format plugin to EVF and basically was
able to do that pretty quickly. This is a relatively simple format in that it
has one data type and no complex fields.
Instead of just doing the conversion I wanted to see if we could put some more
abstraction on the format plugin architecture that would make it easier for
people to build format plugins without having to learn the various Drill
internals. I’m still working on the coding and will share once it is more
presentable. Basically I realized that every format plugin is at a high level
the same.
It has to:
1. Open the input source
2. Read that data in
3. Parse that data into rows
4. Parse the rows into fields
5. Map the fields into Drill structures
6. Stop when it runs out of data.
Steps 1 and 2 are virtually identical for every format plugin, and hence that
was the low-hanging fruit 🍎. Steps 3-5 sounded like an iterator to me, and step
6 was again something that could be hidden.
So what I did was write an abstract class called EasyEVFReader which abstracts
virtually all of the file operations. It also includes utility functions for
schema definition (more on that later) and column mapping. Basically all the
developer has to do is:
1. Create an iterator class that reads the data and maps it to the rows
2. Extend the EasyEVFReader class and assign the iterator to a variable.
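
As a rough illustration of the two steps above, here is a minimal Java sketch. It is not the actual EasyEVFReader code and uses no Drill classes: BaseReader, LtsvReader, createIterator(), and nextBatch() are hypothetical names, rows are plain maps, and the input is assumed to be LTSV text (tab-separated label:value pairs, one record per line).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: not the real EasyEVFReader and not Drill classes.
class IteratorReaderSketch {

  /** Base class that owns opening the input, batching, and closing. */
  static abstract class BaseReader implements AutoCloseable {
    private final BufferedReader input;
    private final int batchSize;
    private Iterator<Map<String, String>> rowIterator;

    BaseReader(Reader source, int batchSize) {
      this.input = new BufferedReader(source);
      this.batchSize = batchSize;
    }

    /** The one format-specific piece: turn the raw input into rows. */
    protected abstract Iterator<Map<String, String>> createIterator(BufferedReader input);

    /** Pulls rows until the batch is full or the data runs out. */
    List<Map<String, String>> nextBatch() {
      if (rowIterator == null) { rowIterator = createIterator(input); }
      List<Map<String, String>> batch = new ArrayList<>();
      while (batch.size() < batchSize && rowIterator.hasNext()) {
        batch.add(rowIterator.next());
      }
      return batch;
    }

    @Override
    public void close() {
      try {
        input.close();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
  }

  /** Format-specific subclass: it only supplies the row iterator. */
  static class LtsvReader extends BaseReader {
    LtsvReader(Reader source, int batchSize) { super(source, batchSize); }

    @Override
    protected Iterator<Map<String, String>> createIterator(BufferedReader input) {
      return input.lines().map(LtsvReader::parseLine).iterator();
    }

    /** Splits one line on tabs, then each field on its first colon. */
    static Map<String, String> parseLine(String line) {
      Map<String, String> row = new LinkedHashMap<>();
      for (String field : line.split("\t")) {
        int colon = field.indexOf(':');
        if (colon > 0) {
          row.put(field.substring(0, colon), field.substring(colon + 1));
        }
      }
      return row;
    }
  }

  /** Reads two in-memory LTSV records as a single batch. */
  static List<Map<String, String>> demo() {
    String data = "host:10.0.0.1\tstatus:200\nhost:10.0.0.2\tstatus:404\n";
    try (LtsvReader reader = new LtsvReader(new StringReader(data), 10)) {
      return reader.nextBatch();
    }
  }
}
```

The subclass never touches file handling or batch sizing here; whether that split holds up for formats with richer types or nested data is an open question this sketch does not address.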
I’ll share the code tonight or tomorrow but I wanted to ask what people think
about the general approach. My goal was to get rid of the cut/paste code that
exists in so many plugins and greatly simplify the process.
Thanks!