Re: [DISCUSS] Format Plugin Interface

Paul Rogers Sun, 26 Jan 2020 18:21:05 -0800

Hi Charles,

Better APIs are always a good thing!


The EVF ManagedReader interface has the lowest common denominator API: open, 
next (batch), and close.

We can create extensions that provide more structure, such as your 
EasyEVFReader. For example, open() might: 1) fiddle with the DrillFileSystem to 
open the file and seek to the block start location, 2) do something with the 
schema, 3) set up the required shims. next() could then do your steps 2-6, for 
example:


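next() {
  setupBatch();                  // start a new batch
  while (!loader.isFull()) {     // keep going until the batch is full
    if (!readRow()) { break; }   // no more input: stop (your step 6)
    loadRow();                   // map the fields to Drill structures
  }
  finalizeBatch();
}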

Many readers now use reader-specific "shims" to map from input columns/types to 
EVF column writers. In this case, the above loadRow() can be defined to iterate 
over the shims.
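
To make the shim idea concrete, here is a minimal Java sketch. Every name in 
it (ColumnSink, ColumnShim, StringShim, IntShim, ShimRowLoader) is made up for 
illustration. ColumnSink is only a stand-in for an EVF column writer, not the 
real Drill API, and a real reader would differ in the details.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an EVF column writer; NOT Drill's real API.
interface ColumnSink {
  void set(Object value);
}

// A shim converts one raw input field and hands it to its column writer.
interface ColumnShim {
  void load(String[] rawFields);
}

// Passes the raw field through unchanged as a string.
final class StringShim implements ColumnShim {
  private final int index;
  private final ColumnSink sink;

  StringShim(int index, ColumnSink sink) {
    this.index = index;
    this.sink = sink;
  }

  @Override
  public void load(String[] rawFields) {
    sink.set(rawFields[index]);
  }
}

// Parses the raw field as an int before writing it.
final class IntShim implements ColumnShim {
  private final int index;
  private final ColumnSink sink;

  IntShim(int index, ColumnSink sink) {
    this.index = index;
    this.sink = sink;
  }

  @Override
  public void load(String[] rawFields) {
    sink.set(Integer.parseInt(rawFields[index].trim()));
  }
}

// loadRow() is then just a loop over the shims.
final class ShimRowLoader {
  private final List<ColumnShim> shims = new ArrayList<>();

  void add(ColumnShim shim) {
    shims.add(shim);
  }

  void loadRow(String[] rawFields) {
    for (ColumnShim shim : shims) {
      shim.load(rawFields);
    }
  }
}
```

The type-specific conversion lives in each shim, so the per-row loop stays the 
same no matter which columns or types a given reader handles.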

The trick has always been that each reader is a bit different: different 
objects are used, slightly different logic. It seemed simpler to let each 
reader create its own code structure rather than create an elaborate general 
structure that folks must learn.

I'm looking forward to seeing what you've created.

Thanks,
- Paul

 

    On Sunday, January 26, 2020, 8:41:56 AM PST, Charles Givre 
<[email protected]> wrote:  
 
 Hello all,
I wanted to share something that I’m working on and ask for feedback.  I 
started converting the LTSV format plugin to EVF and was able to do that 
fairly quickly.  This is a relatively simple format in that it has one data 
type and no complex fields. 

Instead of just doing the conversion, I wanted to see if we could put some more 
abstraction on the format plugin architecture, to make it easier for people to 
build format plugins without having to learn the various Drill internals.  I’m 
still working on the code and will share it once it is more presentable. 
Basically, I realized that every format plugin is the same at a high level.  
It has to:
1. Open the input source
2. Read the data in
3. Parse the data into rows
4. Parse the rows into fields
5. Map the fields into Drill structures
6. Stop when it runs out of data

Steps 1 and 2 are virtually identical for every format plugin, so they were 
the low-hanging fruit 🍎. Steps 3-5 sounded like an iterator to me, and step 6 
was again something that could be hidden.  

So what I did was write an abstract class called EasyEVFReader which abstracts 
virtually all of the file operations.  It also includes utility functions for 
schema definition (more on that later) and column mapping.  Basically all the 
developer has to do is:
1. Create an iterator class that reads the data and maps it to rows
2. Extend the EasyEVFReader class and assign the iterator to a variable
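
As a rough illustration of that split, here is a minimal Java sketch. It is 
not the EasyEVFReader code, and none of it uses Drill's real classes: 
IteratorBackedReader, LtsvRowIterator, LtsvSketchReader and nextBatch() are 
hypothetical names, rows are plain maps rather than Drill vectors, and the 
input is a plain java.io.Reader rather than a DrillFileSystem stream. The only 
format fact assumed is that LTSV lines are tab-separated label:value pairs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical base class: owns opening, batching and closing (steps 1, 2, 6).
abstract class IteratorBackedReader implements AutoCloseable {
  private final BufferedReader input;
  private Iterator<Map<String, String>> rows;

  IteratorBackedReader(Reader source) {
    this.input = new BufferedReader(source);
  }

  // The only format-specific piece: an iterator over parsed rows (steps 3-5).
  protected abstract Iterator<Map<String, String>> rowIterator(BufferedReader input);

  // Returns up to maxRows rows; an empty list means the data is exhausted.
  public List<Map<String, String>> nextBatch(int maxRows) {
    if (rows == null) {
      rows = rowIterator(input);
    }
    List<Map<String, String>> batch = new ArrayList<>();
    while (batch.size() < maxRows && rows.hasNext()) {
      batch.add(rows.next());
    }
    return batch;
  }

  @Override
  public void close() {
    try {
      input.close();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}

// Parses one LTSV line per row: tab-separated label:value pairs.
final class LtsvRowIterator implements Iterator<Map<String, String>> {
  private final BufferedReader input;
  private String nextLine;

  LtsvRowIterator(BufferedReader input) {
    this.input = input;
    advance();
  }

  // Reads ahead to the next non-empty line, or null at end of input.
  private void advance() {
    try {
      do {
        nextLine = input.readLine();
      } while (nextLine != null && nextLine.isEmpty());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override
  public boolean hasNext() {
    return nextLine != null;
  }

  @Override
  public Map<String, String> next() {
    if (nextLine == null) {
      throw new NoSuchElementException();
    }
    Map<String, String> row = new LinkedHashMap<>();
    for (String field : nextLine.split("\t")) {
      int colon = field.indexOf(':');
      if (colon > 0) {  // fields without a label are skipped in this sketch
        row.put(field.substring(0, colon), field.substring(colon + 1));
      }
    }
    advance();
    return row;
  }
}

// The plugin author's whole job in this sketch: plug in the iterator.
final class LtsvSketchReader extends IteratorBackedReader {
  LtsvSketchReader(Reader source) {
    super(source);
  }

  @Override
  protected Iterator<Map<String, String>> rowIterator(BufferedReader input) {
    return new LtsvRowIterator(input);
  }
}
```

With this shape, a new format needs only its own iterator plus a small 
subclass; the batching loop and the end-of-data check live in the base class.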

I’ll share the code tonight or tomorrow but I wanted to ask what people think 
about the general approach.  My goal was to get rid of the cut/paste code that 
exists in so many plugins and greatly simplify the process. 
Thanks!

