Hi Nicolas,

Charles outlined the choices quite well.

Let's talk about your observation that you find it annoying to deal with the 
full Drill code. There may be some tricks here that can help you.

As you know, I've been revising the text reader and the "EVF" (row set 
framework). Doing so requires a series of pull requests. To move fast, I've 
found the following workflow to be helpful:

* Use a machine with an SSD. A Mac is ideal; a Linux desktop also works (mine 
runs Linux Mint). The SSD makes rebuilds very fast.

* Use unit tests for all your testing. For example, I created dozens of unit 
tests for CSV files to exercise the text reader, and many more to exercise the 
EVF. All development and testing consists of adding/changing code, 
adding/changing tests, and stepping through the unit test and underlying code 
to find bugs.

* Use JUnit categories to run selected unit tests as a group.

In most cases, you let your IDE do the build; you don't need Maven nor do you 
need to build jar files. Edit a file, run a unit test from your IDE and step 
through code. My edit/compile/debug cycle tends to be seconds.

If, however, you find yourself using Maven to build Drill, running unit tests 
from Maven, and attaching a debugger, then your edit/compile/debug cycle will 
be 5+ minutes, which is going to be irritating.

If you are doing a full build so you can use SqlLine to test, then this 
suggests it is time to write a unit test case for that issue so you can run it 
from the IDE. Using the RowSet stuff makes such tests easy. See 
TestCsvWithHeaders [1] for some examples.
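
Here is a rough sketch of the shape such a test takes, modeled on the pattern 
in TestCsvWithHeaders. The table path, file name, columns and data below are 
made up purely for illustration, and the imports from the test framework are 
omitted; check the linked test for the exact package names and setup:

public class TestMyCsvCase extends ClusterTest {

  @BeforeClass
  public static void setup() throws Exception {
    startCluster(ClusterFixture.builder(dirTestWatcher));
    // Write a small test file into the test workspace here, or read one
    // from the test resources, as TestCsvWithHeaders does.
  }

  @Test
  public void testBasicQuery() throws Exception {
    // Hypothetical workspace, file and columns, just for illustration.
    RowSet actual = client.queryBuilder()
        .sql("SELECT a, b FROM `dfs.data`.`example.csvh`")
        .rowSet();

    TupleMetadata expectedSchema = new SchemaBuilder()
        .add("a", MinorType.VARCHAR)
        .add("b", MinorType.VARCHAR)
        .buildSchema();

    RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
        .addRow("10", "foo")
        .addRow("20", "bar")
        .build();

    RowSetUtilities.verify(expected, actual);
  }
}

Running that one test from the IDE rebuilds only the changed classes and drops 
you straight into the debugger, which is what keeps the cycle down to seconds.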

If you run from the IDE and find things don't work, then perhaps there is a 
config issue. Do we have code that looks for a file in $DRILL_HOME/whatever 
rather than using the class path? Is a required native library not on the 
LD_LIBRARY_PATH for the IDE?

Most unit tests are designed to be stateless. They read a file stored in 
resources, or they write a test file, read the file, and discard the file when 
done.

You are using MapRDB to insert data, which, of course, is stateful. So, perhaps 
your test can put the DB into a known start state, insert some records, read 
those records, compare them with the expected results, and clean up the state 
so you are ready for the next test run. Your target is that edit/compile/debug 
cycle of a few seconds.
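
As a rough sketch of that shape: the resetTable(), insertTestRows(), 
buildExpectedRows() and dropTable() helpers below are purely hypothetical 
placeholders for whatever your MapRDB test utilities provide, and the 
maprdb.`test_table` name is made up as well.

public class TestMapRDBWriter extends ClusterTest {

  @Before
  public void resetState() throws Exception {
    // Hypothetical helper: put the DB into a known start state.
    resetTable("test_table");
  }

  @Test
  public void testInsertAndReadBack() throws Exception {
    // Hypothetical helper: insert a handful of known records.
    insertTestRows("test_table", 10);

    // Read the records back through Drill...
    RowSet actual = client.queryBuilder()
        .sql("SELECT * FROM maprdb.`test_table`")
        .rowSet();

    // ...and compare them with the expected results.
    RowSetUtilities.verify(buildExpectedRows(), actual);
  }

  @After
  public void cleanUp() throws Exception {
    // Clean up the state so you are ready for the next test run.
    dropTable("test_table");
  }
}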


Overall, if you can master the art of running Drill via unit tests in your 
IDE, you can move forward very quickly.

Use Maven builds, and run tests via Maven, only when getting ready to submit a 
PR. If you change, say, only the contrib module, you only need to build and 
test that module. If you also change exec, say, then you can just build those 
two modules.
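
For example, standard Maven module selection lets you do something along these 
lines (the module path is just an example; use whichever module you actually 
changed):

mvn install -pl contrib/format-maprdb -am -DskipTests
mvn test -pl contrib/format-maprdb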

To use categories, tag your tests as follows:

@Category(RowSetTests.class)
public class MyTest { ... }

(I'll send the Maven command line separately; I'm not on that machine at the 
moment.)


Thanks much to the team members who helped make this happen. I've since worked 
on other projects that don't have this power, and it is truly a grueling 
experience to wait for long builds and deploys after every change.


Thanks,
- Paul

[1] 
https://github.com/apache/drill/blob/master/exec/java-exec/src/test/java/org/apache/drill/exec/store/easy/text/compliant/TestCsvWithHeaders.java



 

    On Friday, May 31, 2019, 5:17:40 AM PDT, Charles Givre <cgi...@gmail.com> wrote:
 Hi Nicolas, 

You have two options:  
1.  You can develop format plugins and UDFs in Drill by adding them to the 
contrib/ folder and then testing them with unit tests.  Take a look at this PR 
as an example[1].  If you're intending to submit your work to Drill for 
inclusion, this would be my recommendation: you can write the unit tests as 
you go, builds don't take very long, and you can debug easily.
2.  Alternatively, you can package the code separately as shown here[2]. 
However, this option requires you to build it, then copy the jars over to 
DRILL_HOME/jars/3rd_party along with any dependencies, then run Drill.  I'm not 
sure how you could write unit tests this way. 

I hope this helps.


[1]: https://github.com/apache/drill/pull/1749
[2]: https://github.com/cgivre/drill-excel-plugin


> On May 31, 2019, at 8:06 AM, Nicolas A Perez <anicola...@gmail.com> wrote:
> 
> Paul,
> 
> Is it possible to develop my plugin outside of the Drill code, let's say in
> my own repository, and then package it and add it to the location where the
> plugins live? Does that work, too? I just find it annoying to deal with the
> full Drill code in order to develop a plugin. At the same time, I might
> want to detach the development of plugins from the Drill life cycle itself.
> 
> Please advise.
> 
> Best Regards,
> 
> Nicolas A Perez
> 
> On Thu, May 30, 2019 at 9:58 PM Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
> 
>> Hi Nicolas,
>> 
>> A quick check of the code suggests that AbstractWriter is a
>> JSON-serialized description of the physical plan. It represents the
>> information sent from the planner to the execution engine, and is
>> interpreted by the scan operator. That is, it is the "physical plan."
>> 
>> The question is, how does the execution engine create the actual
>> writer based on the physical plan? The only good example seems to be for
>> the FileSystemPlugin. That particular storage plugin is complicated by the
>> additional layer of the format plugins.
>> 
>> There is a bit of magic here. Briefly, Drill uses a BatchCreator to create
>> your writer. It does so via some Java introspection magic. Drill looks for
>> all subclasses of BatchCreator, then uses the type of the second argument to
>> the getBatch() method to find the correct class. This may mean that you
>> need to create one with MapRDBFormatPluginConfig as the type of the second
>> argument.
>> 
>> The getBatch() method then creates the CloseableRecordBatch
>> implementation. This is a full Drill operator, meaning it must handle the
>> Volcano iterator protocol. Looks like you can perhaps use WriterRecordBatch
>> as the writer operator itself. (See EasyWriterBatchCreator and follow the
>> code to understand the plumbing.)
>> 
>> You create a RecordWriter to do the actual work. AFAIK, MapRDB supports
>> the JSON data model (at least in some form). If this is the version you are
>> working on, the fastest development path might just be to copy the
>> JsonRecordWriter and replace the writes to JSON with writes to MapRDB. At
>> least this gives you a place to start looking.
>> 
>> 
>> A more general solution would be to build the writer using some of the
>> recent additions to Drill such as the row set mechanisms for reading a
>> record batch. But, since copying the JSON approach provides a quick & dirty
>> solution, perhaps that is good enough for this particular use case.
>> 
>> 
>> In our book, we recommend building each step one-by-one and doing a quick
>> test to verify that each step works as you expect. If you create your
>> BatchCreator, but not the writer, things won't actually work, but you can
>> set a breakpoint in the getBatch() method to verify that Drill did find your
>> class. And so on.
>> 
>> 
>> Thanks,
>> - Paul
>> 
>> 
>> 
>>    On Thursday, May 30, 2019, 3:05:39 AM PDT, Nicolas A Perez <
>> anicola...@gmail.com> wrote:
>> 
>> Can anyone give me an overview of how to implement AbstractRecordWriter?
>> 
>> What are the mechanics it follows, what should I do, and so on? It will be
>> very helpful.
>> 
>> Best Regards,
>> 
>> Nicolas A Perez
>> --
>> 
>> --------------------------------------------------------------------------------------------
>> Sent by Nicolas A Perez from my GMAIL account.
>> 
>> --------------------------------------------------------------------------------------------
>> 
> 
> 
> 
> -- 
> --------------------------------------------------------------------------------------------
> Sent by Nicolas A Perez from my GMAIL account.
> --------------------------------------------------------------------------------------------
  
