Hi Nicolas,

Regarding your point that plugins should be, well, plugins -- independent of 
Drill code: yes, that is true. But no one has invested the time to make it so. 
Doing so would require a clear, stable code API, and an easy way to develop 
such code without the "build jar, copy to DRILL_HOME, restart Drill" cycle 
that Charles mentioned.

There were some recent improvements around the bootstrap file, which is great. 
In the meantime, since the MapR plugin code is already part of Drill, let's 
see if we can get the "work within Drill" approach to work for you. Then, 
perhaps you can use your experience to suggest changes that could be made to 
achieve the "true plugin" goal. All the Drill contributors who are not part of 
the core Drill team would likely very much appreciate a true plugin capability.


I use Eclipse; perhaps others who use IntelliJ can comment on the specifics 
of that IDE.

Drill is divided into modules: your code in the contrib module depends on Drill 
code in java-exec, vector and so on. When I run tests in java-exec in Eclipse, 
Eclipse automatically detects and rebuilds changes in dependent modules such as 
common or vector. This establishes that Eclipse, at least, understands Maven 
dependencies.


I seem to recall that I also got this to work while writing the Drill book, 
when I created an example plugin in the contrib module. I don't recall having 
to change anything to get it to work. Perhaps others who have worked on other 
contrib modules can offer their experience.


So, one thing to check is whether the Maven dependencies are configured 
correctly for the MapR plugin.

One issue I thought we had solved is test-time dependencies. Tim did some 
work to ensure that code in src/test is visible to downstream modules. Which 
symbols/constructs are causing you problems? Perhaps there is more to fix?
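
If the missing symbols live in another module's src/test, one thing to verify 
is that the plugin's pom declares a test-jar dependency on that module. A 
rough sketch from memory (check the poms of other contrib modules for the 
exact form):

  <!-- Sketch: makes java-exec's src/test classes visible to this module -->
  <dependency>
    <groupId>org.apache.drill.exec</groupId>
    <artifactId>drill-java-exec</artifactId>
    <classifier>tests</classifier>
    <scope>test</scope>
    <version>${project.version}</version>
  </dependency>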

For now, perhaps you can target the goal of getting the existing MapR plugin 
code to work properly in the IDE. This is supposed to work, so it might just be 
a matter of resolving a few specific glitches.

Has anyone worked on the MapR DB plugin previously who can offer advice?

Thanks,
- Paul

 

    On Friday, May 31, 2019, 10:10:14 AM PDT, Nicolas A Perez 
<anicola...@gmail.com> wrote:  
 
 One of the issues I have is that I haven’t found a way to debug my tests
from IntelliJ. It keeps saying that some constructs from other modules
are missing.

Also, I haven’t found *simple* examples of how to write *simple* tests.
Every time I look at the existing code, the tests are done in a different
way.

Now, on the other hand, plugins should be independent from Drill core
modules. If you think about it, I can easily write a library that can be
injected into Spark without touching Spark code. For instance, the
DataSource API will load the required parts from my code at run time. Drill
does the same, but the problem is the coupling between Drill and its
extension points.

On the testing side, you have another problem: you cannot easily test your
new modules unless they are within the Drill core code. Maybe it is time to
decouple the test framework from Drill itself, too.

On Fri, May 31, 2019 at 18:38 Paul Rogers <par0...@yahoo.com.invalid> wrote:

> Hi Nicolas,
>
> Charles outlined the choices quite well.
>
> Let's talk about your observation that you find it annoying to deal with
> the full Drill code. There may be some tricks here that can help you.
>
> As you know, I've been revising the text reader and the "EVF" (row set
> framework). Doing so requires a series of pull requests. To move fast, I've
> found the following workflow to be helpful:
>
> * Use a machine with an SSD. A Mac is ideal. A Linux desktop also works
> (mine runs Linux Mint). The SSD makes rebuilds very fast.
>
> * Use unit tests for all your testing. For example, I created dozens of
> unit tests for CSV files to exercise the text reader, and many more to
> exercise the EVF. All development and testing consists of adding/changing
> code, adding/changing tests, and stepping through the unit test and
> underlying code to find bugs.
>
> * Use JUnit categories to run selected unit tests as a group.
>
> In most cases, you let your IDE do the build; you don't need Maven nor do
> you need to build jar files. Edit a file, run a unit test from your IDE and
> step through code. My edit/compile/debug cycle tends to be seconds.
>
> If, however, you find yourself using Maven to build Drill, running unit
> tests from Maven, and attaching a debugger, then your edit/compile/debug
> cycle will be 5+ minutes, which is going to be irritating.
>
> If you are doing a full build so you can use SqlLine to test, then this
> suggests it is time to write a unit test case for that issue so you can run
> it from the IDE. Using the RowSet stuff makes such tests easy. See
> TestCsvWithHeaders [1] for some examples.
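>
> To give a flavor, here is a rough sketch of such a test (patterned on
> TestCsvWithHeaders; imports omitted, and the `dfs.data` workspace and file
> name are made up, so check that class for the exact setup):
>
>   @Category(RowSetTests.class)
>   public class TestMyFormat extends ClusterTest {
>
>     @BeforeClass
>     public static void setup() throws Exception {
>       // Start an embedded Drillbit for the whole test class.
>       startCluster(ClusterFixture.builder(dirTestWatcher));
>     }
>
>     @Test
>     public void testSimpleQuery() throws Exception {
>       String sql = "SELECT a, b FROM `dfs.data`.`example.csvh`";
>       RowSet actual = client.queryBuilder().sql(sql).rowSet();
>
>       // Describe the schema and rows we expect to get back.
>       TupleMetadata expectedSchema = new SchemaBuilder()
>           .add("a", MinorType.VARCHAR)
>           .add("b", MinorType.VARCHAR)
>           .buildSchema();
>       RowSet expected = client.rowSetBuilder(expectedSchema)
>           .addRow("10", "foo")
>           .build();
>
>       // Compare the result, then release the underlying vectors.
>       new RowSetComparison(expected).verifyAndClearAll(actual);
>     }
>   }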
>
> If you run from the IDE and find things don't work, then perhaps there is
> a config issue. Do we have code that looks for a file in
> $DRILL_HOME/whatever rather than using the class path? Is a required native
> library not on the LD_LIBRARY_PATH for the IDE?
>
> Most unit tests are designed to be stateless. They read a file stored in
> resources, or they write a test file, read the file, and discard the file
> when done.
>
> You are using MapRDB to insert data, which, of course, is stateful. So,
> perhaps your test can put the DB into a known start state, insert some
> records, read those records, compare them with the expected results, and
> clean up the state so you are ready for the next test run. Your target is
> that edit/compile/debug cycle of a few seconds.
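>
> As a hedged sketch of that pattern (the table-manipulation helpers here
> are hypothetical placeholders for whatever the MapRDB client provides):
>
>   public class TestMapRDBWriter extends ClusterTest {
>     private static final String TEST_TABLE = "/tmp/drill_test_table";
>
>     @Before
>     public void resetTable() {
>       // Put the DB into a known start state before each test.
>       dropTableIfExists(TEST_TABLE);   // hypothetical helper
>       createEmptyTable(TEST_TABLE);    // hypothetical helper
>     }
>
>     @After
>     public void cleanUp() {
>       // Leave no state behind for the next run.
>       dropTableIfExists(TEST_TABLE);   // hypothetical helper
>     }
>
>     @Test
>     public void testInsertAndReadBack() throws Exception {
>       insertSampleRecords(TEST_TABLE); // hypothetical helper
>       RowSet actual = client.queryBuilder()
>           .sql("SELECT * FROM `maprdb`.`" + TEST_TABLE + "`")
>           .rowSet();
>       // Compare against the expected rows as in the earlier sketch,
>       // then release the buffers.
>       actual.clear();
>     }
>   }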
>
>
> Overall, if you can master the art of running Drill via unit tests in
> your IDE, you can move forward very quickly.
>
> Use Maven builds, and run tests via Maven, only when getting ready to
> submit a PR. If you change, say, only the contrib module, you need only
> build and test that module. If you also change exec, say, then you can just
> build those two modules.
>
> To use categories, tag your tests as follows:
>
> @Category(RowSetTests.class)
> public class MyTest { ... }
>
> (I'll send the Maven command line separately; I'm not on that machine at
> the moment.)
>
>
> Thanks much to the team members who helped make this happen. I've since
> worked on other projects that don't have this power, and it is truly a
> grueling experience to wait for long builds and deploys after every change.
>
>
> Thanks,
> - Paul
>
> [1]
> https://github.com/apache/drill/blob/master/exec/java-exec/src/test/java/org/apache/drill/exec/store/easy/text/compliant/TestCsvWithHeaders.java
>
>
>
>
>
>    On Friday, May 31, 2019, 5:17:40 AM PDT, Charles Givre <
> cgi...@gmail.com> wrote:
>
>  Hi Nicolas,
>
> You have two options:
> 1.  You can develop format plugins and UDFs in Drill by adding them to the
> contrib/ folder and then testing them with unit tests.  Take a look at this
> PR as an example[1].  If you're intending to submit your work to Drill for
> inclusion, this would be my recommendation: you can write the unit tests as
> you go, builds don't take very long, and you can debug.
> 2.  Alternatively, you can package the code separately as shown here[2].
> However, this option requires you to build it, then copy the jars over to
> DRILL_HOME/jars/3rd_party along with any dependencies, then run Drill.  I'm
> not sure how you could write unit tests this way.
>
> I hope this helps.
>
>
> [1]: https://github.com/apache/drill/pull/1749
> [2]: https://github.com/cgivre/drill-excel-plugin
>
>
> > On May 31, 2019, at 8:06 AM, Nicolas A Perez <anicola...@gmail.com>
> > wrote:
> >
> > Paul,
> >
> > Is it possible to develop my plugin outside of the Drill code, let's say
> > in my own repository, and then package it and add it to the location
> > where the plugins live? Does that work, too? I just find it annoying to
> > deal with the full Drill code in order to develop a plugin. At the same
> > time, I might want to detach the development of plugins from the Drill
> > life cycle itself.
> >
> > Please advise.
> >
> > Best Regards,
> >
> > Nicolas A Perez
> >
> > On Thu, May 30, 2019 at 9:58 PM Paul Rogers <par0...@yahoo.com.invalid>
> > wrote:
> >
> >> Hi Nicolas,
> >>
> >> A quick check of the code suggests that AbstractWriter is a
> >> JSON-serialized description of the physical plan. It represents the
> >> information sent from the planner to the execution engine, and is
> >> interpreted by the scan operator. That is, it is the "physical plan."
> >>
> >> The question is, how does the execution engine create the actual
> >> writer based on the physical plan? The only good example seems to be
> >> the FileSystemPlugin. That particular storage plugin is complicated by
> >> the additional layer of the format plugins.
> >>
> >> There is a bit of magic here. Briefly, Drill uses a BatchCreator to
> >> create your writer. It does so via some Java introspection magic: Drill
> >> looks for all subclasses of BatchCreator, then uses the type of the
> >> second argument to the getBatch() method to find the correct class. This
> >> may mean that you need to create one with MapRDBFormatPluginConfig as
> >> the type of the second argument.
> >>
> >> The getBatch() method then creates the CloseableRecordBatch
> >> implementation. This is a full Drill operator, meaning it must handle
> >> the Volcano iterator protocol. It looks like you can use
> >> WriterRecordBatch as the writer operator itself. (See
> >> EasyWriterBatchCreator and follow the code to understand the plumbing.)
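> >>
> >> As a rough sketch of the shape (the MapRDB names are hypothetical, and
> >> the signatures are from memory, so verify them against
> >> EasyWriterBatchCreator):
> >>
> >>   public class MapRDBWriterBatchCreator
> >>       implements BatchCreator<MapRDBWriter> {
> >>
> >>     @Override
> >>     public CloseableRecordBatch getBatch(ExecutorFragmentContext context,
> >>         MapRDBWriter config, List<RecordBatch> children)
> >>         throws ExecutionSetupException {
> >>       // Drill finds this class by reflection, keyed on the type of the
> >>       // second argument (the writer's node in the physical plan).
> >>       assert children != null && children.size() == 1;
> >>       // WriterRecordBatch handles the Volcano iterator protocol; the
> >>       // RecordWriter does the actual writes to MapRDB.
> >>       return new WriterRecordBatch(config, children.iterator().next(),
> >>           context, new MapRDBRecordWriter(config));
> >>     }
> >>   }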
> >>
> >> You create a RecordWriter to do the actual work. AFAIK, MapRDB supports
> >> a JSON data model (at least in some form). If this is the version you
> >> are working on, the fastest development path might just be to copy the
> >> JsonRecordWriter and replace the writes to JSON with writes to MapRDB.
> >> At least this gives you a place to start looking.
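> >>
> >> The skeleton would look roughly like this (method set recalled from the
> >> RecordWriter interface, so check JsonRecordWriter for the exact
> >> contract; the MapRDB comments are hypothetical):
> >>
> >>   public class MapRDBRecordWriter extends AbstractRecordWriter {
> >>
> >>     @Override
> >>     public void init(Map<String, String> options) throws IOException {
> >>       // Open the connection to the target MapRDB table.
> >>     }
> >>
> >>     @Override
> >>     public void updateSchema(VectorAccessible batch) throws IOException {
> >>       // Map the incoming batch schema to MapRDB document fields.
> >>     }
> >>
> >>     @Override
> >>     public void startRecord() throws IOException {
> >>       // Begin building one MapRDB document.
> >>     }
> >>
> >>     @Override
> >>     public void endRecord() throws IOException {
> >>       // Write the completed document to the table.
> >>     }
> >>
> >>     @Override
> >>     public void abort() throws IOException {
> >>       // Discard any partially written state.
> >>     }
> >>
> >>     @Override
> >>     public void cleanup() throws IOException {
> >>       // Flush and close the table connection.
> >>     }
> >>   }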
> >>
> >>
> >> A more general solution would be to build the writer using some of the
> >> recent additions to Drill, such as the row set mechanisms for reading a
> >> record batch. But, since copying the JSON approach provides a quick &
> >> dirty solution, perhaps that is good enough for this particular use
> >> case.
> >>
> >>
> >> In our book, we recommend building each step one by one and doing a
> >> quick test to verify that each step works as you expect. If you create
> >> your BatchCreator, but not the writer, things won't actually work, but
> >> you can set a breakpoint in the getBatch() method to verify that Drill
> >> did find your class. And so on.
> >>
> >>
> >> Thanks,
> >> - Paul
> >>
> >>
> >>
> >>    On Thursday, May 30, 2019, 3:05:39 AM PDT, Nicolas A Perez <
> >> anicola...@gmail.com> wrote:
> >>
> >> Can anyone give me an overview of how to implement AbstractRecordWriter?
> >>
> >> What are the mechanics it follows, what should I do, and so on? It
> >> would be very helpful.
> >>
> >> Best Regards,
> >>
> >> Nicolas A Perez
> >> --
> >>
> >> --------------------------------------------------------------------------------------------
> >> Sent by Nicolas A Perez from my GMAIL account.
> >>
> >> --------------------------------------------------------------------------------------------
> >>
> >
> >
> >
> > --
> >
> > --------------------------------------------------------------------------------------------
> > Sent by Nicolas A Perez from my GMAIL account.
> >
> > --------------------------------------------------------------------------------------------
>

-- 
Nicolas A Perez from GMAIL MOBILE  
