Re: [DISCUSSION] current project state

Charles Givre Mon, 13 Aug 2018 20:38:23 -0700

I’d like to weigh in here as well. As a long-time user of Drill, I would really 
like to see more people using it, and I think there are a few key aspects that 
could really help on that front.


The first is Arrow integration.  I’m not enough of a software 
engineer to understand all the internal details here, but as I understand it, 
the promise of Arrow is that many tools will share a common memory model and 
that it will be possible to transfer data from one tool to the other without 
having to serialize/deserialize the data.  In the data science community, many 
of the major platforms (Python/pandas, R, and Spark) are moving to or have 
adopted Arrow.
Drill’s strength is the ease with which it can query many different data 
sources, and if Drill were to adopt Arrow, I suspect that many people would 
adopt it as a part of a machine learning pipeline.  Just recently, I attempted 
to do some data manipulation using Spark, and couldn’t help but notice how 
difficult it was in contrast with Drill. I’m sure integrating Arrow is a very 
complex task, but I do think that it could be worth it in the end.

Secondly, I’d like to second Paul’s call to simplify the interfaces for UDFs, 
format plugins and, ideally, storage plugins.  A core strength of Drill is its 
extensibility, and making it easier to extend would be a great thing.  I was 
wondering whether it would be possible, or even a good idea, to enable users to 
write UDFs in a scripting language such as Python.

Thirdly, I would really like to see us add more functionality to Drill.  
@Arina, your work to build a storage plugin for Elasticsearch is really great 
and I think more capabilities like that are really needed.  I’d like to see a 
generic HTTP storage plugin and a storage plugin for Google Sheets.  If I can 
figure out how storage plugins work, I’ll gladly work on some of these.

Just my .02.
— C





> On Aug 13, 2018, at 21:21, Paul Rogers <[email protected]> wrote:
> 
> Hi Arina,
> 
> Another topic would be whether/how to round out Drill's data model. Drill's 
> scalar and nullable types are pretty solid. Great work was done recently for 
> Decimal (though the old types still remain.) Good support is now available 
> for nested types to do implicit joins to produce SQL-friendly flat records. 
> But, opportunities for improvement still remain. Date/Time has timezone 
> issues. Union, List and Repeated List never quite worked. There are a few 
> types identified in the code, but not implemented (dates with TZ, tiny ints, 
> etc.) How should Drill bridge the gap between arrays and maps (really, 
> structs) on the one hand and plain-old-relational ODBC/JDBC/BI tools on the 
> other?
> 
> Would be good to finalize the data types and their mapping to plain SQL: 
> either keep a type and make it fully work if it has holes, or drop it. Unions 
> and Lists are the messiest. They are incomplete, in part because they are 
> trying to do the impossible: to predict the future well enough that Drill can 
> handle columns with varying or ambiguous data types (that is, to handle 
> schema changes.) Is there a better way to handle this issue (such as with 
> metadata hints)? That is, rather than fight with conflicting types at run 
> time, simply declare the common type in metadata so all operators and record 
> batches agree on the type.
> 
> And, of course, there is the lingering issue of Drill vectors vs. Arrow. 
> Arrow did great work in metadata, but seems to have kept some of the awkward 
> aspects of Drill's original memory model (lack of control over batch sizes, 
> ability to fragment memory.) Might there be a resyncing of the two projects: 
> Drill picks up Arrow's metadata and APIs, Arrow picks up Drill's memory 
> improvements, such as the size-limiting "result set loader" framework?
> 
> Big-picture issues such as this tend to get lost in the 2270 open Jira 
> tickets. How might the project create some "theme" tickets (or Wiki pages or 
> whatever) to help pull the main issues out of the wealth of detail in Jira?
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Monday, August 13, 2018, 11:07:39 AM PDT, Paul Rogers 
> <[email protected]> wrote:  
> 
> Hi Arina,
> 
> Thanks for launching this discussion. A few minor suggestions.
> 
> The developers have done a fantastic job stabilizing and improving Drill's 
> core functionality. Now the opportunity is to expand the use cases for Drill 
> so that it gets wider adoption within the community. Drill competes for 
> mindshare with Impala, Presto, Hive, Spark and others. A key differentiator 
> for Drill can be the ability to extend the core and integrate Drill into user 
> applications. Of these tools, only Spark has a fully extensible model. Can 
> Drill provide some of the flexibility that has powered Spark to success?
> 
> 1. You mentioned the metastore is under active investigation. Anything yet to 
> share? Didn't see any activity on the JIRA ticket. Metadata is a key gap in 
> Drill. Simply adding a Hive-like metastore would repeat the very errors that 
> Drill was meant to address. Maybe we can toss around ideas for a metadata API 
> that provides greater flexibility.
> 
> 2. Users can extend the core with custom UDFs, storage engines, formats and 
> so on. At present, the code to do this is rather hard to write, debug and 
> maintain. Is there value in streamlining those interfaces so that a wider 
> audience can extend Drill for their specific needs?
> 
> 3. Similarly, we've seen interest in integrating Drill with other systems, 
> which suggests an opportunity for improved APIs. Ability to associate 
> options, defaults and restrictions with users. Ability to use the REST API 
> for larger data sets and with stateful session options. And so on.
> 
> Such extensions are best guided by user demands: what can Drill provide for 
> production applications to enable simpler/faster/more complete integration?  
> 
> Thanks,
> 
> - Paul
> 
> 
> 
>    On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva 
> <[email protected]> wrote:  
> 
> Hi all,
> 
> As the new PMC Chair, I would like to thank users for choosing and using
> Apache Drill and contributors / committers for making improvements and
> fixes. Apache Drill 1.14 was recently released with many improvements and
> new features. Please feel free to try it out and share your experience. As
> always, we would love to hear your success stories of using Apache Drill.
> 
> Also, I encourage users to share any problems found in Drill, as well as any
> suggestions for future improvements. Feel free to start a discussion on the
> mailing list and then file a Jira with the summary. Contributions are
> always welcome: minor, major, doc improvements or grammar fixes. Just file
> a Jira and open a PR. Do not hesitate to ping developers on the mailing
> list if a PR is not being reviewed in a timely manner.
> 
> The latest project reports show:
> The Apache Drill project has a healthy release schedule, and each release
> includes lots of features.
> Users on the mailing lists (user / dev), as well as on Stack Overflow and
> Twitter, are getting substantial support from the active developers.
> New committers are added on a steady basis.
> 
> Overall, the project is growing and moving forward. There were discussions
> about Drill 2.0 last year, and currently the Drill metastore feature is under
> active investigation, which might be the breaking change for 2.0.
> 
> Please feel free to reply to this email with your comments / concerns /
> ideas about current project state.
> 
> Kind regards,
> Arina
