I’d like to weigh in here as well. As a longtime user of Drill, I would really like to see more people using it, and I think there are a few key aspects that could help on that front.
The first is Arrow integration. I’m not enough of a software engineer to understand all the internal details, but as I understand it, the promise of Arrow is that many tools will share a common memory model, making it possible to transfer data from one tool to another without having to serialize and deserialize it. In the data science community, many of the major platforms (Python/pandas, R, and Spark) are moving to or have adopted Arrow. Drill’s strength is the ease with which it can query many different data sources, and if Drill were to adopt Arrow, I suspect that many people would adopt it as part of a machine learning pipeline. Just recently, I attempted to do some data manipulation using Spark, and couldn’t help but notice how difficult it was in contrast with Drill. I’m sure this is a very complex task, but I do think that it could be worth it in the end.

Secondly, I’d like to second Paul’s call to simplify the interfaces for UDFs, format plugins and, ideally, storage plugins. A core strength of Drill is its extensibility, and making it easier to extend would be a great thing. I was wondering whether it would be possible, or even a good idea, to enable users to write UDFs in a scripting language such as Python.

Thirdly, I would really like to see us add more functionality to Drill. @Arina, your work to build a storage plugin for ElasticSearch is really great and I think more capabilities like that are really needed. I’d like to see a generic HTTP storage plugin and a storage plugin for Google Sheets. If I can figure out how storage plugins work, I’ll gladly work on some of these.

Just my .02.
-- C

> On Aug 13, 2018, at 21:21, Paul Rogers <[email protected]> wrote:
>
> Hi Arina,
>
> Another topic would be whether/how to round out Drill's data model. Drill's
> scalar and nullable types are pretty solid. Great work was done recently for
> Decimal (though the old types still remain).
> Good support is now available for nested types to do implicit joins to
> produce SQL-friendly flat records. But opportunities for improvement still
> remain. Date/Time has timezone issues. Union, List and Repeated List never
> quite worked. There are a few types identified in the code, but not
> implemented (dates with TZ, tiny ints, etc.). How should Drill bridge the
> gap from arrays and maps (really, structs) on the one hand, and
> plain-old-relational ODBC/JDBC/BI tools on the other?
>
> Would be good to finalize the data types and their mapping to plain SQL:
> either keep a type and make it fully work if it has holes, or drop it. Unions
> and Lists are the messiest. They are incomplete, in part because they are
> trying to do the impossible: to predict the future well enough that Drill can
> handle columns with varying or ambiguous data types (that is, to handle
> schema changes). Is there a better way to handle this issue (such as with
> metadata hints)? That is, rather than fight with conflicting types at run
> time, simply declare the common type in metadata so all operators and record
> batches agree on the type.
>
> And, of course, there is the lingering issue of Drill vectors vs. Arrow.
> Arrow did great work in metadata, but seems to have kept some of the awkward
> aspects of Drill's original memory model (lack of control over batch sizes,
> ability to fragment memory). Might there be a resyncing of the two projects:
> Drill picks up Arrow's metadata and APIs, Arrow picks up Drill's memory
> improvements, such as the size-limiting "result set loader" framework?
>
> Big-picture issues such as this tend to get lost in the 2270 open Jira
> tickets. How might the project create some "theme" tickets (or Wiki pages or
> whatever) to help pull the main issues out of the wealth of detail in Jira?
>
> Thanks,
> - Paul
>
> On Monday, August 13, 2018, 11:07:39 AM PDT, Paul Rogers
> <[email protected]> wrote:
>
> Hi Arina,
>
> Thanks for launching this discussion. A few minor suggestions.
>
> The developers have done a fantastic job stabilizing and improving Drill's
> core functionality. Now the opportunity is to expand the use cases for Drill
> so that it gets wider adoption within the community. Drill competes for
> mindshare with Impala, Presto, Hive, Spark and others. A key differentiator
> for Drill can be the ability to extend the core and integrate Drill into user
> applications. Of these tools, only Spark has a fully extensible model. Can
> Drill provide some of the flexibility that has powered Spark to success?
>
> 1. You mentioned the metastore is under active investigation. Anything yet to
> share? Didn't see any activity on the JIRA ticket. Metadata is a key gap in
> Drill. Simply adding a Hive-like metastore would repeat the very errors that
> Drill was meant to address. Maybe we can toss around ideas for a metadata API
> that provides greater flexibility.
>
> 2. Users can extend the core with custom UDFs, storage engines, formats and
> so on. At present, the code to do this is rather hard to write, debug and
> maintain. Is there value in streamlining those interfaces so that a wider
> audience can extend Drill for their specific needs?
>
> 3. Similarly, we've seen interest in integrating Drill with other systems,
> which suggests an opportunity for improved APIs. Ability to associate
> options, defaults and restrictions with users. Ability to use the REST API
> for larger data sets and with stateful session options. And so on.
>
> Such extensions are best guided by user demands: what can Drill provide for
> production applications to enable simpler/faster/more complete integration?
>
> Thanks,
>
> - Paul
>
> On Monday, August 13, 2018, 5:42:08 AM PDT, Arina Ielchiieva
> <[email protected]> wrote:
>
> Hi all,
>
> as the new PMC Chair I would like to thank users for choosing and using
> Apache Drill and contributors / committers for making improvements and
> fixes. Recently Apache Drill 1.14 was released, bundled with many
> improvements and new features. Please feel free to try it out and share
> your experience. As always, we would love to hear your success stories of
> using Apache Drill.
>
> Also, I encourage users to share any problems found in Drill, as well as any
> suggestions for future improvements. Feel free to start a discussion on the
> mailing list and then file a Jira with the summary. Contributions are
> always welcome: minor, major, doc improvements or grammar fixes. Just file
> a Jira and open the PR. Do not hesitate to ping developers on the mailing
> list if a PR is not being reviewed in a timely manner.
>
> Latest project reports show:
> - The Apache Drill project has a healthy release schedule, and each release
>   includes lots of features.
> - The mailing lists (user / dev) are getting substantial support from the
>   active developers, as are Stack Overflow and Twitter.
> - New committers are added on a steady basis.
>
> Overall, the project is growing and moving forward. There were discussions
> about Drill 2.0 last year, and currently the Drill metastore feature is under
> active investigation, which might be the breaking change for 2.0.
>
> Please feel free to reply to this email with your comments / concerns /
> ideas about the current project state.
>
> Kind regards,
> Arina
