Re: License headers inside Javadoc comments

2024-01-27 Thread Ted Dunning
The right way to get a copyright on every page is to tweak the javadoc
command to use a different template (I would think).



On Fri, Jan 26, 2024 at 12:00 AM Paul Rogers  wrote:

> Hi James,
>
> For some reason, Drill started with the license headers in Javadoc
> comments. The (weak) explanation I got was that we never generate Javadoc,
> so it didn't really matter. Later, we started converting the headers to
> regular comments when convenient.
>
> If we were to generate Javadoc, having the license at the top of each page
> as the summary for each class would probably not be something that anyone
> finds useful.
>
> I don't know how to configure the license plugin. But, I do suspect a
> Python file (or shell script) could make a one-time pass over the files to
> standardize headers into whatever format the team chooses. Only the first
> line of each file would change.
>
> - Paul
>
> On Thu, Jan 25, 2024 at 11:22 PM James Turton  wrote:
>
> > Good morning!
> >
> > I'd like to ask about a feature to make RAT reject license headers that
> > appear inside Javadoc comments (/**) while still requiring them in regular
> > Java comments (/*) in .java files. Currently the Drill project
> > makes use of com.mycila.license-maven-plugin to reject licenses in
> > Javadoc comments because the developers at the time didn't want license
> > headers cluttering the Javadoc website that is generated from the
> > source. Are you aware of a general view on Apache license headers
> > appearing in Javadoc pages? If preventing them from doing so is a good
> > idea, could this become a (configurable) feature in RAT?
> >
> > Thanks
> > James Turton
> >
>
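
Paul's one-time pass could be a short Python script along these lines (a
sketch, assuming the only change needed is rewriting a leading /** to /* on
the first line of each .java file, as he describes):

    from pathlib import Path

    def normalize_header(path: Path) -> None:
        # Rewrite a Javadoc-style license opener (/**) to a plain comment (/*).
        text = path.read_text(encoding="utf-8")
        if text.startswith("/**"):
            path.write_text("/*" + text[3:], encoding="utf-8")

    for java_file in Path("src").rglob("*.java"):
        normalize_header(java_file)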


Re: A deadline for Drill + Daffodil Integration - ApacheCon in Oct.

2023-07-06 Thread Ted Dunning
That's a cool abstract.



On Thu, Jul 6, 2023 at 8:29 AM Mike Beckerle  wrote:

> I decided the only way to force getting this Drill + Daffodil integration
> done, or at least started, is to have a deadline.
>
> So I submitted this abstract below for the upcoming "Community over Code"
> (formerly known as ApacheCon) conference this fall (Oct 7-10)
>
> I'm hoping this forces some of the refactoring that is gating other efforts
> and fixes in Daffodil at the same time.
>
> *Direct Query of Arbitrary Data Formats using Apache Drill and Apache
> Daffodil*
>
>
> Suppose you have data in an ad-hoc data format like EDIFACT, ISO8583,
> Asterix, some COBOL FD, or any other kind of data. You can now describe it
> with a Data Format Description Language (DFDL) schema; then, using Apache
> Drill, you can directly query that data, and those queries can also
> incorporate data from any of Apache Drill's other array of data sources.
> This talk will describe the integration of Apache Drill with Apache
> Daffodil's DFDL implementation. This deep integration implements Drill's
> metadata model in terms of the Daffodil DFDL metadata model, and implements
> Drill's data model in terms of the Daffodil DFDL Infoset API. This enables
> Drill queries to operate intelligently on DFDL-described data without the
> cost of data conversion into an expensive intermediate form like JSON or
> XML. The talk will highlight the specific challenges in this integration
> and the lessons learned that are applicable to integration of other Apache
> projects having their own metadata and data models.
>


Re: [DISCUSS] Add schema support for the XML format

2022-04-06 Thread Ted Dunning
XML will never die. The Cobol programmers were reincarnated and built
similarly long-lasting generators of XML.

If you have a schema, then it is a reasonable format for Drill to parse, if
only to turn around and write to another format.



On Wed, Apr 6, 2022 at 7:31 PM Paul Rogers  wrote:

> Hi Luoc,
>
> First, what poor soul is asked to deal with large amounts of XML in this
> day and age? I thought we were past the XML madness, except in Maven and
> Hadoop config files.
>
> XML is much like JSON, only worse. JSON at least has well-defined types
> that can be gleaned from JSON syntax. With XML...? Anything goes because
> XML is a document mark-up language, not a data structure description
> language.
>
> The classic problem with XML is that if XML is used to describe a
> reasonable data structure (rows and columns), then it can reasonably be
> parsed into rows and columns. If XML represents a document (or a
> relationship graph), then there is no good mapping to rows and columns.
> This was true 20 years ago and it is true today.
>
> So, suppose your XML represents row-like data. Then an XML parser could
> hope for the best and make a good guess at the types and structure. The XML
> parser could work like the new & improved JSON parser (based on EVF2) which
> Vitalii is working on. (I did the original work and Vitalli has the
> thankless task of updating that work to match the current code.) That JSON
> parser is VERY complex as it infers types on the fly. Quick, what type is
> "a" in [{"a": null}, {"a": null}, {"a": []}]. We don't know. Only when
> {"a": [10]} appears can we say, "Oh! All those "a" were REPEATED INTs!"
>
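
A toy sketch of the deferred inference Paul describes (illustration only,
not Drill's actual EVF code): the type of "a" stays unknown until a
concrete value finally appears.

    # The records Paul describes: "a" is null, null, empty list, then [10].
    records = [{"a": None}, {"a": None}, {"a": []}, {"a": [10]}]

    inferred = "UNKNOWN"
    for rec in records:
        v = rec["a"]
        if v is None or v == []:
            continue                    # nothing concrete yet; stay undecided
        if isinstance(v, list) and all(isinstance(x, int) for x in v):
            inferred = "REPEATED INT"   # finally resolved, three records late
            break

    print(inferred)  # REPEATED INT
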
> An XML parser could use the same tricks. In fact, it can probably use the
> same code. In JSON, the parser sends events, and the Drill code does its
> type inference magic based on those events. An XML parser can emit similar
> events, and make similar decisions.
>
> As you noted, if we have a DTD, we don't have to do schema inference. But,
> we do have to do DTD-to-rows-and-columns inference. Once we do that, we use
> the provided schema as you suggested. (The JSON reader I mentioned already
> supports a provided schema to add sanity to the otherwise crazy JSON type
> inference process when data is sparse and changing.)
>
> In fact, if you convert XML to JSON, then the XML-to-JSON parser has to
> make those same decisions. Hopefully someone has already done that and
> users would be willing to use that fancy tool to convert their XML to JSON
> before using Drill. (Of course, if they want good performance, they should
> have converted XML to Parquet instead.)
>
> So, rather than have a super-fancy Drill XML reader, maybe find a
> super-fancy XML-to-Parquet converter, use that once, and then let Drill
> quickly query Parquet. The results will be much better than trying to parse
> XML over and over on each query. Just because we *can* do it doesn't mean
> we *should*.
>
> Thanks,
>
> - Paul
>
>
>
> On Wed, Apr 6, 2022 at 5:01 AM luoc  wrote:
>
> > 
> > Hello dear driller,
> >
> > Before starting the topic, I would like to do a simple survey:
> >
> > 1. Did you know that Drill already supports the XML format?
> >
> > 2. If yes, what is the maximum size of the XML files you normally read?
> > 1MB, 10MB or 100MB?
> >
> > 3. Do you expect that reading XML will be as easy as JSON (Schema
> > Discovery)?
> >
> > Thank you for responding to those questions.
> >
> > XML is different from JSON, and if we rely solely on Drill to deduce the
> > structure of the data (the *SCHEMA*), the code will get very complex and
> > delicate. For example, inferring array structure and numeric range. So, a
> > "provided schema" or "TO_JSON" may be good medicine:
> >
> > *Provided Schema*
> >
> > We can add DTD or XML Schema (XSD) support for XML. It can build all the
> > value vectors (Writers) before reading data, resolving the fields, types,
> > and complex nesting.
> >
> > However, a definition file is actually a rule validator that allows
> > elements to appear 0 or more times. As a result, it is not possible to
> > know whether all elements exist until the data is read. Therefore, we
> > should avoid creating a large number of value vectors that may not
> > actually be needed before reading the data. We can build the top-level
> > schema at the initial stage and add new value vectors as needed during
> > the reading phase.
> >
> > *TO_JSON*
> >
> > Read and convert XML directly to JSON, using the JSON reader for data
> > resolution. This makes it as easy to query XML data as JSON, but it
> > requires reading the whole XML file into memory.
> >
> > I think both can be done, so I look forward to your spirited discussion.
> >
> > Thanks.
> >
> > - luoc
> >
>
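
luoc's TO_JSON option is essentially what the Python xmltodict package
does; a minimal sketch (assuming the whole document fits in memory, which
is exactly the limitation noted in the thread):

    import json
    import xmltodict  # pip install xmltodict

    xml = "<rows><row><a>1</a></row><row><a>2</a></row></rows>"

    # Parse the whole document into dicts/lists, then hand JSON to a JSON reader.
    print(json.dumps(xmltodict.parse(xml), indent=2))
    # {"rows": {"row": [{"a": "1"}, {"a": "2"}]}}
    # Note every value comes back as a string: numeric types still need
    # inference or a provided schema.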


Re: [DISCUSS] Add schema support for the XML format

2022-04-06 Thread Ted Dunning
And if there are zero instances, what happens (curiosity here)?



On Wed, Apr 6, 2022 at 12:28 PM Lee, David 
wrote:

> Which is why using an XSD is more or less foolproof.
>
> If the pet element is tagged with maxOccurs="unbounded" it implies it
> should be saved as an array even if there is just one occurrence of <pet>
> in your data.
>
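
In xmltodict terms, David's maxOccurs hint corresponds to the force_list
option, which also answers Ted's zero-instances question: the key is simply
absent, and we can default to an empty array. A small sketch (element names
hypothetical):

    import xmltodict  # pip install xmltodict

    def pets(xml: str) -> list:
        # force_list plays the role of maxOccurs="unbounded": <pet> always
        # becomes an array, even when it occurs only once...
        doc = xmltodict.parse(xml, force_list=("pet",))
        owner = doc["owner"] or {}        # an empty element parses to None
        return owner.get("pet", [])       # ...and zero occurrences -> []

    print(pets("<owner><pet>dog</pet></owner>"))                # ['dog']
    print(pets("<owner><pet>dog</pet><pet>cat</pet></owner>"))  # ['dog', 'cat']
    print(pets("<owner></owner>"))                              # []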
> -Original Message-
> From: Ted Dunning 
> Sent: Wednesday, April 6, 2022 11:48 AM
> To: dev 
> Cc: u...@drill.apache.org
> Subject: Re: [DISCUSS] Add schema support for the XML format
>
> That example:
>
> > <pet>dog</pet>
> > <pet>cat</pet>
>
>
> can also convert to ["pet":"dog", "pet":"cat"]
>
> XML is rife with problems like this.
>
> As you say.
>
> But worse than can be imagined unless you have been hit by these problems.
>
> On Wed, Apr 6, 2022 at 11:39 AM Lee, David
> wrote:
>
> > TO_JSON won't work in cases where..
> >
> > One file contains: <pet>dog</pet> which converts to {"pet":"dog"}
> >
> > But another file contains:
> > <pet>dog</pet>
> > <pet>cat</pet>
> > which converts to: {"pet": ["dog", "cat"]}
> >
> > pet as a column in Drill can't be both a varchar and an array of
> > varchar
> >
> > There are a ton of gotchas when dealing with XML:
> > numeric vs string
> > scalar vs array
> >
> > -Original Message-
> > From: Lee, David
> > Sent: Wednesday, April 6, 2022 10:54 AM
> > To: u...@drill.apache.org; dev@drill.apache.org
> > Subject: RE: [DISCUSS] Add schema support for the XML format
> >
> > I wrote something to convert XML to JSON using an XSD schema file to
> > resolve fields, types, nested structures, etc. It's the only real way
> > to ensure column-level data integrity.
> >
> > https://github.com/davlee1972/xml_to_json
> >
> > Converts XML to valid JSON or JSONL. Requires only two files to get
> > started: your XML file and the XSD schema file for that XML file.
> >
> > -Original Message-
> > From: luoc 
> > Sent: Wednesday, April 6, 2022 5:01 AM
> > To: u...@drill.apache.org; dev@drill.apache.org
> > Subject: [DISCUSS] Add schema support for the XML format
> >
> > Hello dear driller,
> >
> > Before starting the topic, I would like to do a simple survey:
> >
> > 1. Did you know that Drill already supports the XML format?
> >
> > 2. If yes, what is the maximum size of the XML files you normally read?
> > 1MB, 10MB or 100MB?
> >
> > 3. Do you expect that reading XML will be as easy as JSON (Schema
> > Discovery)?
> >
> > Thank you for responding to those questions.
> >
> > XML is different from JSON, and if we rely solely on Drill to deduce the
> > structure of the data (the SCHEMA), the code will get very complex and
> > delicate. For example, inferring array structure and numeric range. So, a
> > "provided schema" or "TO_JSON" may be good medicine:
> >
> > Provided Schema
> >
> > We can add DTD or XML Schema (XSD) support for XML. It can build all the
> > value vectors (Writers) before reading data, resolving the fields, types,
> > and complex nesting.
> >
> > However, a definition file is actually a rule validator that allows
> > elements to appear 0 or more times. As a result, it is not possible to
> > know whether all elements exist until the data is read. Therefore, we
> > should avoid creating a large number of value vectors that may not
> > actually be needed before reading the data. We can build the top-level
> > schema at the initial stage and add new value vectors as needed during
> > the reading phase.
> >
> > TO_JSON
> >
> > Read and convert XML directly to JSON, using the JSON reader for data
> > resolution. This makes it as easy to query XML data as JSON, but it
> > requires reading the whole XML file into memory.
> >
> > I think both can be done, so I look forward to your spirited discussion.
> >
> > Thanks.
> >
> > - luoc
> >
> >
>


Re: [DISCUSS] Add schema support for the XML format

2022-04-06 Thread Ted Dunning
That example:

<pet>dog</pet>
<pet>cat</pet>


can also convert to ["pet":"dog", "pet":"cat"]

XML is rife with problems like this.

As you say.

But worse than can be imagined unless you have been hit by these problems.

On Wed, Apr 6, 2022 at 11:39 AM Lee, David 
wrote:

> TO_JSON won't work in cases where..
>
> One file contains: <pet>dog</pet> which converts to {"pet":"dog"}
>
> But another file contains:
> <pet>dog</pet>
> <pet>cat</pet>
> which converts to: {"pet": ["dog", "cat"]}
>
> pet as a column in Drill can't be both a varchar and an array of varchar
>
> There are a ton of gotchas when dealing with XML:
> numeric vs string
> scalar vs array
>
> -Original Message-
> From: Lee, David
> Sent: Wednesday, April 6, 2022 10:54 AM
> To: u...@drill.apache.org; dev@drill.apache.org
> Subject: RE: [DISCUSS] Add schema support for the XML format
>
> I wrote something to convert XML to JSON using an XSD schema file to
> resolve fields, types, nested structures, etc. It's the only real way to
> ensure column-level data integrity.
>
> https://github.com/davlee1972/xml_to_json
>
> Converts XML to valid JSON or JSONL. Requires only two files to get
> started: your XML file and the XSD schema file for that XML file.
>
> -Original Message-
> From: luoc 
> Sent: Wednesday, April 6, 2022 5:01 AM
> To: u...@drill.apache.org; dev@drill.apache.org
> Subject: [DISCUSS] Add schema support for the XML format
>
> Hello dear driller,
>
> Before starting the topic, I would like to do a simple survey:
>
> 1. Did you know that Drill already supports the XML format?
>
> 2. If yes, what is the maximum size of the XML files you normally read?
> 1MB, 10MB or 100MB?
>
> 3. Do you expect that reading XML will be as easy as JSON (Schema
> Discovery)?
>
> Thank you for responding to those questions.
>
> XML is different from JSON, and if we rely solely on Drill to deduce the
> structure of the data (the SCHEMA), the code will get very complex and
> delicate. For example, inferring array structure and numeric range. So, a
> "provided schema" or "TO_JSON" may be good medicine:
>
> Provided Schema
>
> We can add DTD or XML Schema (XSD) support for XML. It can build all the
> value vectors (Writers) before reading data, resolving the fields, types,
> and complex nesting.
>
> However, a definition file is actually a rule validator that allows
> elements to appear 0 or more times. As a result, it is not possible to
> know whether all elements exist until the data is read. Therefore, we
> should avoid creating a large number of value vectors that may not
> actually be needed before reading the data. We can build the top-level
> schema at the initial stage and add new value vectors as needed during
> the reading phase.
>
> TO_JSON
>
> Read and convert XML directly to JSON, using the JSON reader for data
> resolution. This makes it as easy to query XML data as JSON, but it
> requires reading the whole XML file into memory.
>
> I think both can be done, so I look forward to your spirited discussion.
>
> Thanks.
>
> - luoc
>
>


Re: [VOTE] Adopt the Drill Test Framework from MapR

2022-03-17 Thread Ted Dunning
Big +1 from me!!!

We at MapR built that test framework to support aggressive testing of
Drill. It would be great to see it continue to meet that need.

(... I was CTO at MapR when it was acquired by HPE but I don't have any
special position in the Drill community ...)


On Thu, Mar 17, 2022 at 2:03 AM James Turton  wrote:

> Hi dev community!
>
> Many of you need no introduction to the test framework developed by MapR
>
> https://github.com/mapr/drill-test-framework
>
> . For those who don't know, the test framework contains around 10k tests
> often exercising scenarios not covered by Drill's unit tests. Just weeks
> ago it revealed a regression in a Drill 1.20 RC and saved us from
> shipping with that bug. The linked repository has been dormant for going
> on two years but I am aware of bits of work that have been done on the
> test framework since, and today Anton is actively dusting off and
> updating it. Since the codebase is under the Apache 2.0 license, we are
> free to bring a copy into the Drill project. I've created a new
> (currently empty) possible home for the test framework at
>
> https://github.com/apache/drill-test-framework
>
> Before I proceed to push a clone there, please vote if you support or
> oppose our adoption of the test framework.
>
> P.S. I have also sent a message to a contact at HPE just in case they
> might be aware of some concern applicable to our copying this repo but,
> > given the license applied, I cannot see that there will be one.
> Should anything get raised (and we'd decided to proceed) I would, of
> course, pause so that we can discuss.
>
> Regards
> James
>


thinking of our Ukrainian friends

2022-02-23 Thread Ted Dunning
For commercial and historical reasons, many of the people who have contributed
to Drill live in Ukraine.

My heart is with them tonight. I hope they stay safe.


Re: [DISCUSS] Some ideas for Drill 1.21

2022-02-09 Thread Ted Dunning
The planning time has been extensively analyzed.

It is inherent in a Volcano-style cost-based optimizer. This is a
branch-and-bound search of an exponential design space.

This bottleneck is very well understood.

Further, it has been accelerated under specialized conditions. As part of
OJAI, a limited form of Drill was included that could work
on specific kinds of tables built into MapR FS. With some rather severe
truncations of the space that the optimizer had to search, the planning
time could be reduced to tens of milliseconds. That was fine for a limited
mission, but some of the really dramatic benefits of Drill on large
queries across complex domains would be impossible with that truncated rule
set.
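
For a sense of scale (a standard result, not Drill-specific): the number of
distinct binary join trees over n relations is n! times the Catalan number
C(n-1), which blows up quickly:

    from math import comb, factorial

    def join_trees(n: int) -> int:
        # Distinct binary join trees over n relations: n! * Catalan(n-1),
        # where Catalan(n-1) = comb(2(n-1), n-1) / n.
        catalan = comb(2 * (n - 1), n - 1) // n
        return factorial(n) * catalan

    for n in (2, 4, 8, 12):
        print(n, join_trees(n))
    # 2 -> 2, 4 -> 120, 8 -> 17297280, 12 -> about 2.8e13: hence the pruning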



On Wed, Feb 9, 2022 at 7:06 PM Paul Rogers  wrote:

> Hi All,
>
> Would be great to understand the source of the slow planning. Back in the
> day, I recall colleagues trying all kinds of things to speed up planning,
> but without the time to really figure out where the time went.
>
> I wonder if the two points are related. If most of that planning time is
> spent waiting for a plugin metadata, then James' & Charles' issue could
> possibly be the cause of the slowness that Ted saw.
>
> James, it is still not clear what plugin metadata is being retrieved, and
> when. Now, it is hard to figure that out; that code is complex. Ideally, if
> you have a dozen plugins enabled, but query only one, then only that one
> should be doing anything. Further, if you're using an external system (like
> JDBC), the plugin should query the remote system tables only for the
> table(s) you hit in your query. If the code asks ALL plugins for
> information, or grabs all tables from the remote system, then, yeah, it's
> going to be slow.
>
> Adding per-plugin caching might make sense. For JDBC, say, it is not likely
> that the schema of the remote DB changes between queries, so caching for
> some amount of time is probably fine. And, if a query asks for an unknown
> column, the plugin could refresh metadata to see if the column was just
> added. (I was told that Impala users constantly had to run REFRESH METADATA
> to pick up new files added to HDFS.)
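
Paul's per-plugin caching idea could be as simple as a TTL cache that also
refreshes early when an unknown column is requested; a sketch with
hypothetical names, not Drill's actual plugin API:

    import time

    class SchemaCache:
        # Per-plugin schema cache: entries expire after ttl seconds, and an
        # unknown column forces an early refresh (the REFRESH METADATA case).

        def __init__(self, fetch, ttl=300):
            self.fetch = fetch              # callable: () -> {table: [columns]}
            self.ttl = ttl
            self.schemas, self.loaded_at = {}, 0.0

        def _refresh(self):
            self.schemas, self.loaded_at = self.fetch(), time.time()

        def columns(self, table, needed=None):
            if time.time() - self.loaded_at > self.ttl:
                self._refresh()
            cols = self.schemas.get(table, [])
            if needed is not None and needed not in cols:
                self._refresh()             # maybe the column was just added
                cols = self.schemas.get(table, [])
            return cols
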
>
> For the classic, original use case (Parquet or CSV files on an HDFS-like
> system), the problem was the need to scan the directory structure at plan
> time to figure out which files to scan at run time. For Parquet, the
> planner also wants to do Parquet row group pruning, which requires reading
> the header of every one of the target files. Since this was slow, Drill
> would create a quick & dirty cache, but with large numbers of files, even
> reading that cache was slow (and, Drill would rebuild it any time a
> directory changed, which greatly slowed planning.)
>
> For that classic use case, saved plans never seemed a win because the
> "shape" of the query heavily depended on the WHERE clause: one clause might
> hit a small set of files, another hit a large set, and that then throws off
> join planning, hash/broadcast exchange decisions and so on.
>
> So, back to the suggestion to start with understanding where the time goes.
> Any silly stuff we can just stop doing? Is the cost due to external
> factors, such as those cited above? Or, is Calcite itself just heavy
> weight? Calcite is a rules engine. Add more rules or more nodes in the DAG,
> and the cost of planning rises steeply. So, are we fiddling about too much
> in the planning process?
>
> One way to test: use a mock data source and plan-time components to
> eliminate all external factors. Time various query shapes using EXPLAIN.
> How long does Calcite take? If a long time, then we've got a rather
> difficult problem as Calcite is hard to fix/replace.
>
> Then, time the plugins of interest. Figure out how to optimize those.
>
> My guess is that the bottleneck won't turn out to be what we think it is.
> It usually isn't.
>
> - Paul
>
> On Tue, Feb 8, 2022 at 8:19 AM Ted Dunning  wrote:
>
> > James, you make some good points.
> >
> > I would generally support what you say except for one special case. I
> think
> > that there is a case to be made to be able to cache query plans in some
> > fashion.
> >
> > The traditional approach to do this is to use "prepared queries" by which
> > the application signals that it is willing to trust that a query plan
> will
> > continue to be correct for the duration of its execution. My experience
> > (and I think the industry's as well) is that the query plan is more
> > stable than the underlying details of the metadata and this level of
> > caching (or more) is a very good idea.
> >
> > In particular, the benefit to Drill is that we have a very expensive
> 
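
The prepared-query idea Ted describes boils down to keying plans by
normalized SQL text; a toy sketch (hypothetical names, not Drill's planner
API):

    import hashlib

    class PlanCache:
        # Toy prepared-query cache: plan once, reuse until invalidated.

        def __init__(self, planner):
            self.planner = planner   # callable sql -> plan; the expensive step
            self.plans = {}

        def prepare(self, sql: str):
            # Normalize whitespace/case so trivially different text shares a plan.
            key = hashlib.sha256(" ".join(sql.split()).lower().encode()).hexdigest()
            if key not in self.plans:
                self.plans[key] = self.planner(sql)
            return self.plans[key]

        def invalidate(self):
            self.plans.clear()       # e.g. when the underlying metadata changes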

Re: [DISCUSS] Some ideas for Drill 1.21

2022-02-08 Thread Ted Dunning
ness is acceptable, but correct me if
> I'm wrong.  My conclusion here is that I'd rather do this last, and only
> after careful consideration.
>
> [1] https://infra.apache.org/mailing-list-moderation.html
> [2] https://www.apache.org/foundation/how-it-works.html#management
> [3] https://github.com/apache/drill/pull/2388
>
> On 2022/02/07 21:05, Ted Dunning wrote:
> > Another option is to store metadata as data in a distributed data store.
> > For static resources, that can scale very well. For highly dynamic
> > resources like conventional databases behind JDBC connections, you can
> > generally delegate metadata to that layer. Performance for delegated
> > metadata won't necessarily be great, but those systems are usually either
> > small (like Postgres or MySQL) or fading away (like Hive).
> >
> > Focusing metadata and planning to a single node will make query
> concurrency
> > much worse (and it's already not good).
> >
> >
> > On Sun, Feb 6, 2022 at 6:28 PM Paul Rogers  wrote:
> >
> >> Hi All,
> >>
> >> Drill, like all open source projects, exists to serve those that use
> it. To
> >> that end, the best contributions come when some company needs a feature
> >> badly enough that it is worth the effort to develop and contribute a
> >> solution. That's pretty standard, as along as the contribution is
> general
> >> purpose. In fact, I hope everyone using Drill in support of their
> company
> >> will contribute enhancements back to Drill. If you maintain your own
> >> private fork, you're not helping the community that provided you with
> the
> >> bulk of the code.
> >>
> >> For the info schema, I'm at a loss to guess why this would be slow,
> unless
> >> every plugin is going off and scanning some external source. Knowing
> that
> >> we have a dozen plugins is not slow. Looking at plugin configs is not
> slow.
> >> What could be slow is if you want to know about every possible file in
> HDFS
> >> or S3, every database and table in an external DB, etc. In this case,
> the
> >> bottleneck is either the external system, or the act of querying a dozen
> >> different external systems. Perhaps, Charles, you can elaborate on the
> >> specific scenario you have in mind.
> >>
> >> Depending on the core issues, there are various solutions. One solution
> is
> >> to cache all the external metadata in Drill. That's what Impala did with
> >> the Hive Metastore, and it was a mess. I don't expect Drill would do any
> >> better a job. One reason it was a mess is that, in a production system,
> >> there is a vast amount of metadata. You end up playing all manner of
> tricks
> >> to try to compress it. Since Drill (and Impala) are fully symmetric,
> each
> >> node has to hold the entire cache. That is memory that can't be used to
> run
> >> queries. So, to gain performance (for metadata) you give up performance
> (at
> >> run time.)
> >>
> >> One solution is to create a separate metadata cache node. The query
> goes to
> >> some Drillbit that acts as Foreman. The Foreman plans the query and
> >> retrieves the needed metadata from the metadata node. The challenge
> here is
> >> that there will be a large amount of metadata transferred, and the next
> >> thing we know we'll want to cache it in each Drillbit, putting us back
> >> where we started.
> >>
> >> So, one can go another step: shift all query planning to the metadata
> node
> >> and have a single planner node. The user connects to any Drillbit as
> >> Foreman, but that Foreman first talks to the "planner/metadata" node to
> >> give it SQL and get back a plan. The Foreman then runs the plan as
> usual.
> >> (The Foreman runs the root fragment of the plan, which can be compute
> >> intensive, so we don't want the planner node to also act as the
> Foreman.)
> >> The notion here is that the SQL in/plan out is much smaller than the
> >> metadata that is needed to compute the plan.
> >>
> >> The idea about metadata has long been that Drill should provide a
> metadata
> >> API. The Drill metastore should be seen as just one of many metadata
> >> implementations. The Drill metastore is a "starter solution" for those
> who
> >> have not already invested in another solution. (Many shops have HMS or
> >> Amazon Glue, which is Amazon's version of HMS, or one of the newer
> >> metadata/catalog solutions.)
> >>
> >> One ca

Re: [DISCUSS] Some ideas for Drill 1.21

2022-02-07 Thread Ted Dunning
Another option is to store metadata as data in a distributed data store.
For static resources, that can scale very well. For highly dynamic
resources like conventional databases behind JDBC connections, you can
generally delegate metadata to that layer. Performance for delegated
metadata won't necessarily be great, but those systems are usually either
small (like Postgres or MySQL) or fading away (like Hive).

Focusing metadata and planning to a single node will make query concurrency
much worse (and it's already not good).


On Sun, Feb 6, 2022 at 6:28 PM Paul Rogers  wrote:

> Hi All,
>
> Drill, like all open source projects, exists to serve those that use it. To
> that end, the best contributions come when some company needs a feature
> badly enough that it is worth the effort to develop and contribute a
> solution. That's pretty standard, as along as the contribution is general
> purpose. In fact, I hope everyone using Drill in support of their company
> will contribute enhancements back to Drill. If you maintain your own
> private fork, you're not helping the community that provided you with the
> bulk of the code.
>
> For the info schema, I'm at a loss to guess why this would be slow, unless
> every plugin is going off and scanning some external source. Knowing that
> we have a dozen plugins is not slow. Looking at plugin configs is not slow.
> What could be slow is if you want to know about every possible file in HDFS
> or S3, every database and table in an external DB, etc. In this case, the
> bottleneck is either the external system, or the act of querying a dozen
> different external systems. Perhaps, Charles, you can elaborate on the
> specific scenario you have in mind.
>
> Depending on the core issues, there are various solutions. One solution is
> to cache all the external metadata in Drill. That's what Impala did with
> the Hive Metastore, and it was a mess. I don't expect Drill would do any
> better a job. One reason it was a mess is that, in a production system,
> there is a vast amount of metadata. You end up playing all manner of tricks
> to try to compress it. Since Drill (and Impala) are fully symmetric, each
> node has to hold the entire cache. That is memory that can't be used to run
> queries. So, to gain performance (for metadata) you give up performance (at
> run time.)
>
> One solution is to create a separate metadata cache node. The query goes to
> some Drillbit that acts as Foreman. The Foreman plans the query and
> retrieves the needed metadata from the metadata node. The challenge here is
> that there will be a large amount of metadata transferred, and the next
> thing we know we'll want to cache it in each Drillbit, putting us back
> where we started.
>
> So, one can go another step: shift all query planning to the metadata node
> and have a single planner node. The user connects to any Drillbit as
> Foreman, but that Foreman first talks to the "planner/metadata" node to
> give it SQL and get back a plan. The Foreman then runs the plan as usual.
> (The Foreman runs the root fragment of the plan, which can be compute
> intensive, so we don't want the planner node to also act as the Foreman.)
> The notion here is that the SQL in/plan out is much smaller than the
> metadata that is needed to compute the plan.
>
> The idea about metadata has long been that Drill should provide a metadata
> API. The Drill metastore should be seen as just one of many metadata
> implementations. The Drill metastore is a "starter solution" for those who
> have not already invested in another solution. (Many shops have HMS or
> Amazon Glue, which is Amazon's version of HMS, or one of the newer
> metadata/catalog solutions.)
>
> One can go even further. Consider file and directory pruning in HMS. Every
> tool has to do the exact same thing: given a set of predicates, find the
> directories and files that match. Impala does it. Spark must do it.
> Presto/Trino probably does it. Drill, when operating in Hive/HMS mode must
> do it. Maybe someone has come up with the One True Metadata Pruner and Drill
> can just delegate the task to that external tool, and get back the list of
> directories and files to scan. Far better than building yet another pruner.
> (I think Drill currently has two Parquet metadata pruners, duplicating what
> many other tools have done.)
>
> If we see the source of metadata as plugable, then a shop such as DDR that
> has specific needs (maybe caching those external schemas), can build a
> metadata plugin for that use case. If the solution is general, it can be
> contributed to Drill as another metadata option.
>
> In any case, if we can better understand the specific problem you are
> encountering, we can perhaps offer more specific suggestions.
>
> Thanks,
>
> - Paul
>
> On Sun, Feb 6, 2022 at 8:11 AM Charles Givre  wrote:
>
> > Hi Luoc,
> > Thanks for your concern.  Apache projects are often backed unofficially
> by
> > a company.  Drill was, for years, backed by MapR as evident by 

Re: [DISCUSS] Lombok - friend or foe?

2022-01-22 Thread Ted Dunning
The Lombok story is better in Intellij, possibly because the Lombok devs
use IntelliJ like the majority of devs. Once I knew to install the plugin,
things were at least comprehensible.

But the problem is that it isn't obvious. As a newcomer, you don't know
what you don't know and because Lombok's major effect is code that isn't
there, it isn't obvious where to look.

The point about it not helping that much due to Drill's design (good point,
Paul) is apposite, but I think the naive reader issue is even bigger.

Net, as a person who isn't developing anything for Drill just lately, I
don't think it's a good idea at all.



On Sat, Jan 22, 2022 at 6:37 AM luoc  wrote:

>
> Hi all,
>
> I have a story here. In Oct 2021, I upgraded Eclipse to the latest release
> (2021-09) and then found out that the Lombok dependency had been added to
> the Drill repository, so I installed Lombok (as a new plugin) from the
> Eclipse Marketplace as I used to. Finally, I restarted the IDE and prepared
> to open the Drill project, but it crashed because of issue #2956 <
> https://github.com/projectlombok/lombok/issues/2956>; Lombok was not
> usable until I found a temporary workaround.
>
> I use both Eclipse and IDEA, but I use Eclipse more often. I have no
> objection to the use of Lombok, but suggest the following three points :
>
> 1. Could we use Lombok only in `drill-contrib` module?
>
> 2. Could we agree not to use Lombok in common module?
>
> 3. It is best to update the dev documentation to describe these effects if
> we continue to use Lombok.
>
> In fact, I have the same idea as Paul, more about balancing choices.
>
> Thanks.
>
> > On 22 Jan 2022, at 17:34, Paul Rogers wrote:
> >
> > Hi All,
> >
> > I look at any tool as a cost/benefit tradeoff. If Drill were a typical
> > business app, with lots of "data objects", then the hassle of Lombok
> > might
> > be a net win. However, the nature of Drill is that we have very few data
> > objects. We have lots of Protobuf objects, or Jackson-serialized objects,
> > but not too many data objects of the kind used with object-relational
> > mappers.
> >
> > On the other hand, I had to spend an hour or so trying to figure out why
> > things would not build in Eclipse. Then, more time to figure out how to
> > install the half-finished Lombok plugin for Eclipse and various other
> > fiddling.
> >
> > So, I'd guess, on balance, Lombok has cost, and will continue to cost,
> more
> > time than it saved avoiding a few getter/setter methods. And, I agree
> with
> > Ted, Eclipse (and, I assume IntelliJ), is pretty quick at generating
> those
> > methods.
> >
> > Since Lombok has a cost, and is not a huge win, KISS suggests we avoid
> > adding extra dependencies unnecessarily.
> >
> > That's my 2 cents...
> >
> > - Paul
> >
> >
> >
> > On Fri, Jan 21, 2022 at 8:51 AM Ted Dunning 
> wrote:
> >
> >> A couple of years ago, I had a dev introduce Lombok into some code
> without
> >> me knowing. That let me be a classic naive user.
> >>
> >> The result was total confusion on my part. Sooo much code was being
> >> automagically generated that I couldn't figure out the code and spent a
> lot
> >> of time chasing my tail and very little time looking at the crux of the
> >> code.
> >>
> >> My own personal preference is either
> >>
> >> - use a language like Julia if you want magic. It's fantastic and all to
> >> have amazing stuff and coders expect to see it.
> >>
> >> - use an IDE to generate the boiler plate and put it into its own little
> >> annex in the code with the interesting bits near the top of classes.
> That
> >> lets debuggers and IDEs that don't understand Lombok to function without
> >> impairing readability much. Concurrent with that, use discipline to not
> do
> >> strange things like changing the expected meaning of the boilerplate.
> >>
> >> That's my preference, but I wouldn't want to push that preference very
> >> hard. My own prioritization is on readability of the code by outsiders.
> >>
> >>
> >>
> >>
> >> On Fri, Jan 21, 2022 at 2:25 AM James Turton  wrote:
> >>
> >>> Hi again Devs
> >>>
> >>> This one is simple to describe.  Lombok entered the Drill code base
> this
> >>> year, but not everyone feels that Lombok is appropriate for every code
> >>> base.  To my, fairly limited, understanding the advantage of Lombok is
> >>> that boilerplate code is reduced while the disadvantage is the
> >>> deployment of code generation magic that can have untoward effects on
> >>> build-time tools and IDEs.
> >>>
> >>> So here is a chance to opine on Lombok if you'd like to.  My own
> opinion
> >>> is very near neutral and goes something like "It burned me a bit once,
> >>> but hasn't since, and less boilerplate is nice.  I guess it can stay.
> >>> I hope I don't regret this one day."
> >>>
> >>> Regards
> >>> James
> >>>
> >>
>
>


Re: [DISCUSS] Lombok - friend or foe?

2022-01-21 Thread Ted Dunning
A couple of years ago, I had a dev introduce Lombok into some code without
me knowing. That let me be a classic naive user.

The result was total confusion on my part. Sooo much code was being
automagically generated that I couldn't figure out the code and spent a lot
of time chasing my tail and very little time looking at the crux of the
code.

My own personal preference is either

- use a language like Julia if you want magic. It's fantastic and all to
have amazing stuff and coders expect to see it.

- use an IDE to generate the boiler plate and put it into its own little
annex in the code with the interesting bits near the top of classes. That
lets debuggers and IDEs that don't understand Lombok to function without
impairing readability much. Concurrent with that, use discipline to not do
strange things like changing the expected meaning of the boilerplate.

That's my preference, but I wouldn't want to push that preference very
hard. My own prioritization is on readability of the code by outsiders.
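
As a neutral illustration of the tradeoff (in Python, whose generator ships
in the standard library rather than as build-time magic): a dataclass
generates the same kind of boilerplate Lombok's @Data does for Java.

    from dataclasses import dataclass

    @dataclass
    class Endpoint:
        # __init__, __repr__ and __eq__ are generated here, roughly as Lombok's
        # @Data generates a constructor, toString, equals and hashCode in Java.
        host: str
        port: int = 8047

    a = Endpoint("drillbit-1")
    print(a)                            # Endpoint(host='drillbit-1', port=8047)
    print(a == Endpoint("drillbit-1"))  # True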




On Fri, Jan 21, 2022 at 2:25 AM James Turton  wrote:

> Hi again Devs
>
> This one is simple to describe.  Lombok entered the Drill code base this
> year, but not everyone feels that Lombok is appropriate for every code
> base.  To my, fairly limited, understanding the advantage of Lombok is
> that boilerplate code is reduced while the disadvantage is the
> deployment of code generation magic that can have untoward effects on
> build-time tools and IDEs.
>
> So here is a chance to opine on Lombok if you'd like to.  My own opinion
> is very near neutral and goes something like "It burned me a bit once,
> but hasn't since, and less boilerplate is nice.  I guess it can stay.
> I hope I don't regret this one day."
>
> Regards
> James
>


Re: [DISCUSS] Drill 2 and plug-in organisation

2022-01-17 Thread Ted Dunning
Paul,

I understood your suggestion.  My point is that publishing to Maven central
is a bit of a pain while publishing by posting to Github is nearly
painless.  In particular, because Github inherently produces a relatively
difficult to fake hash for each commit, referring to a dependency using
that hash is relatively safe which saves a lot of agony regarding keys and
trust.

Further, Github or any comparable service provides the same "already
exists" benefit as does Maven.



On Mon, Jan 17, 2022 at 1:30 PM Paul Rogers  wrote:

> Hi Ted,
>
> Well said. Just to be clear, I wasn't suggesting that we use
> Maven-the-build-tool to distribute plugins. Rather, I was simply observing
> that building a global repo is a bit of a project and asked, "what could we
> use that already exists?" The Python repo? No. The Ubuntu/RedHat/whatever
> Linux repos? Maybe. Maven's repo? Why not?
>
> The idea would be that Drill might have a tool that says, "install the
> FooBlaster" plugin. It downloads from a repo (Maven central, say) and puts
> the plugin in the proper plugins directory. In a cluster, either it does
> that on every node, or the work is done as part of preparing a Docker
> container which is then pushed to every node.
>
> The key thought is just to make the problem simpler by avoiding the need
> to create and maintain a Drill-specific repo when we can barely have enough
> resources to keep Drill itself afloat.
>
> None of this can happen, however, unless we clean up the plugin APIs and
> ensure plugins can be built outside of the Drill repo. (That means, say,
> that Drill needs an API library that resides in Maven.)
>
> There are probably many ways this has been done. Anyone know of any good
> examples we can learn from?
>
> Thanks,
>
> - Paul
>
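
For illustration only, the narrow contract Paul describes might reduce to
something like this registry sketch (names hypothetical; Drill's real API
would be Java, Python is used here for brevity):

    from abc import ABC, abstractmethod

    class DrillPlugin(ABC):
        # Hypothetical narrow contract: every extension kind (storage, UDFs,
        # metastore, security) implements this instead of reaching into
        # engine internals through the fragment context.

        @abstractmethod
        def kind(self) -> str: ...

        @abstractmethod
        def start(self, config: dict) -> None: ...

    class PluginRegistry:
        # One registry for all extension kinds, as the refactoring aimed for.

        def __init__(self):
            self._plugins = {}

        def register(self, name: str, plugin: DrillPlugin) -> None:
            plugin.start({})  # plugins boot through the API only
            self._plugins[name] = plugin

        def lookup(self, name: str) -> DrillPlugin:
            return self._plugins[name]
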
>
> On Mon, Jan 17, 2022 at 9:40 AM Ted Dunning  wrote:
>
>>
>> I don't think that Maven is a forced move just because Drill is in Java.
>> It may be a good move, but it isn't a foregone conclusion. For one thing,
>> the conventions that Maven uses are pretty hard-wired and it may be
>> difficult to have a reliable deny-list of known problematic plugins.
>> Publishing to Maven is more of a pain than simply pushing to github.
>>
>> The usability here is paramount both for the ultimate Drill user, but
>> also for the writer of plugins.
>>
>>
>>
>> On Mon, Jan 17, 2022 at 5:06 AM James Turton  wrote:
>>
>>> Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven
>>> is probably a better fit than GitHub for distribution?  If Drillbits can
>>> write to their jars/3rdparty directory then I can imagine Drill gaining
>>> the ability to fetch and install plugins itself without too much
>>> trouble, at least for Drill clusters with Internet access.
>>> "Sideloading" by downloading from Maven and copying manually would
>>> always remain possible.
>>>
>>> @Paul I'll try to get a little time with you to get some ideas about
>>> designing a plugin API.
>>>
>>> On 2022/01/14 23:20, Paul Rogers wrote:
>>> > Hi All,
>>> >
>>> > James raises an important issue: I've noticed that it used to be easy
>>> > to build and test Drill; now it is a struggle, because of the many odd
>>> > external dependencies we have introduced. That acts as a big damper on
>>> > contributions: none of us get paid enough to spend more time fighting
>>> > builds than developing the code...
>>> >
>>> > Ted is right that we need a good way to install plugins. There are two
>>> > parts. Ted is talking about the high-level part: make it easy to point
>>> to
>>> > some repo and use the plugin. Since Drill is Java, the Maven repo
>>> could be
>>> > a good mechanism. In-house stuff is often in an internal repo that does
>>> > whatever Maven needs.
>>> >
>>> > The reason that plugins are in the Drill project now is that Drill's
>>> "API"
>>> > is all of Drill. Plugins can (and some do) access all of Drill through
>>> the
>>> > fragment context. The API to Calcite and other parts of Drill are
>>> wide, and
>>> > tend to be tightly coupled with Drill internals. By contrast, other
>>> tools,
>>> > such as Presto/Trino, have defined very clean APIs that extensions
>>> use. In
>>> > Druid, everything is integrated via Google Guice and an extension can
>>> > replace any part of Druid (though, I'm not convinced that's actually a
>>> good
>>>

Re: [DISCUSS] Drill 2 and plug-in organisation

2022-01-17 Thread Ted Dunning
I don't think that Maven is a forced move just because Drill is in Java. It
may be a good move, but it isn't a foregone conclusion. For one thing, the
conventions that Maven uses are pretty hard-wired and it may be difficult
to have a reliable deny-list of known problematic plugins. Publishing to
Maven is more of a pain than simply pushing to github.

The usability here is paramount both for the ultimate Drill user, but also
for the writer of plugins.



On Mon, Jan 17, 2022 at 5:06 AM James Turton  wrote:

> Thank you Ted and Paul for the feedback.  Since Java is compiled, Maven
> is probably a better fit than GitHub for distribution?  If Drillbits can
> write to their jars/3rdparty directory then I can imagine Drill gaining
> the ability to fetch and install plugins itself without too much
> trouble, at least for Drill clusters with Internet access.
> "Sideloading" by downloading from Maven and copying manually would
> always remain possible.
>
> @Paul I'll try to get a little time with you to get some ideas about
> designing a plugin API.
>
> On 2022/01/14 23:20, Paul Rogers wrote:
> > Hi All,
> >
> > James raises an important issue: I've noticed that it used to be easy to
> > build and test Drill; now it is a struggle, because of the many odd
> > external dependencies we have introduced. That acts as a big damper on
> > contributions: none of us get paid enough to spend more time fighting
> > builds than developing the code...
> >
> > Ted is right that we need a good way to install plugins. There are two
> > parts. Ted is talking about the high-level part: make it easy to point to
> > some repo and use the plugin. Since Drill is Java, the Maven repo could
> be
> > a good mechanism. In-house stuff is often in an internal repo that does
> > whatever Maven needs.
> >
> > The reason that plugins are in the Drill project now is that Drill's
> "API"
> > is all of Drill. Plugins can (and some do) access all of Drill through the
> > fragment context. The API to Calcite and other parts of Drill are wide,
> and
> > tend to be tightly coupled with Drill internals. By contrast, other
> tools,
> > such as Presto/Trino, have defined very clean APIs that extensions use.
> In
> > Druid, everything is integrated via Google Guice and an extension can
> > replace any part of Druid (though, I'm not convinced that's actually a
> good
> > idea.) I'm sure there are others we can learn from.
> >
> > So, we need to define a plugin API for Drill. I started down that route a
> > while back: the first step was to refactor the plugin registry so it is
> > ready for extensions. The idea was to use the same mechanism for all
> kinds
> > of extensions (security, UDFs, metastore, etc.) The next step was to
> build
> > something that roughly followed Presto, but that kind of stalled out.
> >
> > In terms of ordering, we'd first need to define the plugin API. Then, we
> > can shift plugins to use that. Once that is done, we can move plugins to
> > separate projects. (The metastore implementation can also move, if we
> > want.) Finally, figure out a solution for Ted's suggestion to make it
> easy
> > to grab new extensions. Drill is distributed, so adding a new plugin has
> to
> > happen on all nodes, which is a bit more complex than the typical
> > Julia/Python/R kind of extension.
> >
> > The reason we're where we're at is that it is the path of least
> resistance.
> > Creating a good extension mechanism is hard, but valuable, as Ted noted.
> >
> > Thanks,
> >
> > - Paul
> >
> > On Thu, Jan 13, 2022 at 10:18 PM Ted Dunning
> wrote:
> >
> >> The bigger reason for a separate plug-in world is the enhancement of
> >> community.
> >>
> >> I would recommend looking at the Julia community for examples of
> >> effective ways to drive plug in structure.
> >>
> >> At the core, for any pure julia package, you can simply add a package by
> >> referring to the github repository where the package is stored. For
> >> packages that are "registered" (i.e. a path and a checksum is recorded
> in a
> >> well known data store), you can add a package by simply naming it
> without
> >> knowing the path.  All such plugins are tested by the authors and the
> >> project records all dependencies with version constraints so that
> cascading
> >> additions are easy. The community leaders have made tooling available so
> >> that you can test your package against a range of versions of Julia by
> >> pretty simple (to use) Github actions.
> >>
>

Re: Re: [DISCUSS] Per User Access Controls

2022-01-13 Thread Ted Dunning
GRANT and REVOKE implicitly assumes that the database is king of access
control. That works when the database owns the data.

In the modern world where data storage is separated from query, it is truly
painful to have to manage permissions for each analysis and each query tool
and nearly impossible to keep them synchronized. Likewise, it is impossible
to get plugins for systems like Ranger for all possible tools and
impossible for Ranger to even understand all tools.

For instance, suppose you have S3 data, files and database. Each has
permissions already defined. Now you have users who want to use Drill (for
SQL processing), Jupyter notebooks with Python for data engineering, Julia
with Pluto notebooks for numerical work and batchwise Spark jobs all for
building data pipelines across all the kinds of data. Neither Python, Julia
nor Spark can really be protected by Ranger. All assume file permissions or
S3 IAMs do that job.



On Thu, Jan 13, 2022 at 10:49 PM Z0ltrix  wrote:

> Hi @All,
>
> For me, using Drill with a kerberized Hadoop cluster and Ranger as the
> central access-control system, I would love to have a Ranger plugin for
> Drill, but I would assume a lot of Drill users just spin up a cluster in
> front of S3 or Azure.
>
> So why not use a generic approach with GRANT and REVOKE for users and
> groups on specific workspaces, or at least storage plugins?
>
> With that, an admin can control which users and groups can access all
> the storage plugins we have, no matter whether the underlying plugin has
> such a system.
>
>
> Maybe we could use the Metastore to store such information?
>
> Regards,
> Christian
>
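
The generic check Christian describes could reduce to a lookup over stored
grants; a sketch of the idea (data layout hypothetical, not the actual
Drill Metastore API):

    # grants: (principal, resource) -> privileges, as they might be stored in
    # the Metastore; principals are users ("u:") or groups ("g:").
    grants = {
        ("g:analysts", "storage:s3"): {"SELECT"},
        ("u:christian", "workspace:s3.reports"): {"SELECT", "CREATE"},
    }

    def allowed(user, groups, resource, privilege):
        principals = ["u:" + user] + ["g:" + g for g in groups]
        return any(privilege in grants.get((p, resource), set())
                   for p in principals)

    print(allowed("christian", ["analysts"], "storage:s3", "SELECT"))  # True
    print(allowed("bob", [], "storage:s3", "SELECT"))                  # False
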
> ‐‐‐ Original Message ‐‐‐
>
> Paul Rogers wrote on Thursday, 13 January 2022 at 23:40:
>
> > Hey All,
> >
> > Other members of the Hadoop Ecosystem rely on external systems to handle
> > permissions: Ranger or Sentry. There is probably something different in
> > the AWS world.
> >
> > As you look into security, you'll see that you need to maintain
> > permissions on many entities: files, connections, etc. You need different
> > permissions: read, write, create, etc. In larger groups of people, you
> > need roles: admin role, sales analyst role, production engineer role.
> > Users map to roles, and roles take permissions.
> >
> > Creating this just for Drill is not effective: no one wants to learn a
> > Drill "Security Store" any more than folks want to learn the "Drill
> > metastore". Drill is seldom the only tool in a shop: people want to set
> > permissions in one place, not in each tool. So, we should integrate with
> > existing tools.
> >
> > Drill should provide an API, and be prepared to enforce rules. Drill
> > defines the entities that can be secured, and the available permissions.
> > Then, it is up to an external system to provide user identity, take
> > tuples of (user, resource, permission) and return a boolean of whether
> > that user is authorized or not. MapR, Pam, Hadoop and other systems would
> > be implemented on top of the Drill permissions API, as would whatever
> > need you happen to have.
> >
> > Thanks,
> >
> > - Paul
> >
> > On Thu, Jan 13, 2022 at 12:32 PM Curtis Lambert cur...@datadistillr.com
> > wrote:
> >
> > > This is what we are handling with Vault outside of Drill, combined with
> > > aliasing. James is tracking some of what you've been finding with the
> > > credential store but even then we want the single source of auth. We
> > > can chat with James on the next Drill stand up (and anyone else who
> > > wants to feel the pain).
> > >
> > > Curtis Lambert
> > > CTO
> > > Email: cur...@datadistillr.com
> > > Phone: 706-402-0249
> > > LinkedIn: https://www.linkedin.com/in/curtis-lambert-2009b2141/
> > > Calendly: https://calendly.com/curtis283/generic-zoom
> > > https://www.datadistillr.com/
> > >
> > > On Thu, Jan 13, 2022 at 3:29 PM Charles Givre cgi...@gmail.com wrote:
> > >
> > > > Hello all,
> > > >
> > > > One of the issues we've been dancing around is having per-user access
> > > > controls in Drill. As Drill was originally built around the Hadoop
> > > > ecosystem, the Hadoop based connections make use of user-impersonation
> > > > for per user access controls. However, a rather glaring deficiency is
> > > > the lack of per-user access controls for connections like JDBC, Mongo,
> > > > Splunk etc. Recently when I was working on the OAuth pull request, it
> > > > occurred to me that we might be able to slightly extend the credential
> > > > provider

Re: [DISCUSS] Drill 2 and plug-in organisation

2022-01-13 Thread Ted Dunning
The bigger reason for a separate plug-in world is the enhancement of
community.

I would recommend looking at the Julia community for examples of
effective ways to drive plug in structure.

At the core, for any pure julia package, you can simply add a package by
referring to the github repository where the package is stored. For
packages that are "registered" (i.e. a path and a checksum is recorded in a
well known data store), you can add a package by simply naming it without
knowing the path.  All such plugins are tested by the authors and the
project records all dependencies with version constraints so that cascading
additions are easy. The community leaders have made tooling available so
that you can test your package against a range of versions of Julia by
pretty simple (to use) Github actions.

The result has been an absolute explosion in the number of pure Julia
packages.

For packages that include C or Fortran (or whatever) code, there is some
amazing tooling available that lets you record a build process on any of
the supported platforms (Linux, LinuxArm, 32 or 64 bit, windows, BSD, OSX
and so on). WHen you register such a package, it is automagically built on
all the platforms you indicate and the binary results are checked into a
central repository known as Yggdrasil.

All of these registration events for different packages are recorded in a
central registry as I mentioned. That registry is recorded in Github as
well which makes it easy to propagate changes.



On Thu, Jan 13, 2022 at 8:45 PM James Turton  wrote:

> Hello dev community
>
> Discussions about reorganising the Drill source code to better position
> the project to support plug-ins for the "long tail" of weird and
> wonderful systems and data formats have been coming up here and there
> for a few months, e.g. in https://github.com/apache/drill/pull/2359.
>
> A view which I personally share is that adding too large a number and
> variety of plug-ins to the main tree would create a lethal maintenance
> burden for developers working there and lead down a road of accumulating
> technical debt.  The Maven tricks we must employ to harmonise the
> growing set of dependencies of the main tree to keep it buildable are
> already enough, as is the size of our distributable and the count of
> open bug reports.
>
>
> Thus, the idea of splitting out "/contrib" into a new
> apache/drill-contrib repo after selecting a subset of plugins to remain
> in apache/drill.  I'll now volunteer a set of criteria to decide whether
> a plug-in should live in this notional apache/drill-contrib.
>
>  1. The plug-in queries an unstructured data format (even if it only
> reads metadata fields) e.g. Image format plug-in.
>  2. The plug-in queries a data format that was designed for human
> consumption e.g. Excel format plug-in.
>  3. The plug-in cannot be expected to run with speed and reliability
> comparable to querying structured data on the local network e.g.
> Dropbox storage plugin.
>  4. The plug-in queries an obscure system or format e.g. we receive a
> plug-in for some data format used only on old Cray supercomputers.
>  5. The plug-in can for some reason not be well supported by the Drill
> devs e.g. it has a JNI dependency on some difficult native libs.
>
>
> Any one of those suggests that an apache/drill-contrib is the better
> home to me, but what is your view?  Would we apply significantly more
> relaxed standards when reviewing PRs to apache/drill-contrib?  Would we
> tag, build and test apache/drill-contrib with every release of
> apache/drill, or would it run on its own schedule, perhaps with users
> downloading builds made continuously from snapshots of HEAD?
>
>
> Regards
> James
>
>
>


Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-04 Thread Ted Dunning
Paul,

That could be made to work. Having worked in both styles, I suspect the
flow won't be as simple as with an open Google doc.

I will go either way.

On Tue, Jan 4, 2022 at 5:29 PM Paul Rogers  wrote:

> Hi Ted,
>
> I like where you're going with how to manage the discussion.
>
> Here's a trick that I saw someone do recently. The design/discussion as a
> PR.
> Comments are just code review comments, tagged to a specific line. The "er,
> never mind"
> aspect that Ted talks about is handled by pushing a new version of the doc
> (if the doc contains the error) or editing a comment (if the comment had
> the
> error.) The history of all changes is in the commit history.
>
> As we go off on tangents (Arrow-based API? Modern way to do code gen?),
> these can
> be handled as new documents.
>
> All we need is a place to put this stuff. A "docs" or "design" directory
> within the
> source tree?
>
> Thanks,
>
> - Paul
>
> On Tue, Jan 4, 2022 at 11:15 AM Ted Dunning  wrote:
>
> > Exactly. I very much had in mind an "On the other hand" kind of document.
> >
> > The super benefit of a non-threaded presentation is that if I advocate
> > something stupid due to an oversight on my part, I can go back and edit
> > away the stupid statement (since it shouldn't be part of the consensus)
> and
> > tag anybody who might have responded. I might even leave a note saying
> "You
> > might think X, but that isn't so because of Y" to help later readers.
> >
> > That is all very, very hard to do in threaded discussions.
> >
> >
> >
> > On Tue, Jan 4, 2022 at 9:37 AM James Turton  wrote:
> >
> > > Ah, and I see now that you said as much already.  So a collaboratively
> > > edited document?  Wiki pages containing a variety of independent views
> > > might turn out something like this collection I suppose
> > >
> > > https://wiki.c2.com/?GarbageCollection
> > >
> > > which isn't bad IMHO.
> > >
> > > On 2022/01/04 16:42, Ted Dunning wrote:
> > > > Threading is exactly what I would want to avoid.
> > > >
> > > >
> > > >
> > > > On Tue, Jan 4, 2022, 3:58 AM James Turton  > > > <mailto:dz...@apache.org>> wrote:
> > > >
> > > > Hi all
> > > >
> > > > GitHub Issues allow a conversation thread with rich formatting
> so I
> > > > propose that we use them for meaty topics like this.  Please use
> > the
> > > > "Feature Request" issue template for this purpose, and set the
> > > issue's
> > > > Project field to "Drill 2.0"[1], said project having recently
> been
> > > > created by Charles.  I am busy transcribing the current
> discussion
> > > from
> > > > the mailing list and a GitHub PR to just such a new feature
> request
> > > at
> > > >
> > > > https://github.com/apache/drill/issues/2421
> > > > <https://github.com/apache/drill/issues/2421>
> > > >
> > > > James
> > > >
> > > > [1] https://github.com/apache/drill/projects/1
> > > > <https://github.com/apache/drill/projects/1>
> > > >
> > > > On 2022/01/04 09:49, Ted Dunning wrote:
> > > >  > I wonder if there isn't a better place for this discussion?
> > > >  >
> > > >  > As you point out, there are many threads and many of the
> points
> > > > are rather
> > > >  > contentious technically. That will make them even harder to
> > > > follow in an
> > > >  > email thread.
> > > >  >
> > > >  > We could just use the wiki and format the text in the form of
> > > > questions
> > > >  > with alternative positions.
> > > >  >
> > > >  > Or we could use an open google document with similar form.
> > > >  >
> > > >  > What's the preference here?
> > > >  >
> > > >
> > >
> >
>


Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-04 Thread Ted Dunning
Exactly. I very much had in mind an "On the other hand" kind of document.

The super benefit of a non-threaded presentation is that if I advocate
something stupid due to an oversight on my part, I can go back and edit
away the stupid statement (since it shouldn't be part of the consensus) and
tag anybody who might have responded. I might even leave a note saying "You
might think X, but that isn't so because of Y" to help later readers.

That is all very, very hard to do in threaded discussions.



On Tue, Jan 4, 2022 at 9:37 AM James Turton  wrote:

> Ah, and I see now that you said as much already.  So a collaboratively
> edited document?  Wiki pages containing a variety of independent views
> might turn out something like this collection I suppose
>
> https://wiki.c2.com/?GarbageCollection
>
> which isn't bad IMHO.
>
> On 2022/01/04 16:42, Ted Dunning wrote:
> > Threading is exactly what I would want to avoid.
> >
> >
> >
> > On Tue, Jan 4, 2022, 3:58 AM James Turton  > <mailto:dz...@apache.org>> wrote:
> >
> > Hi all
> >
> > GitHub Issues allow a conversation thread with rich formatting so I
> > propose that we use them for meaty topics like this.  Please use the
> > "Feature Request" issue template for this purpose, and set the
> issue's
> > Project field to "Drill 2.0"[1], said project having recently been
> > created by Charles.  I am busy transcribing the current discussion
> from
> > the mailing list and a GitHub PR to just such a new feature request
> at
> >
> > https://github.com/apache/drill/issues/2421
> > <https://github.com/apache/drill/issues/2421>
> >
> > James
> >
> > [1] https://github.com/apache/drill/projects/1
> > <https://github.com/apache/drill/projects/1>
> >
> > On 2022/01/04 09:49, Ted Dunning wrote:
> >  > I wonder if there isn't a better place for this discussion?
> >  >
> >  > As you point out, there are many threads and many of the points
> > are rather
> >  > contentious technically. That will make them even harder to
> > follow in an
> >  > email thread.
> >  >
> >  > We could just use the wiki and format the text in the form of
> > questions
> >  > with alternative positions.
> >  >
> >  > Or we could use an open google document with similar form.
> >  >
> >  > What's the preference here?
> >  >
> >
>


Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-04 Thread Ted Dunning
Threading is exactly what I would want to avoid.



On Tue, Jan 4, 2022, 3:58 AM James Turton  wrote:

> Hi all
>
> GitHub Issues allow a conversation thread with rich formatting so I
> propose that we use them for meaty topics like this.  Please use the
> "Feature Request" issue template for this purpose, and set the issue's
> Project field to "Drill 2.0"[1], said project having recently been
> created by Charles.  I am busy transcribing the current discussion from
> the mailing list and a GitHub PR to just such a new feature request at
>
> https://github.com/apache/drill/issues/2421
>
> James
>
> [1] https://github.com/apache/drill/projects/1
>
> On 2022/01/04 09:49, Ted Dunning wrote:
> > I wonder if there isn't a better place for this discussion?
> >
> > As you point out, there are many threads and many of the points are
> rather
> > contentious technically. That will make them even harder to follow in an
> > email thread.
> >
> > We could just use the wiki and format the text in the form of questions
> > with alternative positions.
> >
> > Or we could use an open google document with similar form.
> >
> > What's the preference here?
> >
>
>


Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-03 Thread Ted Dunning
I wonder if there isn't a better place for this discussion?

As you point out, there are many threads and many of the points are rather
contentious technically. That will make them even harder to follow in an
email thread.

We could just use the wiki and format the text in the form of questions
with alternative positions.

Or we could use an open google document with similar form.

What's the preference here?



On Mon, Jan 3, 2022 at 7:34 PM Paul Rogers  wrote:

> Hi Charles,
>
> The material is rather dense and benefits from the Github formatting. To
> preserve it, perhaps we can copy it to a subpage of the Drill 2.0 wiki
> page.
>
> For now, the link to the discussion is [1]. Since the Wiki is not good for
> discussions, let's have that discussion here (if anyone is up to tackling
> such a weighty subject.)
>
> Thanks,
>
> - Paul
>
> [1] https://github.com/apache/drill/pull/2412
>
> On Mon, Jan 3, 2022 at 5:15 PM Charles Givre  wrote:
>
> > @Paul,
> > Do you mind if I copy the contents of your response to DRILL-8088 to this
> > thread?   There's a lot of good info there, and I'd hate to see it get
> lost.
> > -- C
> >
> > > On Jan 3, 2022, at 7:41 PM, Paul Rogers  wrote:
> > >
> > > Hi All,
> > >
> > > Thanks Charles for dredging up that old discussion, your memory is
> better
> > > than mine! And, thanks Ted for that summary of MapR history. As one of
> > the
> > > "replacement crew" brought in after the original folks left, your
> > > description is consistent with my memory of events. Moreover, as we
> > looked
> > > at what was needed to run Drill in production, an Arrow port was far
> down
> > > on the list: it would not have solved actual customer problems.
> > >
> > > Before we get too excited about Arrow, I think we should have a
> > discussion
> > > about what we want in an internal storage format. I added a long
> (sorry)
> > > set of comments in that PR that Charles mentioned that tries to debunk
> > the
> > > myths that have grown up around using a columnar format as the internal
> > > representation for a query engine. (Columnar is great for storage.) The
> > > note presents the many issues we've encountered over the years that
> have
> > > caused us to layer ever more code on top of vectors to solve various
> > > problems. It also highlights a distributed-systems problem which
> vectors
> > > make far worse.
> > >
> > > Arrow is meant to be portable, as Ted discussed, but it is still
> > columnar,
> > > and this is the source of endless problems in an execution engine. So,
> we
> > > want to ask, what is the optimal format for what Drill actually does?
> I'm
> > > now of the opinion that Drill might actually better benefit  from a
> > > row-based format, similar to what Impala uses. The notes even paint a
> > path
> > > forward.
> > >
> > > Ted's description of the goal for Demio suggests that Arrow might be
> the
> > > right answer for that market. Drill, however, tends to be used to query
> > > myriad data sources at scale and as a "query integrator" across
> systems.
> > > This use case has different needs, which may be better served with a
> > > row-based format.
> > >
> > > The upshot is that "value vectors vs. Arrow" is the wrong place to
> start
> > > the discussion. The right place is "what does our many years of
> > experience
> > > with Drill suggest is the most efficient format for how Drill is
> actually
> > > used?"
> > >
> > > Note that Drill could have an Arrow-based API independent of the
> internal
> > > format. The quote from Charles explains how we could do that.
> > >
> > > Thanks,
> > >
> > > - Paul
> > >
> > > On Mon, Jan 3, 2022 at 12:54 PM Ted Dunning 
> > wrote:
> > >
> > >> Christian,
> > >>
> > >> Your thoughts are very helpful. I find Arrow very nice (I use it in
> > Agstack
> > >> with Julia and Python).
> > >>
> > >> I don't think anybody is saying that Drill wouldn't be well set with a
> > >> switch to Arrow or even just interfaces to Arrow. But it is a lot of
> > work
> > >> to make it all happen.
> > >>
> > >>
> > >>
> > >> On Mon, Jan 3, 2022 at 11:37 AM Z0ltrix 
> wrote:
> > >>
> > >>> Hi Charles, Ted, and the others here,
> > >>

Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-03 Thread Ted Dunning
Christian,

Your thoughts are very helpful. I find Arrow very nice (I use it in Agstack
with Julia and Python).

I don't think anybody is saying that Drill wouldn't be well set with a
switch to Arrow or even just interfaces to Arrow. But it is a lot of work
to make it all happen.



On Mon, Jan 3, 2022 at 11:37 AM Z0ltrix  wrote:

> Hi Charles, Ted, and the others here,
>
> it is very interesting to hear the evolution of Drill, Dremio and Arrow in
> that context, and thank you Charles for restarting that discussion.
>
> I think, and James mentioned this in the PR as well, that Drill could
> benefit from the continuous progress the Arrow project has made since its
> separation from Drill. The Arrow community seems to be large, so I assume
> the improvements, new features, etc. will keep coming, but I don't have
> enough experience with Drill internals to judge how much refactoring this
> would entail.
>
> In addition to that, I'm not aware of Arrow's current roadmap and whether
> it would fit into Drill's roadmap. Maybe Arrow would go in a different
> direction than Drill, and what should we do if Drill is bound to Arrow then?
>
> On the other hand, Arrow could help Drill to a wider adoption with clients
> like pyarrow, Arrow Flight, various other programming languages, etc., and
> (I'm not sure about that) maybe there is a performance benefit if Drill
> uses Arrow to read data from HDFS (for example), works with it during
> execution, and hands the vectors directly to my Python (for example)
> program via Arrow Flight so that I can play around with Pandas, etc.
>
> Just some thoughts I have, since I have used Dremio with pyarrow and Drill
> with ODBC connections.
>
> Regards
> Christian
>  Original Message 
> On Jan 3, 2022, 20:08, Charles Givre wrote:
>
>
> Thanks Ted for the perspective! I had always wished to be a "fly on the
> wall" in those conversations. :-)
> -- C
>
> > On Jan 3, 2022, at 11:00 AM, Charles Givre  wrote:
> >
> > Hello all,
> > There was a discussion in a recently closed PR [1] with a discussion
> between z0ltrix, James Turton and a few others about integrating Drill with
> Apache Arrow and wondering why it was never done. I'd like to share my
> perspective as someone who has been around Drill for some time but also as
> someone who never worked for MapR or Dremio. This just represents my
> understanding of events as an outsider, and I could be wrong about some or
> all of this. Please forgive (or correct) any inaccuracies.
> >
> > When I first learned of Arrow and the idea of integrating Arrow with
> Drill, the thing that interested me the most was the ability to move data
> between platforms without having to serialize/deserialize the data. From my
> understanding, MapR did some research and didn't find a significant
> performance advantage and hence didn't really pursue the integration. The
> other side of it was that it would require a significant amount of work to
> refactor major parts of Drill.
> >
> > I don't know the internal politics, but this was one of the major points
> of divergence between Dremio and Drill.
> >
> > With that said, there was a renewed discussion on the list [2] where
> Paul Rogers proposed what he described as a "Crude but Effective" approach
> to an Arrow integration.
> >
> > This is in the email link but here was a part of Paul's email:
> >
> >> Charles, just brainstorming a bit, I think the easiest way to start is
> to create a simple, stand-alone server that speaks Arrow to the client, and
> uses the native Drill client to speak to Drill. The native Drill client
> exposes Drill value vectors. One trick would be to convert Drill vectors to
> the Arrow format. I think that data vectors are the same format. Possibly
> offset vectors. I think Arrow went its own way with null-value (Drill's
> is-set) vectors. So, some conversion might be a no-op, others might need to
> rewrite a vector. Good thing, this is purely at the vector level, so would
> be easy to write. The next issue is the one that Parth has long pointed
> out: Drill and Arrow each have their own memory allocators. How could we
> share a data vector between the two? The simplest initial solution is just
> to copy the data from Drill to Arrow. Slow, but transparent to the client.
> >>
> >> A crude first-approximation of the development steps:
> >> 1. Create the client shell server.
> >> 2. Implement the Arrow client protocol. Need some way to accept a query
> and return batches of results.
> >> 3. Forward the query to Drill using the native Drill client.
> >> 4. As a first pass, copy vectors from Drill to Arrow and return them to
> the client.
> >> 5. Then, solve that memory allocator problem to pass data without
> copying.
> >
> > One point that Paul made was that these pieces are fairly discrete and
> could be implemented without refactoring major components of Drill. Of
> course, this could be something 

Re: [DISCUSS] Restarting the Arrow Conversation

2022-01-03 Thread Ted Dunning
As a little bit of perspective from somebody who *was* at MapR at the time,
here are my recollections.

Arrow is pretty much the value vectors from Drill with some lessons learned
and all dependencies removed so that Arrow can be consumed separately from
Drill.

The spinout of the Dremio team didn't happen because of the lack of
integration with Arrow ... it was more the other way around ... because a
significant chunk of the Drill team left to form Dremio, the driving force
that could have pushed for integration just wasn't around any more because
they were off doing Dremio and weren't working on Drill any more very much.
The motive for the spinout had mostly to do with the fact that Tomer and
Jacques recognized the opportunity to build a largely in-memory analytical
engine based on zero serialization techniques and also recognized that this
could never be a priority for MapR because it was outside the center of
mass there. Once the Dremio team was out, though, they had a huge need for
interoperability with systems like Spark and Cassandra, and they needed to
not impose all of Drill as a dependency if they wanted these other systems
to take on Arrow.

This history doesn't really impact the merits or methods of integrating
present-day Drill with Arrow, but it is nice to get the story the right way
around.



On Mon, Jan 3, 2022 at 8:00 AM Charles Givre  wrote:

> Hello all,
> There was a discussion in a recently closed PR [1] with a discussion
> between z0ltrix, James Turton and a few others about integrating Drill with
> Apache Arrow and wondering why it was never done.  I'd like to share my
> perspective as someone who has been around Drill for some time but also as
> someone who never worked for MapR or Dremio.  This just represents my
> understanding of events as an outsider, and I could be wrong about some or
> all of this.   Please forgive (or correct) any inaccuracies.
>
> When I first learned of Arrow and the idea of integrating Arrow with
> Drill, the thing that interested me the most was the ability to move data
> between platforms without having to serialize/deserialize the data.  From
> my understanding, MapR did some research and didn't find a significant
> performance advantage and hence didn't really pursue the integration.  The
> other side of it was that it would require a significant amount of work to
> refactor major parts of Drill.
>
> I don't know the internal politics, but this was one of the major points
> of divergence between Dremio and Drill.
>
> With that said, there was a renewed discussion on the list [2] where Paul
> Rogers proposed what he described as a "Crude but Effective" approach to an
> Arrow integration.
>
> This is in the email link but here was a part of Paul's email:
>
> > Charles, just brainstorming a bit, I think the easiest way to start is
> to create a simple, stand-alone server that speaks Arrow to the client, and
> uses the native Drill client to speak to Drill. The native Drill client
> exposes Drill value vectors. One trick would be to convert Drill vectors to
> the Arrow format. I think that data vectors are the same format. Possibly
> offset vectors. I think Arrow went its own way with null-value (Drill's
> is-set) vectors. So, some conversion might be a no-op, others might need to
> rewrite a vector. Good thing, this is purely at the vector level, so would
> be easy to write. The next issue is the one that Parth has long pointed
> out: Drill and Arrow each have their own memory allocators. How could we
> share a data vector between the two? The simplest initial solution is just
> to copy the data from Drill to Arrow. Slow, but transparent to the client.
> >
> > A crude first-approximation of the development steps:
> > 1. Create the client shell server.
> > 2. Implement the Arrow client protocol. Need some way to accept a query
> and return batches of results.
> > 3. Forward the query to Drill using the native Drill client.
> > 4. As a first pass, copy vectors from Drill to Arrow and return them to
> the client.
> > 5. Then, solve that memory allocator problem to pass data without
> copying.
>
> One point that Paul made was that these pieces are fairly discrete and
> could be implemented without refactoring major components of Drill.  Of
> course, this could be something for Drill 2.0.  At a minimum, could we take
> the conversation off of the PR and put it in the email list? ;-)
>
> Let's discuss... All ideas are welcome!
>
> Best,
> -- C
>
>
> [1]: https://github.com/apache/drill/pull/2412 <
> https://github.com/apache/drill/pull/2412>
> [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l <
> https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l>
>
>
>
>


Re: Drill Wiki Access

2021-11-24 Thread Ted Dunning
The normal open source action for preventing problems like this is to leave
permissions open and monitor all changes.

Then respond to problems as they (rarely) arise.


On Wed, Nov 24, 2021 at 8:33 AM Rumar, Maksym  wrote:

> "Any destructive action could be reverted because the wiki is a
> repository itself"
> ​Yes, but if we leave the wiki open, then, how we can be sure if the wiki
> wouldn't be destructed? For example, if would be added or removed some
> lines deliberately or not. How and when we will know about it? It already
> sounds wrong: "to verify changes after it's applying".
>
> And a question is even not in spammers. Of course, they are dangerous, but
> first of all, we talk about the quality of documentation. For example,
> someone decided to document, that Drill can't do something, but in reality,
> it does. Then, someone other, when reading this wrong information on the
> official source of the project will believe in it. I think it's wrong and
> it's dangerous.
>
> Regards,
> Maksym
>
> 
> From: James Turton 
> Sent: 24 November 2021 18:04
> To: dev@drill.apache.org ; Rumar, Maksym <
> maksym.ru...@hpe.com>
> Subject: Re: Drill Wiki Access
>
> Hi Maksym
>
> This is a good point, thank you.  I wonder if we could restrict the wiki
> to "collaborators", or whatever the relevant Github concept is for those
> who are able to push to the code repo.  I also wonder if there is any
> need.  Any destructive action could be reverted because the wiki is a
> repository itself.  I wonder if desperate spammers go so far as to
> deface open source project wikis; it wouldn't surprise me.
>
> James
>
> On 2021/11/24 17:57, Rumar, Maksym wrote:
> > Hi all,
> >
> > I just found that I can add and edit any page on the Drill wiki. So it
> means that anybody can add or remove anything they would like. What do you
> think about it? Should the project have documentation that is open to all?
> >
> > As far as I know, GitHub doesn't support pull requests for the wiki
> repository, so there's a question of what the process for changing the
> Drill wiki should look like. What are your thoughts about it?
> >
> > Regards,
> > Maksym
> >
>
>


Re: [DISCUSS] Refactoring Drill's CSV (Text) Reader

2021-11-17 Thread Ted Dunning
I think that these would be significant improvements.

The current behavior is pretty painful on average. Better defaults and just
a bit of deduction could pay off big. I even think that the presence of
headers might be pretty reliably inferred.



On Wed, Nov 17, 2021 at 4:31 PM Charles Givre  wrote:

> Hello Drill Community,
> I would like to put forward some thoughts I've had relating to the CSV
> reader in Drill.  I would like to propose a few changes which could
> actually be breaking changes, so I wanted to see if there are any strongly
> held opinions in the community.  Here goes:
>
> The Problems:
> 1.  The default behavior for Drill is to leave the extractColumnHeaders
> option as false.  When a user queries a CSV file this way, the results are
> returned in a list of columns called columns.  Thus if a user wants the
> first column, they would project columns[0].  I have never been a fan of
> this behavior.  Even though Drill ships with the csvh file extension which
> enables the header extraction, this is not a commonly used file format.
> Furthermore, the returned results (the column list) does not work well with
> BI tools.
>
> 2.  The CSV reader does not attempt to do any kind of data type discovery.
>
> Proposed Changes:
> The overall goal is to make it easier to query CSV data and also to make
> the behavior more consistent across format plugins.
> 1.  Change the default behavior and set the extractHeaders to true.
> 2.  Other formats, like the excel reader, read tables directly into
> columns.  If the header is not known, Drill assigns a name of field_n.  I
> would propose replacing the `columns` array with a model similar to the
> Excel reader.
> 3.  Implement schema discovery (data types) with an allTextMode option
> similar to the JSON reader.  When the allTextMode is disabled, the CSV
> reader would attempt to infer data types.
>
> Since there are some breaking changes here, I'd like to ask if people have
> any strong feelings on this topic or suggestions.
> Thanks!,
> -- C
>
>
>
>
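
For readers who haven't hit this, a minimal sketch of the difference (the
file and field names here are made up for illustration):

```
-- With header extraction disabled (today's default), each row comes back
-- as a single array column named `columns`:
SELECT columns[0] AS first_name, columns[1] AS last_name
FROM dfs.tmp.`people.csv`;

-- With header extraction enabled (e.g. via the csvh extension), fields are
-- addressable by name, which is also what BI tools expect:
SELECT first_name, last_name
FROM dfs.tmp.`people.csvh`;
```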


Re: webUI Bug

2021-11-05 Thread Ted Dunning
Nathan,

I think you pasted some images into your message. They didn't make it
through to the mailing list.

Can you select text instead of images? Or push the images to a public place
like imgur?


On Thu, Nov 4, 2021 at 11:31 PM Caballero, Nathan (Volpe)
 wrote:

> Hello,
>
>
>
> I continue to experience a bug on the webUI where, if any number is filled
> into the ‘Limit results to’ field, the following error appears. The query
> runs perfectly as long as any value filled into this field is deleted.
> This bug occurs whether the field is ‘checked’ or ‘unchecked’.
>
>
>
> Thank You
>
>
>
>


Re: Parquet compression codecs

2021-09-29 Thread Ted Dunning
A blog is a great idea.

I am curious about how much compression costs.


On Wed, Sep 29, 2021 at 5:37 AM luoc  wrote:

>
> James, you are doing fine.
> Is it possible to post a new blog in the website for this?
>
> > On Sep 29, 2021, at 20:27, James Turton  wrote:
> >
> > Hi all
> >
> > We've got support for reading and writing using additional Parquet
> compression codecs in master now.  Here are the footprints of a 25M record
> dataset compressed by Drill with different codecs.
> >
> > | Codec  | Size on disk (Mb) |
> > | -- | - |
> > | brotli |   87  |
> > | gzip   |   80  |
> > | lz4|  100.6|
> > | lzo|  100.8|
> > | snappy |  192  |
> > | zstd   |   85  |
> > | none   | 2152  |
> >
> > I haven't made measurements of (de)compression speed differences myself
> but there are many such benchmarks around on the web, and the differences
> can be big *if* you've got a workload that is CPU bound by
> (de)compression.  Beyond that there are the usual considerations like
> better utilisation of the OS page cache by the higher compression ratio
> codecs, less I/O when data must come from disk, etc.  Zstd is probably the
> one I'll be putting into `store.parquet.compression` myself at this point.
> >
> > Happy Drilling!
> > James
>
>
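
For anyone wanting to try this, switching the codec is a one-line option
change; a minimal sketch (the table names are made up, the option is the
one James mentions above):

```
-- Per-session; use ALTER SYSTEM instead to change the cluster-wide default:
ALTER SESSION SET `store.parquet.compression` = 'zstd';

-- Subsequent CTAS statements then write zstd-compressed Parquet:
CREATE TABLE dfs.tmp.`events_zstd` AS
SELECT * FROM dfs.tmp.`events`;
```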


Re: New Docker images published automatically

2021-09-20 Thread Ted Dunning
This is great news.

Makes me think that these might be the best way to try Drill out as well,
especially where containers have low overhead (i.e. on Linux).

On Mon, Sep 20, 2021 at 4:32 AM luoc  wrote:

> Hello James,
>   Great work. Is it possible to add this NOTICE to the GitHub wiki or the
> docs on the website?
>
> > On Sep 20, 2021, at 19:27, James Turton  wrote:
> >
> > Hi all
> >
> > If you browse to https://hub.docker.com/r/apache/drill/tags, you'll see
> that we've just started publishing the following new Docker images based on
> snapshots of Drill master.
> >
> > apache/drill:master-openjdk-8 (=master) snapshot of master running on
> the openjdk:8 base image
> > apache/drill:master-openjdk-11  snapshot of master running on
> the latest supported LTS OpenJDK base image
> > apache/drill:master-openjdk-14  snapshot of master running on
> the latest supported OpenJDK base image
> >
> > The latest *released* version of Drill, which remains recommended for
> production deployments, is still
> >
> > apache/drill:latest latest release running on the
> openjdk:8 base image
> >
> > Starting from the *next* release (1.20) we will also publish
> >
> > apache/drill:latest-openjdk-8 (=latest) latest release running on the
> openjdk:8 base image
> > apache/drill:latest-openjdk-11  latest release running on the
> latest supported LTS OpenJDK base image
> > apache/drill:latest-openjdk-14  latest release running on the
> latest supported OpenJDK base image
> >
> > each of which will also be tagged by Drill version, so following tags
> will be identical to those in the preceding paragraph
> >
> > apache/drill:1.20.0-openjdk-8 (=latest) latest release running on the
> openjdk:8 base image
> > apache/drill:1.20.0-openjdk-11  latest release running on the
> latest supported LTS OpenJDK base image
> > apache/drill:1.20.0-openjdk-14  latest release running on the
> latest supported OpenJDK base image
> >
> > Coming back to what's different *today*, the short of it is that you
> have containerised snapshots of master for testing unreleased code or newer
> JDK images.
> >
> > Regards
> > James
>
>


Re: Query the HBase data in Drill

2021-08-24 Thread Ted Dunning
I know somebody who is querying a very large table and has trouble with
pushdown.

They are looking for values indexed by primary key with a query like
"select * from table where key in s".  If s has a very small number of
values, this turns into primary key access, but if there are more than just
a few, it becomes a scan.

The situation that would be interesting to detect is where s has a few
tightly clustered groups. The ideal strategy would be to scan each group.
How this might be detected isn't clear to me, but it would make a massive
difference to this kind of query.

Currently, the best alternative is to try to avoid this kind of query and
build a data flow such that each cluster of keys flows into a separate
query. This would be made easier if a common table expression (CTE) query
could be done without having the optimizer try to globally optimize back to
a single big scan.

Anyway, I have absolutely no concrete suggestions for making this work, but
the need is there.


On Tue, Aug 24, 2021 at 4:39 AM luoc  wrote:

> Hello Guys,
>   Will you use Drill to query Apache HBase? If so, what new features would
> you like to see in the HBase storage plugin? In addition, Drill has
> supported Apache Cassandra since 1.19.
> Also… could you tell me what your most common storage plugins (or
> data formats) are? Thanks for your time.
>
>
> -- luoc
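
To make the pushdown problem concrete, here is a sketch with made-up table
and key names; the first form is the one that degrades to a scan once the
IN list grows, the second is one hand-written way to express the
per-cluster scans:

```
-- A few keys: the planner can turn this into point lookups.
SELECT * FROM hbase.`events`
WHERE row_key IN ('k00010', 'k00012');

-- Many keys falling into a few tight clusters currently become one big
-- scan. Rewriting each cluster as a bounded range keeps the scans small,
-- at the cost of doing the clustering yourself:
SELECT * FROM hbase.`events`
WHERE (row_key BETWEEN 'k00010' AND 'k00050')
   OR (row_key BETWEEN 'k90010' AND 'k90050');
```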


Re: Strange query crash

2021-08-11 Thread Ted Dunning
Interesting.

I will see what I can do about the upgrade.



On Wed, Aug 11, 2021 at 7:16 AM luoc  wrote:

> Hello Ted,
>   I think the error stack can be deceiving. There is a ticket
> (DRILL-4254 <https://issues.apache.org/jira/browse/DRILL-4254>) related
> to the issue. I recommend that you upgrade to the latest version if it is
> caused by the schema change.
>
> > On Aug 11, 2021, at 4:18 AM, Ted Dunning  wrote:
> >
> >
> > I am running a moderate-sized data reduction task and getting a strange
> crash with Drill 1.16.  Stack trace is shown below.
> >
> > The query is this:
> >
> > ```
> > create table dfs.home.`mrms/grib-07.parquet`
> > partition by (box)
> > as
> > with
> > t1 as (
> >select value as precip, datetime as t, cast(latitude as double) as
> latitude, cast(longitude as double) longitude
> >from table(dfs.home.`mrms/*grib*csv`(type => 'text', fieldDelimiter
> => ',', extractHeader => true))
> >limit 4)
> >
> > select precip, latitude, longitude, floor(latitude)*100 -
> floor(longitude) box
> > from t1
> > order by box, latitude, longitude, t
> > ```
> >
> > The basic idea is that we are scanning 740 CSV files containing about
> 19GB of data and I want to write them to a partitioned parquet dataset. I
> am progressively increasing the number of lines processed to verify things
> are working. The process worked fine at 200M rows of data and fails at
> 400M. The text of the error is disconcerting because it claims that there
> is an index error, but the index given is in the specified range.
> >
> > Does anybody have any ideas on this? I haven't tried more recent
> versions.
> >
> >
> > Fragment 3:0
> >
> > Please, refer to logs for more information.
> >
> > [Error Id: e681aca3-78b7-496a-9af1-7ec34fcf31a9 on nodec:31010]
> >   at
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:630)
> ~[drill-common-1.16.0.10-mapr.jar:1.16.0.10-mapr]
> >   at 
> > org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:363)
> [drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
> >   at 
> > org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:219)
> [drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
> >   at 
> > org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:329)
> [drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
> >   at
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
> [drill-common-1.16.0.10-mapr.jar:1.16.0.10-mapr]
> >   at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [na:1.8.0_292]
> >   at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [na:1.8.0_292]
> >   at java.lang.Thread.run(Thread.java:748) [na:1.8.0_292]
> > Caused by: java.lang.IllegalStateException:
> java.lang.IndexOutOfBoundsException: index: 131071, length: 19 (expected:
> range(0, 131072))
> >   at
> org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.doWork(RemovingRecordBatch.java:69)
> ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
> >   at
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:117)
> ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
> >   at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:186)
> ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
> >   at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:126)
> ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
> >   at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:116)
> ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
> >   at
> org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:63)
> ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
> >   at
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:141)
> ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
> >   at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:186)
> ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
> >   at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:126)
> ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
> >   at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRe

Strange query crash

2021-08-10 Thread Ted Dunning


I am running a moderate-sized data reduction task and getting a strange crash 
with Drill 1.16.  Stack trace is shown below.

The query is this:

```
create table dfs.home.`mrms/grib-07.parquet`
partition by (box)
as 
with
t1 as (
select value as precip, datetime as t, cast(latitude as double) as 
latitude, cast(longitude as double) longitude
from table(dfs.home.`mrms/*grib*csv`(type => 'text', fieldDelimiter => ',', 
extractHeader => true))
limit 4)

select precip, latitude, longitude, floor(latitude)*100 - floor(longitude) box
from t1
order by box, latitude, longitude, t
```

The basic idea is that we are scanning 740 CSV files containing about 19GB of 
data and I want to write them to a partitioned parquet dataset. I am 
progressively increasing the number of lines processed to verify things are 
working. The process worked fine at 200M rows of data and fails at 400M. The 
text of the error is disconcerting because it claims that there is an index 
error, but the index given is in the specified range.

Does anybody have any ideas on this? I haven't tried more recent versions.


Fragment 3:0

Please, refer to logs for more information.

[Error Id: e681aca3-78b7-496a-9af1-7ec34fcf31a9 on nodec:31010]
at 
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:630)
 ~[drill-common-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:363)
 [drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:219)
 [drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:329)
 [drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) 
[drill-common-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[na:1.8.0_292]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[na:1.8.0_292]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_292]
Caused by: java.lang.IllegalStateException: 
java.lang.IndexOutOfBoundsException: index: 131071, length: 19 (expected: 
range(0, 131072))
at 
org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.doWork(RemovingRecordBatch.java:69)
 ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:117)
 ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:186)
 ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:126)
 ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:116)
 ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:63)
 ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:141)
 ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:186)
 ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:126)
 ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:116)
 ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:63)
 ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:141)
 ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:186)
 ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104) 
~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext(PartitionSenderRootExec.java:152)
 ~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 
org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:94) 
~[drill-java-exec-1.16.0.10-mapr.jar:1.16.0.10-mapr]
at 

Re: [DISCUSS] Drill 1.19.0 release

2021-06-14 Thread Ted Dunning
 Laurent Goujon <
> laur...@dremio.com>
> >>>> >>>> wrote:
> >>>> >>>>>>>
> >>>> >>>>>>> Sadly, I haven't heard from people regarding the patches. At
> the same
> >>>> >>>>>> time,
> >>>> >>>>>>> I think we held the window open for merging the changes for a
> very
> >>>> >>>> long
> >>>> >>>>>>> time. Unless there's objection, I'm planning to merge the
> Guava and
> >>>> >>>>>>> Jetty/Hadoop pull requests later today, and doing the first
> RC for
> >>>> >>>> Drill
> >>>> >>>>>>> 1.19.0
> >>>> >>>>>>>
> >>>> >>>>>>> Here are the pull request links:
> >>>> >>>>>>> * https://github.com/apache/drill/pull/2202
> >>>> >>>>>>> * https://github.com/apache/drill/pull/2236
> >>>> >>>>>>>
> >>>> >>>>>>> Laurent
> >>>> >>>>>>>
> >>>> >>>>>>>
> >>>> >>>>>>> On Wed, May 26, 2021 at 11:59 AM Laurent Goujon <
> laur...@dremio.com>
> >>>> >>>>>> wrote:
> >>>> >>>>>>>
> >>>> >>>>>>>> After several retries, the Guava checks successfully passed:
> >>>> >>>>>>>> https://github.com/apache/drill/pull/2202
> >>>> >>>>>>>>
> >>>> >>>>>>>> Charles, can we proceed on merging your change?
> >>>> >>>>>>>>
> >>>> >>>>>>>> Laurent
> >>>> >>>>>>>>
> >>>> >>>>>>>> On Tue, May 25, 2021 at 10:24 PM Laurent Goujon <
> laur...@dremio.com>
> >>>> >>>>>>>> wrote:
> >>>> >>>>>>>>
> >>>> >>>>>>>>> Just an update. There's a patch for updating both Jetty and
> Hadoop
> >>>> >>>> (at
> >>>> >>>>>>>>> the same time) as those changes are co-dependent:
> >>>> >>>>>>>>> https://github.com/apache/drill/pull/2236
> >>>> >>>>>>>>>
> >>>> >>>>>>>>> As for the Guava patch, I'd be happy to help, but I'm not
> sure
> >>>> >>>> what's
> >>>> >>>>>>>>> left. As far as I can tell the shaded version of Guava has
> been
> >>>> >>>>>> updated,
> >>>> >>>>>>>>> but the build is failing. The security vulnerabilities for
> Guava are
> >>>> >>>>>>>>> moderate (and actually it seems a fix for CVE-2020-8908
> would
> >>>> >>>> require a
> >>>> >>>>>>>>> code change instead of a Guava update.
> >>>> >>>>>>>>>
> >>>> >>>>>>>>> Since this has been almost a month since we started this
> release
> >>>> >>>>>> process,
> >>>> >>>>>>>>> I wonder if we still want to wait on this patch, or if we
> should
> >>>> >>>> move
> >>>> >>>>>> it to
> >>>> >>>>>>>>> the next release.
> >>>> >>>>>>>>>
> >>>> >>>>>>>>> Let me know what people think,
> >>>> >>>>>>>>>
> >>>> >>>>>>>>> On Tue, May 25, 2021 at 8:24 AM Laurent Goujon <
> laur...@dremio.com>
> >>>> >>>>>>>>> wrote:
> >>>> >>>>>>>>>
> >>>> >>>>>>>>>> Anything I can help with?
> >>>> >>>>>>>>>>
> >>>> >>>>>>>>>> On Tue, May 25, 2021 at 7:02 AM Charles Givre <
> cgi...@gmail.com>
> >>>> >>>>>> w

Re: [ANNOUNCE] Apache Drill 1.19.0 Released

2021-06-14 Thread Ted Dunning
Congratulations to Laurent as a first time release manager!

Well done.



On Mon, Jun 14, 2021 at 5:56 PM Laurent Goujon  wrote:

> On behalf of the Apache Drill community, I am happy to announce the release
> of Apache Drill 1.19.0.
>
> Drill is an Apache open-source SQL query engine for Big Data exploration.
> Drill is designed from the ground up to support high-performance analysis
> on the semi-structured and rapidly evolving data coming from modern Big
> Data applications, while still providing the familiarity and ecosystem of
> ANSI SQL, the industry-standard query language. Drill provides
> plug-and-play integration with existing Apache Hive and Apache HBase
> deployments.
>
> For information about Apache Drill, and to get involved, visit the project
> website [1].
>
> Total of 115 JIRA's are resolved in this release of Drill with following
> new features and improvements [2]:
>
>  - Cassandra Storage Plugin (DRILL-92)
>  - Elasticsearch Storage Plugin (DRILL-3637)
>  - XML Storage Plugin (DRILL-7823)
>  - Splunk Storage Plugin (DRILL-7751)
>  - Avro with schema registry support for Kafka (DRILL-5940)
>  - Secure mechanism for specifying storage plugin credentials (DRILL-7855)
>  - Linux ARM64 based system support (DRILL-7921)
>  - Rowset based JSON reader (DRILL-6953)
>  - Use streaming for REST JSON queries (DRILL-7733)
>  - Several plugins have been converted to the Enhanced Vector Framework
> (EVF)
>- Convert SequenceFiles to EVF (DRILL-7525)
>- Convert SysLog to EVF (DRILL-7532)
>- Convert Pcapng to EVF (DRILL-7533)
>- Convert HTTPD format plugin to EVF (DRILL-7534)
>- Convert Image Format to EVF (DRILL-7533)
>
> For the full list please see release notes [3].
>
> The binary and source artifacts are available here [4].
>
> Thanks to everyone in the community who contributed to this release!
>
> 1. https://drill.apache.org/
> 2. https://drill.apache.org/blog/2021/06/10/drill-1.19-released/
> 3. https://drill.apache.org/docs/apache-drill-1-19-0-release-notes/
> 4. https://drill.apache.org/download/
>


Re: [RESULT] [VOTE] Release Apache Drill 1.19.0 RC1

2021-06-09 Thread Ted Dunning
Thanks.

I figured you would be ahead of me on this.

On Wed, Jun 9, 2021 at 12:27 PM  wrote:

> Hello Ted,
>
> Yes, initially I tried both options.
> I have also left a comment on the ticket, hope it will be resolved soon.
>
> Kind regards,
> Volodymyr Vysotskyi
>
> On 2021/06/09 19:04:02, Ted Dunning  wrote:
> > Vova,
> >
> > Gavin responded on INFRA-21981 to the effect that upload should go to the
> > dev side and then svn mv should be used to move to the release side.
> >
> > Is that what you tried to do?
> >
> > On Wed, Jun 9, 2021 at 10:25 AM  wrote:
> >
> > > I have some issues, will deploy after
> > > https://issues.apache.org/jira/browse/INFRA-21981 is fixed.
> > >
> > > On 2021/06/09 16:27:12, vo...@apache.org wrote:
> > > > Hello Laurent,
> > > >
> > > > I’ll publish them later today.
> > > >
> > > > Kind regards,
> > > > Volodymyr Vysotskyi
> > > >
> > > > On 2021/06/09 04:39:50, Laurent Goujon  wrote:
> > > > > Hi,
> > > > >
> > > > > May I kindly ask for a PMC to push the RC1 artifacts to the dist
> > > > > repository per instructions at
> > > > > https://github.com/apache/drill/blob/master/docs/dev/Release.md?
> > > > >
> > > > > The artifacts are available at
> > > > > https://home.apache.org/~laurent/drill/releases/1.19.0/rc1/
> > > > >
> > > > > Laurent
> > > > >
> > > > > On Tue, Jun 8, 2021 at 9:36 PM Laurent Goujon  wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > The vote passes. Thanks to everyone who has tested the release
> > > > > > candidate and given their comments and votes. Final tally:
> > > > > >
> > > > > > 3x +1 (binding): Laurent, Ted, Vova
> > > > > >
> > > > > > No 0s or -1s.
> > > > > >
> > > > > > I'll start the process for pushing the release artifacts and send an
> > > > > > announcement once propagated.
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > > Laurent


Re: [RESULT] [VOTE] Release Apache Drill 1.19.0 RC1

2021-06-09 Thread Ted Dunning
Vova,

Gavin responded on INFRA-21981 to the effect that upload should go to the
dev side and then svn mv should be used to move to the release side.

Is that what you tried to do?



On Wed, Jun 9, 2021 at 10:25 AM  wrote:

> I have some issues, will deploy after
> https://issues.apache.org/jira/browse/INFRA-21981 is fixed.
>
> On 2021/06/09 16:27:12, vo...@apache.org wrote:
> > Hello Laurent,
> >
> > I’ll publish them later today.
> >
> > Kind regards,
> > Volodymyr Vysotskyi
> >
> > On 2021/06/09 04:39:50, Laurent Goujon  wrote:
> > > Hi,
> > >
> > > May I kindly ask for a PMC to push the RC1 artifacts to the dist
> > > repository per instructions at
> > > https://github.com/apache/drill/blob/master/docs/dev/Release.md?
> > >
> > > The artifacts are available at
> > > https://home.apache.org/~laurent/drill/releases/1.19.0/rc1/
> > >
> > > Laurent
> > >
> > > On Tue, Jun 8, 2021 at 9:36 PM Laurent Goujon  wrote:
> > >
> > > > Hi all,
> > > >
> > > > The vote passes. Thanks to everyone who has tested the release
> > > > candidate and given their comments and votes. Final tally:
> > > >
> > > > 3x +1 (binding): Laurent, Ted, Vova
> > > >
> > > > No 0s or -1s.
> > > >
> > > > I'll start the process for pushing the release artifacts and send an
> > > > announcement once propagated.
> > > >
> > > > Kind regards,
> > > >
> > > > Laurent


Re: [VOTE] Release Apache Drill 1.19.0 - RC1

2021-06-05 Thread Ted Dunning
+1

I checked signatures on the source and binary tar files.

I extracted the binary tar file and ran some simple queries and validated
that the web UI came up when I started embedded (found a dead link in the
docs ... filed a JIRA for that).

I ran `mvn package` at the root level of the source as extracted from the
tar file. That has been running for 10-20 minutes with no errors in the
tests, but I am not planning to wait for completion.

I tried this new version on my broken csv example and it handled the
situation nicely. This is improved from 1.16.



On Fri, Jun 4, 2021 at 11:35 PM Laurent Goujon  wrote:

> Hi all,
>
> I'd like to propose the first release candidate (RC1) of Apache Drill,
> version 1.19.0.
> The release candidate covers a total of 109 resolved JIRAs [1]. Thanks
> to everyone who contributed to this release.
> The tarball artifacts are hosted at [2] and the maven artifacts are
> hosted at [3].
> This release candidate is based on commit
> ad3f344ac21e0462aa82f51f648a21a0554cf368 located at [4].
> Please download and try out the release.
>
> The vote ends at 5 PM UTC (9 AM PDT, 7 PM EET, 10:30 PM IST), June 8, 2021.
>
> [ ] +1
> [ ] +0
> [ ] -1
> Here's my vote: +1
> Laurent
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313820=12348331
> [2] https://home.apache.org/~laurent/drill/releases/1.19.0/rc1/
> 
> [3] https://repository.apache.org/content/repositories/orgapachedrill-1086
> 
> [4] https://github.com/laurentgo/drill/commits/drill-1.19.0
>


[jira] [Created] (DRILL-7949) documentation error - missing link

2021-06-05 Thread Ted Dunning (Jira)
Ted Dunning created DRILL-7949:
--

 Summary: documentation error - missing link
 Key: DRILL-7949
 URL: https://issues.apache.org/jira/browse/DRILL-7949
 Project: Apache Drill
  Issue Type: Task
Reporter: Ted Dunning


In checking rc1 for 1.19, I noted that this page:

[https://drill.apache.org/docs/configuring-storage-plugins/]

has a link to "Start the web UI" to 
[https://drill.apache.org/docs/starting-the-web-console/]

and that page does not exist.

I think that link should go to 
[https://drill.apache.org/docs/starting-the-web-ui/]

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release Apache Drill 1.19.0 - RC0

2021-06-04 Thread Ted Dunning
The release manager really should be deciding what makes the cut and what
does not.

That is the origin of the tradition of allowing any committer to be a
release manager. If somebody is unhappy about the content going into a
release, they can roll another release that meets their particular needs.

Laurent is managing the 1.19 release and he should be managing it.

On Fri, Jun 4, 2021 at 11:21 AM Laurent Goujon  wrote:

> I think you missed my point:
>
> - Not all bugs are equal, and not all of them should cause us to discard the
> current evaluation of a release candidate. There are currently 1631 open
> bugs for the project according to the Apache Drill JIRA: does it mean we
> should wait for all those bugs to be fixed before we can release? What
> makes this bug special compared to the others?
> - Assuming the previous point is cleared and that the bug should indeed be
> part of the release, it would be good to let the release manager handle it
> or at least coordinate with them instead of doing it on your own. That's
> usually what I've seen done in other projects and it seems to be a
> reasonable thing to do as the release manager might be already deep in
> patches merge and evaluation, and unexpected changes to the tree might
> cause extra work which could have been avoided. Or just because that's the
> respectful/polite thing to do...
>
> On Thu, Jun 3, 2021 at 11:11 PM luoc  wrote:
>
> > Laurent,
> >   Thanks for doing this. RC0 is no longer eligible for the next step.
> > There is a consensus that we cannot release a version with known
> > issues (the pull requests marked as `bug`). In fact, Drill's release
> > process is not friendly, and we will take up that discussion after the
> > release. Now our focus is on preparing for RC1. BTW, you're doing great.
> >
> > > On Jun 4, 2021, at 1:20 PM, Laurent Goujon  wrote:
> > >
> > > You actually went ahead and merged those patches without waiting while
> I
> > > was hoping we could get some consensus first :(
> > >
> > > Can I just ask you to please respect the effort I'm putting in
> following
> > > what I think is the release process? If people think I'm not following
> > the
> > > proper steps or that I'm not doing a good job at doing it, I'll gladly
> > > accept feedback and will do my best to address it, but going over me
> > isn't
> > > helping me or the future volunteers for the next releases who might
> be
> > > also wondering what the release process should be.
> > > Meanwhile I'll wait to get a review for the DRILL-7945 patch fixing the
> > > Guava regression, and hopefully I should be able to do another release
> > > candidate tomorrow.
> > >
> > > Laurent
> > >
> > > On Thu, Jun 3, 2021 at 5:46 PM luoc  wrote:
> > >
> > >>
> > >> DRILL-7945 blocked the release. So, I'm ready to merge
> > DRILL-7937
> > >> and DRILL-7940 as bug fixes.
> > >>
> > >>> On Jun 4, 2021, at 01:15, Laurent Goujon  wrote:
> > >>>
> > >>> Hey guys,
> > >>>
> > >>> Can we please stop changing the goal post again and again? The fact
> > that
> > >>> some of those pull requests are ready to merge should not be the sole
> > >>> consideration when to do a next release candidate.
> > >>>
> > >>> I've been asking several times on this mailing list about what we
> want
> > to
> > >>> include or not, and we got an agreement several times about it, and
> > >> several
> > >>> times we are now having this conversation.
> > >>> IMHO, I would not include DRILL-7941, DRILL-7942 and DRILL-7943:
> those
> > >> are
> > >>> new enhancements impacting Drill tests (not even the main product)
> and
> > I
> > >> do
> > >>> not understand the rush in making them part of the release.
> > Specifically
> > >>> for the JUnit 5 update, I think the change is misleading because it
> > looks
> > >>> like it's only the introduction of JUnit5 in one test class and
> > >> everything
> > >>> else still uses JUnit 4, so I would hardly call it an upgrade...
> > >>>
> > >>> As for DRILL-7937 and DRILL-7940, the issues were open in the last 3
> > days
> > >>> ago, but they do not seem to be regressions since 1.18.0, just gaps
> in
> > >> what
> > >>> Drill provides. Personally since we are this deep in the release, I
> > would
> > >>> also skip these one too. But if people have more contexts on those,
> > maybe
> > >>> we can agree they should be merged?
> > >>>
> > >>> Laurent
> > >>>
> > >>>
> >  On Thu, Jun 3, 2021 at 6:10 AM Charles Givre 
> > wrote:
> > 
> >  There are like 5 minor PRs that are approved and awaiting merge.
> I'd
> > >> vote
> >  that we include them.  Specifically:
> > 
> >  DRILL-7943: Update Hamcrest
> >  DRILL-7942: Update Mockito
> >  DRILL-7941: Update junit to 5.7.2
> >  DRILL-7937:  Parquet decimal error
> >  DRILL-7940: Fix Kafka Key
> > 
> >  These are all approved and can be merged.
> > 
> >  -- C
> > 
> > >> On Jun 3, 2021, at 9:01 AM, luoc  wrote:
> > >
> > >
> > > DRILL-7940, too
> > >
> >> On 

Re: Feature/Question

2021-06-01 Thread Ted Dunning
Akshay,

One of the major design goals of Drill is to run individual queries in a
parallelized fashion. It does that very well, under the right conditions.

So the first issue is the installation. You have to have a number of
drill-bits that are configured to work together. Your description of "running
the queries from drill-localhost instance" sounds at odds with this.

If you look at this page, there is a good description of how Drill
processes queries:

https://drill.apache.org/architecture/

This page echoes what I said earlier in more detail and with pictures.

To understand how to install Drill in distributed mode, this page can help:

https://drill.apache.org/docs/getting-started/

Look down the table of contents to find "Installing Drill in Distributed
Mode".

Now, there are issues that can prevent Drill from being able to parallelize
queries. In particular, if your data is in a single file the actual reading
of the data is likely to happen in a single thread of execution. If the
data that comes from that reading is not large and especially if your query
does not involve sorting or aggregating the data, Drill might opt not to
parallelize the processing of the data. This can happen even if your data
is spread across many files if Drill can determine that all of the
interesting data is in a single file via information about how the file is
partitioned.
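
If you want to see what Drill actually decided for a given query, you can
inspect the plan it generates (the table path here is just a hypothetical
example):

  EXPLAIN PLAN FOR SELECT * FROM dfs.`/data/logs` ORDER BY `ts`;

Roughly speaking, a plan with exchanges and multiple fragments indicates
that Drill parallelized the work; a trivial single-fragment plan indicates
that it did not.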

Does this help?




On Tue, Jun 1, 2021 at 6:48 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
abhasi...@bloomberg.net> wrote:

> "I believe that any of the 5 drill machines can handle queries completely
> symmetrically. When a query is received, the planning is done and execution
> fragments are scheduled on the other nodes."
>
> That's interesting - I've not come across this. I currently have a lot of
> historical data partitioned in s3, and I've tried different queries -
> however the load has NOT been parallelized/distributed.
> Therefore, I had understood distributed mode to mean multiple queries
> being able to run on a single node - but not the other way around.
>
> I was running the queries from the drill-localhost instance on one of the
> nodes. Is there something else I should try to achieve the above behavior?
>
> From: ted.dunn...@gmail.com At: 05/30/21 18:51:03 UTC-4:00
> To: Akshay Bhasin (BLOOMBERG/ 731 LEX ) 
> Cc: dev@drill.apache.org
> Subject: Re: Feature/Question
>
>
> Here are some more answers:
>
>
> On Thu, May 27, 2021 at 10:09 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
> abhasi...@bloomberg.net> wrote:
>
>> Hi Ted,
>>
>> Yes sure - below are the 2 reasons for it -
>>
>> 1) If I run 5 drill machines in a cluster, all connected to a single end
>> point at s3, I'll have to use the machines to create the parquet files.
>> Now, there are 2 sub questions here -
>>
>> - I'm not sure if a single drill end point is exposed for me to query ...
>> a unique cluster ID I can use where all requests will be load balanced?
>>
>
> I believe that any of the 5 drill machines can handle queries completely
> symmetrically. When a query is received, the planning is done and execution
> fragments are scheduled on the other nodes.
>
> As such, you can either build a load balancer in front of the cluster or
> you can do roughly the same thing using DNS round-robin. It won't make a
> lot of difference, in practice, though because the load is spread around
> pretty well even if only one node does all of the planning (at least if
> your queries involve a lot of work).
>
>
>> - What if the node goes down? For instance, on a single node (say A in
>> above example) - one user is running a read query & at the same time I run
>> a create table query ? That would block and congest the node.
>>
>
> If a node goes down, any query involving it will fail. The loss will be
> detected in a few seconds and any queries accepted by the cluster during
> that time may hang up a little bit. Once the failure has been detected,
> operation will continue without any problems. Clients may or may not retry
> their queries automatically (I think that most won't).
>
>
>>
>> 2) This is a minor one - and I could be wrong - I'm not sure drill can
>> write to s3 bucket. I think you can only put/upload files there, you cannot
>> write to it.
>>
>
> Charles' answer was on the mark here.
>
>
>


Fwd: [NOTICE] Git web site publishing to be done via .asf.yaml only as of July 1st

2021-05-31 Thread Ted Dunning
Drill is on this list.

I think that the fix is relatively trivial, but haven't examined it
carefully.

-- Forwarded message -
From: Daniel Gruno 
Date: Mon, May 31, 2021 at 6:41 AM
Subject: [NOTICE] Git web site publishing to be done via .asf.yaml only as
of July 1st
To: Users 


TL;DR: if your project web site is kept in subversion, disregard this
email please. If your project web site is using git, and you have not
deployed it via .asf.yaml, you MUST switch before July 1st or risk your
web site going stale.



Dear Apache projects,
In order to simplify our web site publishing services and improve
self-serve for projects and stability of deployments, we will be turning
off the old 'gitwcsub' method of publishing git web sites. As of this
moment, this involves 120 web sites. All web sites should switch to our
self-serve method of publishing via the .asf.yaml meta-file. We aim to
turn off gitwcsub around July 1st.


## How to publish via .asf.yaml:
Publishing via .asf.yaml is described at:
https://s.apache.org/asfyamlpublishing
You can also see an example .asf.yaml with publishing and staging
profiles for our own infra web site at:
https://github.com/apache/infrastructure-website/blob/asf-site/.asf.yaml

In short, one puts a file called .asf.yaml into the branch that needs to
be published as the project's web site, with the following two-line
content, in this case assuming the published branch is 'asf-site':

publish:
   whoami: asf-site


It is important to note that the .asf.yaml file MUST be present at the
root of the file system in the branch you wish to publish. The 'whoami'
parameter acts as a guard, ensuring that only the intended branch is used
for publishing.


## Is my project affected by this?
The quickest way to check if you need to switch to a .asf.yaml approach
is to check out site source page at
https://infra-reports.apache.org/site-source/ - if your site is listed
in yellow, you will need to switch. This page will also tell you which
branch you are currently publishing as your web site. This is (should
be) the branch that you must add a .asf.yaml meta file to.

The web site source list updates every hour. If your project site
appears in green, you are already using .asf.yaml for publishing and do
not need to make any changes.


## What happens if we miss the deadline?
If you miss the deadline, don't fret. Your site will of course still
remain online as is, but new updates will not appear till you
create/edit the .asf.yaml and set up publishing.


## Who do we contact if we have questions?
Please contact us at us...@infra.apache.org if you have any additional
questions.


With regards,
Daniel on behalf of ASF Infra.


Re: Feature/Question

2021-05-30 Thread Ted Dunning
Here are some more answers:


On Thu, May 27, 2021 at 10:09 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
abhasi...@bloomberg.net> wrote:

> Hi Ted,
>
> Yes sure - below are the 2 reasons for it -
>
> 1) If I run 5 drill machines in a cluster, all connected to a single end
> point at s3, I'll have to use the machines to create the parquet files.
> Now, there are 2 sub questions here -
>
> - I'm not sure if a single drill end point is exposed for me to query ...
> a unique cluster ID I can use where all requests will be load balanced?
>

I believe that any of the 5 drill machines can handle queries completely
symmetrically. When a query is received, the planning is done and execution
fragments are scheduled on the other nodes.

As such, you can either build a load balancer in front of the cluster or
you can do roughly the same thing using DNS round-robin. It won't make a
lot of difference, in practice, though because the load is spread around
pretty well even if only one node does all of the planning (at least if
your queries involve a lot of work).


> - What if the node goes down? For instance, on a single node (say A in
> above example) - one user is running a read query & at the same time I run
> a create table query ? That would block and congest the node.
>

If a node goes down, any query involving it will fail. The loss will be
detected in a few seconds and any queries accepted by the cluster during
that time may hang up a little bit. Once the failure has been detected,
operation will continue without any problems. Clients may or may not retry
their queries automatically (I think that most won't).


>
> 2) This is a minor one - and I could be wrong - I'm not sure drill can
> write to s3 bucket. I think you can only put/upload files there, you cannot
> write to it.
>

Charles' answer was on the mark here.


Re: Feature/Question

2021-05-27 Thread Ted Dunning
Akshay,

I don't understand why you can't use Drill to create the parquet files.
Can you say more?

Is there a language constraint? A process constraint?

As I hear it, you are asking "I don't want to use Drill to create parquet,
I want to use something else". The problem is that there are tons of other
ways. I start with not understanding your needs (because I think Drill is the
easiest way for me to create parquet files) and then have no idea which
direction you are headed.

Just a little more definition could help me (and others) help you.
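
For what it's worth, creating Parquet with Drill itself is normally just a
CTAS; a minimal sketch, with hypothetical paths and names, is:

  ALTER SESSION SET `store.format` = 'parquet';
  CREATE TABLE dfs.tmp.`events_parquet` AS
    SELECT * FROM dfs.`/staging/events.csv`;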

On Thu, May 27, 2021 at 8:18 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
abhasi...@bloomberg.net> wrote:

> Hi Drill Team,
>
> I've another question - is there a Python parquet module you provide/support
> which I can leverage to create .parquet & .parquet.crc files which drill
> creates.
>
> I currently have a drill cluster & I want to use it for reading the data
> but not creating the parquet files.
>
> I'm aware of other modules, but I want to preserve the speed &
> optimization of drill - so particularly looking at the module which drill
> uses to convert files to parquet & parquet.crc.
>
> My end goal here is to have a drill cluster reading data from s3 & a
> separate process to convert data to parquet & parquet.crc files & upload it
> to s3.
>
> Best,
> Akshay
>
> From: ted.dunn...@gmail.com At: 04/27/21 17:37:43 UTC-4:00
> To: Akshay Bhasin (BLOOMBERG/ 731 LEX ) 
> Cc: dev@drill.apache.org
> Subject: Re: Feature/Question
>
>
> Akshay,
>
> That's great news!
>
> On Tue, Apr 27, 2021 at 1:10 PM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
> abhasi...@bloomberg.net> wrote:
>
>> Hi Ted,
>>
>> Thanks for reaching out. Yes - the below worked successfully.
>>
>> I was able to create different objects in s3 like 'XXX/YYY/filename',
>> 'XXX/ZZZ/filename' and able to query like
>> SELECT * FROM XXX.
>>
>> Thanks !
>>
>> Best,
>> Akshay
>>
>> From: ted.dunn...@gmail.com At: 04/21/21 17:21:42
>> To: Akshay Bhasin (BLOOMBERG/ 731 LEX ) ,
>> dev@drill.apache.org
>> Subject: Re: Feature/Question
>>
>>
>> Akshay,
>>
>> Yes. It is possible to do what you want from a few different angles.
>>
>> As you have noted, S3 doesn't have directories. Not really. On the other
>> hand, people simulate this using naming schemes and S3 has some support for
>> this.
>>
>> One of the simplest ways to deal with this is to create a view that
>> explicitly mentions every S3 object that you have in your table. The
>> contents of this view can get a bit cumbersome, but that shouldn't be a
>> problem since users never need to know. You will need to set up a scheduled
>> action to update this view occasionally, but that is pretty simple.
>>
>> The other way is to use a naming scheme with a delimiter such as /. This
>> is described at
>> https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html
>> If you do that and have files named (for instance) foo/a.json,
>> foo/b.json, foo/c.json and you query
>>
>> select * from s3.`foo`
>>
>> you should see the contents of a.json, b.json and c.json. See here for
>> commentary
>> 
>>
>> I haven't tried this, however, so I am simply going on the reports of
>> others. If this works for you, please report your success back here.
>>
>>
>>
>>
>>
>> On Wed, Apr 21, 2021 at 11:34 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
>> abhasi...@bloomberg.net> wrote:
>>
>>> Hi Drill Community,
>>>
>>> I'm Akshay and I'm using Drill for a project I'm working on.
>>>
>>> There is this particular use case I want to implement - I want to know
>>> if it's possible.
>>>
>>> 1) Currently, we have a partition of file system and we create a view on
>>> top of it. For example, we have below directory structure -
>>>
>>> /home/product/product_name/year/month/day/*parquet
>>> /home/product/product_name_2/year/month/day/*parquet
>>> /home/product/product_name_3/year/month/day/*parquet
>>>
>>> Now, we create a view over it -
>>> Create view temp AS SELECT `dir0` AS prod, `dir1` as year, `dir2` as
>>> month, `dir3` as day, * from dfs.`/home/product`;
>>>
>>> Then, we can query all the data dynamically -
>>> SELECT * from temp LIMIT 5;
>>>
>>> 2) Now I want to replicate this behavior via s3. I want to ask if it's
>>> possible - I was able to create a logical directory. But s3 inherently does
>>> not support directories only objects.
>>>
>>> Therefore, I was curious to know if this is supported or if there is a way to do it. I
>>> was unable to find any documentation on your website related to
>>> partitioning data on s3.
>>>
>>> Thanks for your help.
>>> Best,
>>> Akshay
>>
>>
>>
>


Re: Release and GPG key

2021-05-24 Thread Ted Dunning
I would be happy to do this. My old Apache key is still live, but it isn't
in the KEYS file yet. I can add it easily enough.

One quick note. The fact that a key is in the KEYS file is enough of a web
of trust in Apache. This is because only a committer can put it there.
There is a further cross check with the SVN file.

It is a very nice thing to do, however, to cross-sign keys. It is also a
very tricky thing to do during COVID times.

I will go ahead and cross-sign Laurent's key once we have the phone call so
that we have a bit of traceability this time.




On Mon, May 24, 2021 at 4:45 PM Laurent Goujon  wrote:

> Yes, I was thinking of doing a zoom meeting where I would show proof of id
> + key id. Especially because of Covid, that seems the easiest option.
>
> On Mon, May 24, 2021, 16:08 Ted Dunning  wrote:
>
> > Laurent,
> >
> > The critical question here is how you can substantiate this key. In person,
> > with a government ID, this would be easy.
> >
> > Do you know a committer personally who could vouch for you? Would you be
> > interested in having a video call where you can present some ID?
> >
> > On Mon, May 24, 2021 at 3:24 PM Laurent Goujon 
> wrote:
> >
> > > Hi,
> > >
> > > I opened a pull request to add my public GPG keys to the KEYS file at
> the
> > > root of the project:
> > > https://github.com/apache/drill/pull/2234
> > >
> > > Sadly this key is not part of the Web Of Trust, and I would need
> someone
> > > part of it to validate my key. And also a PMC member to add it to the
> > Drill
> > > release SVN repository.
> > >
> > > Anybody interested?
> > >
> > > Laurent
> > >
> >
>


Re: Release and GPG key

2021-05-24 Thread Ted Dunning
Laurent,

The critical question here is how you can substantiate this key. In person,
with a government ID, this would be easy.

Do you know a committer personally who could vouch for you? Would you be
interested in having a video call where you can present some ID?

On Mon, May 24, 2021 at 3:24 PM Laurent Goujon  wrote:

> Hi,
>
> I opened a pull request to add my public GPG keys to the KEYS file at the
> root of the project:
> https://github.com/apache/drill/pull/2234
>
> Sadly this key is not part of the Web Of Trust, and I would need someone
> part of it to validate my key. And also a PMC member to add it to the Drill
> release SVN repository.
>
> Anybody interested?
>
> Laurent
>


Re: known bug in csv header parsing

2021-05-22 Thread Ted Dunning
I was able to test using 1.18 and found that the problem is gone. I was
unable to do a head-to-head test with 1.16, however, and couldn't figure
out how to run 1.18 on the same machines as the current 1.16 environment
without destabilizing that 1.16 environment (collision on the plugins
directory). I didn't want to spend a lot of time, so I will stick with the
judgment that the current behavior seems to be correct.

Notably, the nested quotes are handled correctly without any quoting.

Nice.

On Sat, May 22, 2021 at 6:45 PM luoc  wrote:

> Hi Ted,
>   You can use this reader without switching if you are using the latest
> version (1.19.0 or later is best). There are unit tests related to the
> compliant text reader (in the `drill-java-exec` module, in the
> `org.apache.drill.exec.store.easy.text.compliant` package).
>
> > On May 23, 2021, at 5:19 AM, Ted Dunning  wrote:
> >
> > Also, where would I find the unit tests for the compliant text reader?
> >
> > I have a simple enough case to write a unit test, but I can't see any
> > reference to the class in question outside of working code.
> >
> >
> > On Thu, May 20, 2021 at 7:40 AM Ted Dunning 
> wrote:
> >
> >>
> >> Luoc,
> >>
> >> How do I use the CompliantTextBatchReader?
> >>
> >> How is the speed?
> >>
> >> Can you point me at the old CSV reader? I am not sure where it is.
> >>
> >>
> >>
> >> On Thu, May 20, 2021 at 1:09 AM luoc  wrote:
> >>
> >>> Hello Ted,
> >>> That's a nice idea. I did a quick review of the CSV reader but did not
> >>> find any settings for handling such errors. We have since refactored the
> >>> CSV format using the EVF; please see CompliantTextBatchReader.java
> >>> (it complies with the RFC 4180 standard for text/csv files).
> >>>
> >>>> On May 20, 2021, at 13:49, Ted Dunning  wrote:
> >>>>
> >>>> I have a csv file that causes an exception when read by Drill. The
> >>> file is
> >>>> slightly mal-formed (but R can read it).
> >>>>
> >>>> Interestingly, if I don't parse the header line, I don't get the
> >>> exception
> >>>> and the problematic embedded quotes are handled well. Likewise,
> deleting
> >>>> the first data line (which is well-formed) causes the exception to go
> >>> away.
> >>>> Deleting the second data line also causes the exception to stop.
> Fixing
> >>> the
> >>>> quoting of the included quotes also fixes the problem. Swapping the
> >>> lines
> >>>> works like deleting the first line. Repeating the first line after the
> >>>> second line still gets the exception.
> >>>>
> >>>> The file is this:
> >>>> -
> >>>>
> >>>> desc,name
> >>>>
> >>>> "foo","x"
> >>>>
> >>>> "manure called "foo"","y"
> >>>>
> >>>> -
> >>>>
> >>>>
> >>>> The exception is shown below. My thought is that if the CSV file is
> >>>> considered mal-formed, we should get an error on the line that says
> >>>> something along the lines of "mal-formed input". Even better would be
> to
> >>>> allow such lines to be omitted (up to some sanity limit) or to parse
> it
> >>>> correctly (which happens without headers being parsed).
> >>>>
> >>>> Anybody have any thoughts?
> >>>>
> >>>> Here is the R behavior (it omits the embedded quotes):
> >>>>
> >>>>> f = read.csv("v.csv")
> >>>>
> >>>>> f
> >>>>
> >>>>                desc name
> >>>>
> >>>> 1               foo    x
> >>>>
> >>>> 2 manure called foo    y
> >>>>
> >>>>
> >>>> And here is the exception:
> >>>>
> >>>> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> >>>> NegativeArraySizeException Please, refer to logs for more information.
> >>>> [Error Id: 7153f837-45eb-43d1-8e19-e3ca0197c61b ]
> >>>> (java.lang.NegativeArraySizeException) null
> >>>> org.apache.drill.exec.vector.VarCharVector$Accessor.get():487
> >>>> org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():514
> >>>> org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():475
> >>>> org.apache.drill.exec.server.rest.WebUserConnection.sendData():147
> >>>> org.apache.drill.exec.ops.AccountingUserConnection.sendData():42
> >>>>
> >>>
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():120
> >>>> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> >>>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
> >>>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
> >>>> java.security.AccessController.doPrivileged():-2
> >>>> javax.security.auth.Subject.doAs():422
> >>>> org.apache.hadoop.security.UserGroupInformation.doAs():1669
> >>>> org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
> >>>> org.apache.drill.common.SelfCleaningRunnable.run():38
> >>>> java.util.concurrent.ThreadPoolExecutor.runWorker():1149
> >>>> java.util.concurrent.ThreadPoolExecutor$Worker.run():624
> >>>> java.lang.Thread.run():748
> >>>
> >>
>
>


Re: known bug in csv header parsing

2021-05-22 Thread Ted Dunning
Also, where would I find the unit tests for the compliant text reader?

I have a simple enough case to write a unit test, but I can't see any
reference to the class in question outside of working code.


On Thu, May 20, 2021 at 7:40 AM Ted Dunning  wrote:

>
> Luoc,
>
> How do I use the CompliantTextBatchReader?
>
> How is the speed?
>
> Can you point me at the old CSV reader? I am not sure where it is.
>
>
>
> On Thu, May 20, 2021 at 1:09 AM luoc  wrote:
>
>> Hello Ted,
>> That's a nice idea. I did a quick review of the CSV reader but did not
>> find any settings for handling such errors. We have since refactored the
>> CSV format using the EVF; please see CompliantTextBatchReader.java
>> (it complies with the RFC 4180 standard for text/csv files).
>>
>> > On May 20, 2021, at 13:49, Ted Dunning  wrote:
>> >
>> > I have a csv file that causes an exception when read by Drill. The
>> file is
>> > slightly mal-formed (but R can read it).
>> >
>> > Interestingly, if I don't parse the header line, I don't get the
>> exception
>> > and the problematic embedded quotes are handled well. Likewise, deleting
>> > the first data line (which is well-formed) causes the exception to go
>> away.
>> > Deleting the second data line also causes the exception to stop. Fixing
>> the
>> > quoting of the included quotes also fixes the problem. Swapping the
>> lines
>> > works like deleting the first line. Repeating the first line after the
>> > second line still gets the exception.
>> >
>> > The file is this:
>> > -
>> >
>> > desc,name
>> >
>> > "foo","x"
>> >
>> > "manure called "foo"","y"
>> >
>> > -
>> >
>> >
>> > The exception is shown below. My thought is that if the CSV file is
>> > considered mal-formed, we should get an error on the line that says
>> > something along the lines of "mal-formed input". Even better would be to
>> > allow such lines to be omitted (up to some sanity limit) or to parse it
>> > correctly (which happens without headers being parsed).
>> >
>> > Anybody have any thoughts?
>> >
>> > Here is the R behavior (it omits the embedded quotes):
>> >
>> >> f = read.csv("v.csv")
>> >
>> >> f
>> >
>> >                desc name
>> >
>> > 1               foo    x
>> >
>> > 2 manure called foo    y
>> >
>> >
>> > And here is the exception:
>> >
>> > org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
>> > NegativeArraySizeException Please, refer to logs for more information.
>> > [Error Id: 7153f837-45eb-43d1-8e19-e3ca0197c61b ]
>> > (java.lang.NegativeArraySizeException) null
>> > org.apache.drill.exec.vector.VarCharVector$Accessor.get():487
>> > org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():514
>> > org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():475
>> > org.apache.drill.exec.server.rest.WebUserConnection.sendData():147
>> > org.apache.drill.exec.ops.AccountingUserConnection.sendData():42
>> >
>> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():120
>> > org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>> > org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
>> > org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
>> > java.security.AccessController.doPrivileged():-2
>> > javax.security.auth.Subject.doAs():422
>> > org.apache.hadoop.security.UserGroupInformation.doAs():1669
>> > org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
>> > org.apache.drill.common.SelfCleaningRunnable.run():38
>> > java.util.concurrent.ThreadPoolExecutor.runWorker():1149
>> > java.util.concurrent.ThreadPoolExecutor$Worker.run():624
>> > java.lang.Thread.run():748
>>
>


Re: known bug in csv header parsing

2021-05-22 Thread Ted Dunning
Luoc,

Thanks for your reply. Can you point me to documentation about how to
switch readers?



On Fri, May 21, 2021 at 7:08 AM luoc  wrote:

> Hi Ted,
>   You can use the new version of the CSV reader (which binds the
> CompliantTextBatchReader) to query CSV since 1.16, with no change in
> usage. But this reader does not yet support your idea. I think we could
> add some code to enhance the reader. All the new storage and format
> plugins are based on the EVF, which is more powerful and stable.
>
> > On May 20, 2021, at 10:40 PM, Ted Dunning  wrote:
> >
> > Luoc,
> >
> > How do I use the CompliantTextBatchReader?
> >
> > How is the speed?
> >
> > Can you point me at the old CSV reader? I am not sure where it is.
> >
> >
> >
> > On Thu, May 20, 2021 at 1:09 AM luoc  wrote:
> >
> >> Hello Ted,
> >> That's a nice idea. I did a quick review of the CSV reader but did not
> >> find any settings for handling such errors. We have since refactored the
> >> CSV format using the EVF; please see CompliantTextBatchReader.java
> >> (it complies with the RFC 4180 standard for text/csv files).
> >>
> >>> On May 20, 2021, at 13:49, Ted Dunning  wrote:
> >>>
> >>> I have a csv file that causes an exception when read by Drill. The
> file
> >> is
> >>> slightly mal-formed (but R can read it).
> >>>
> >>> Interestingly, if I don't parse the header line, I don't get the
> >> exception
> >>> and the problematic embedded quotes are handled well. Likewise,
> deleting
> >>> the first data line (which is well-formed) causes the exception to go
> >> away.
> >>> Deleting the second data line also causes the exception to stop. Fixing
> >> the
> >>> quoting of the included quotes also fixes the problem. Swapping the
> lines
> >>> works like deleting the first line. Repeating the first line after the
> >>> second line still gets the exception.
> >>>
> >>> The file is this:
> >>> -
> >>>
> >>> desc,name
> >>>
> >>> "foo","x"
> >>>
> >>> "manure called "foo"","y"
> >>>
> >>> -
> >>>
> >>>
> >>> The exception is shown below. My thought is that if the CSV file is
> >>> considered mal-formed, we should get an error on the line that says
> >>> something along the lines of "mal-formed input". Even better would be
> to
> >>> allow such lines to be omitted (up to some sanity limit) or to parse it
> >>> correctly (which happens without headers being parsed).
> >>>
> >>> Anybody have any thoughts?
> >>>
> >>> Here is the R behavior (it omits the embedded quotes):
> >>>
> >>>> f = read.csv("v.csv")
> >>>
> >>>> f
> >>>
> >>>                desc name
> >>>
> >>> 1               foo    x
> >>>
> >>> 2 manure called foo    y
> >>>
> >>>
> >>> And here is the exception:
> >>>
> >>> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> >>> NegativeArraySizeException Please, refer to logs for more information.
> >>> [Error Id: 7153f837-45eb-43d1-8e19-e3ca0197c61b ]
> >>> (java.lang.NegativeArraySizeException) null
> >>> org.apache.drill.exec.vector.VarCharVector$Accessor.get():487
> >>> org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():514
> >>> org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():475
> >>> org.apache.drill.exec.server.rest.WebUserConnection.sendData():147
> >>> org.apache.drill.exec.ops.AccountingUserConnection.sendData():42
> >>>
> >>
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():120
> >>> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> >>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
> >>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
> >>> java.security.AccessController.doPrivileged():-2
> >>> javax.security.auth.Subject.doAs():422
> >>> org.apache.hadoop.security.UserGroupInformation.doAs():1669
> >>> org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
> >>> org.apache.drill.common.SelfCleaningRunnable.run():38
> >>> java.util.concurrent.ThreadPoolExecutor.runWorker():1149
> >>> java.util.concurrent.ThreadPoolExecutor$Worker.run():624
> >>> java.lang.Thread.run():748
> >>
>
>


Re: known bug in csv header parsing

2021-05-20 Thread Ted Dunning
Luoc,

How do I use the CompliantTextBatchReader?

How is the speed?

Can you point me at the old CSV reader? I am not sure where it is.



On Thu, May 20, 2021 at 1:09 AM luoc  wrote:

> Hello Ted,
> That's a nice idea. I did a quick review of the CSV reader but did not
> find any settings for handling such errors. We have since refactored the
> CSV format using the EVF; please see CompliantTextBatchReader.java
> (it complies with the RFC 4180 standard for text/csv files).
>
> > On May 20, 2021, at 13:49, Ted Dunning  wrote:
> >
> > I have a csv file that causes an exception when read by Drill. The file
> is
> > slightly mal-formed (but R can read it).
> >
> > Interestingly, if I don't parse the header line, I don't get the
> exception
> > and the problematic embedded quotes are handled well. Likewise, deleting
> > the first data line (which is well-formed) causes the exception to go
> away.
> > Deleting the second data line also causes the exception to stop. Fixing
> the
> > quoting of the included quotes also fixes the problem. Swapping the lines
> > works like deleting the first line. Repeating the first line after the
> > second line still gets the exception.
> >
> > The file is this:
> > -
> >
> > desc,name
> >
> > "foo","x"
> >
> > "manure called "foo"","y"
> >
> > -
> >
> >
> > The exception is shown below. My thought is that if the CSV file is
> > considered mal-formed, we should get an error on the line that says
> > something along the lines of "mal-formed input". Even better would be to
> > allow such lines to be omitted (up to some sanity limit) or to parse it
> > correctly (which happens without headers being parsed).
> >
> > Anybody have any thoughts?
> >
> > Here is the R behavior (it omits the embedded quotes):
> >
> >> f = read.csv("v.csv")
> >
> >> f
> >
> >                desc name
> >
> > 1               foo    x
> >
> > 2 manure called foo    y
> >
> >
> > And here is the exception:
> >
> > org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> > NegativeArraySizeException Please, refer to logs for more information.
> > [Error Id: 7153f837-45eb-43d1-8e19-e3ca0197c61b ]
> > (java.lang.NegativeArraySizeException) null
> > org.apache.drill.exec.vector.VarCharVector$Accessor.get():487
> > org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():514
> > org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():475
> > org.apache.drill.exec.server.rest.WebUserConnection.sendData():147
> > org.apache.drill.exec.ops.AccountingUserConnection.sendData():42
> >
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():120
> > org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> > org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
> > org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
> > java.security.AccessController.doPrivileged():-2
> > javax.security.auth.Subject.doAs():422
> > org.apache.hadoop.security.UserGroupInformation.doAs():1669
> > org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
> > org.apache.drill.common.SelfCleaningRunnable.run():38
> > java.util.concurrent.ThreadPoolExecutor.runWorker():1149
> > java.util.concurrent.ThreadPoolExecutor$Worker.run():624
> > java.lang.Thread.run():748
>


known bug in csv header parsing

2021-05-19 Thread Ted Dunning
I have a csv file that causes an exception when read by Drill. The file is
slightly mal-formed (but R can read it).

Interestingly, if I don't parse the header line, I don't get the exception
and the problematic embedded quotes are handled well. Likewise, deleting
the first data line (which is well-formed) causes the exception to go away.
Deleting the second data line also causes the exception to stop. Fixing the
quoting of the included quotes also fixes the problem. Swapping the lines
works like deleting the first line. Repeating the first line after the
second line still gets the exception.

The file is this:
-

desc,name

"foo","x"

"manure called "foo"","y"

-


The exception is shown below. My thought is that if the CSV file is
considered mal-formed, we should get an error on the line that says
something along the lines of "mal-formed input". Even better would be to
allow such lines to be omitted (up to some sanity limit) or to parse it
correctly (which happens without headers being parsed).
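
For reference, RFC 4180 escapes an embedded quote by doubling it, so a
well-formed version of the second record would be:

"manure called ""foo""","y"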

Anybody have any thoughts?

Here is the R behavior (it omits the embedded quotes):

> f = read.csv("v.csv")

> f

               desc name

1               foo    x

2 manure called foo    y


And here is the exception:

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
NegativeArraySizeException Please, refer to logs for more information.
[Error Id: 7153f837-45eb-43d1-8e19-e3ca0197c61b ]
(java.lang.NegativeArraySizeException) null
org.apache.drill.exec.vector.VarCharVector$Accessor.get():487
org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():514
org.apache.drill.exec.vector.VarCharVector$Accessor.getObject():475
org.apache.drill.exec.server.rest.WebUserConnection.sendData():147
org.apache.drill.exec.ops.AccountingUserConnection.sendData():42
org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():120
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():296
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():283
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1669
org.apache.drill.exec.work.fragment.FragmentExecutor.run():283
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1149
java.util.concurrent.ThreadPoolExecutor$Worker.run():624
java.lang.Thread.run():748


Re: [VOTE] Add Dependabot to Drill

2021-05-16 Thread Ted Dunning
I love dependabot.

I do minimal maintenance on several dozen demo projects and having a bot
check the dependencies for vulnerabilities is a godsend.

There is no downside. Yes, I get a bunch of pull requests when somebody
digs up another obscure problem with Jackson, but that isn't a problem.  I
have to worry about dependencies anyway, so why not make it relatively easy
to do?

On Sun, May 16, 2021, 7:40 AM Charles Givre  wrote:

> Hello all,
> I'd like to propose adding Dependabot to our commit process.  If you
> aren't familiar with Dependabot, it scans dependencies and alerts you to
> dependencies that have vulnerabilities.  I ran dependabot on Drill's
> source, and found several rather serious CVEs associated with dependencies,
> hence the PRs to update Guava, JUnit, and a few others.
>
> I know that these automated code quality tests aren't always the best in
> terms of producing false positives, but I do think it is in general a good
> thing to at least be aware of these kinds of issues so that we can resolve
> them if they are deemed worthy.
>
> So... I'd like to call a vote.  Would you like to add dependabot to
> Drill's github repo?  Please vote yes or no by Thursday.
>
> Thanks and Keep on Drilling!
> -- C
>
>


Re: Test Apache Drill on Linux ARM64

2021-05-07 Thread Ted Dunning
Martin,

This is exciting stuff that you are doing and very useful.

My thought is that, of the options you describe, the Travis option seems
like a good first step because it is nearly trivial (just add the CI
config file with a trivial build and test).

Running builds on remote builder nodes seems to me to add dependencies
that could require debugging at a later stage. I don't know what level of
stability to expect from such a setup, nor how confident we can be in that
expectation.

I am not clear on CircleCI versus GitHub Actions versus Travis. The
timeouts sound better and you mention ARM support, but I have no experience
to guide me.

Others probably have better and more complete thoughts than these.



On Fri, May 7, 2021 at 5:02 AM Martin Tzvetanov Grigorov <
mgrigo...@apache.org> wrote:

> Hello Drill developers,
>
> Recently I've tried to build Apache Drill on ARM64 hardware running on
> Linux.
> I found a few issues, which are described in
> https://issues.apache.org/jira/browse/DRILL-7911
>
> I've created a few Pull Requests with fixes for each issue:
> - https://github.com/apache/drill/pull/2217 - use TestContainers-MySQL
> instead of Wix-Embedded-MySQL
> - https://github.com/apache/drill/pull/2218 - Disable Storage-Splunk unit
> tests on Linux ARM64 because there is no Docker image of Splunk for
> Linux/arm64
> - https://github.com/apache/drill/pull/2219 - Increase Max Direct Memory
>
> Now I would like to suggest adding CI testing on ARM64 to prevent
> regressions in the future.
> The problem is that GitHub Actions (the CI system used by Apache Drill)
> does not yet support ARM64 architecture.
>
> Here are the possible solutions I am aware of:
>
> * use TravisCI only for running `mvn install` on Linux ARM64
> Pros:
> - TravisCI supports Linux ARM64 out of the box and the config is quite
> simple
> - Might be useful later if someone wants to add testing on Linux s390x
> Cons:
> - Use a second CI for such specific purpose
>
> * Use GitHub Actions to run the build at a remote Kubernetes cluster with
> ARM64 nodes
> More details about this approach could be read at
> https://martin-grigorov.medium.com/githubactions-build-and-test-on-huaweicloud-arm64-af9d5c97b766
> Disclaimer: I work for OpenLab Testing and Huawei sponsor us, so I can get
> you a free account at HuaweiCloud for such setup.
> The same setup could be used with any other Kubernetes provider!
>
> * Use CircleCI instead of GitHub Actions
> Pros:
> - native support for both x86_64 and aarch64  (
> https://github.com/CircleCI-Public/arm-preview-docs)
> - CircleCI allows connecting via SSH to a builder node. This way one can
> debug issues
> - higher job timeout (5h) -
> https://circleci.com/docs/2.0/runner-installation/#runner-max_run_time.
> Currently Github Actions often fail due to build timeouts of 90mins
> - it is less crowded than the Apache organization at GitHub Actions (
> https://ibb.co/RpFyQQy), so there is less wait time for the build
> Cons:
> - work is required to migrate from GitHub Actions to CircleCI
>
> I volunteer to do the work for any of these options. Just please let me
> know which one is your preferred one!
>
> Regards,
> Martin
>


Re: [DISCUSS] Drill 1.19.0 release

2021-05-03 Thread Ted Dunning
Laurent,

I don't have a stake here, so can't really comment about specifics, but the
process is looking good.



On Mon, May 3, 2021 at 9:23 PM Laurent Goujon  wrote:

> Thanks for all the answers
>
> So the issues I found based on the feedback are:
>
>- DRILL-7878: Fix LGTM Alerts
>
>- DRILL-7871: StoragePluginStore instances for different users
>
>- DRILL-7908: Fix GitHub Actions CI
>
>- DRILL-7904: Update to 30-jre Guava version
>
>- DRILL-7826: Merge Pcap and Pcapng format plugin based on EVF
>
>   - DRILL-7828: Refactor Pcap and Pcapng format plugin
>   
>- DRILL-7910: Bumps commons-io from 2.4 to 2.7
>
>- DRILL-7901: Bump junit from 4.12 to 4.13.1
>
>
> I wanted to propose Monday May 10th to do the first release candidate, but
> I have some concerns about some of the changes which may not be ready by
> then considering they seem to involve some level of effort and are in very
> early stage: The LGTM alert changes and the StoragePluginStore model
> change. JUnit version update might also become quite a large change if
> instead of moving to 4.13.1, Drill is switching to JUnit5.
>
> What do people think?
>
> On Sat, Apr 24, 2021 at 1:00 PM Vitalii Diravka 
> wrote:
>
> > Hi Laurent,
> >
> > I want to include:
> > DRILL-7871  (preparing
> > PR)
> > DRILL-7908  (preparing
> > PR)
> > DRILL-7904  (PR is
> > opened, in review)
> > DRILL-7828  (PR is
> > opened, review is almost completed)
> >
> > All these tasks are expected to be completed in a week
> >
> > Kind regards
> > Vitalii
> >
> >
> > On Fri, Apr 23, 2021 at 9:25 PM Charles Givre  wrote:
> >
> > > Hi Laurent,
> > > We have a few PRs pending which I'd like to see in the next version
> which
> > > are:
> > > 1.  The update(s) and bug fixes to the Mongo plugin.
> > > 2.  There is an extended PR for bug fixes which clean up a lot of
> alerts
> > > generated by LGTM
> > > 3.  There are a few other library updates which are pending.
> > > 4.  We have some work which changes the access model around storage
> > > plugins which would be good for this release
> > > 5.  The PCAP/PCAP-NG consolidation is awaiting review.
> > >
> > > I think that's it.
> > > -- C
> > >
> > > > On Apr 22, 2021, at 12:33 PM, Laurent Goujon 
> > wrote:
> > > >
> > > > Hello everyone,
> > > >
> > > > It has been more than 6 months since the last release, and I believe
> > this
> > > > would be a good time to discuss the next one.
> > > >
> > > > As mentioned in a previous email thread, I am volunteering to be the
> > > > release manager, and I'm looking forward  working with the whole
> > > community
> > > > to make another great release.
> > > >
> > > > We have around 80 changes in master since the last release, and there
> > are
> > > > several changes open for review too. It would be nice if people could
> > > reply
> > > > to this email and share issues which should be part of that release,
> so
> > > we
> > > > can decide on an initial cut-off date.
> > > >
> > > > Thanks in advance,
> > > >
> > > > Laurent
> > >
> > >
> >
>


Re: Feature/Question

2021-04-27 Thread Ted Dunning
Akshay,

That's great news!

On Tue, Apr 27, 2021 at 1:10 PM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
abhasi...@bloomberg.net> wrote:

> Hi Ted,
>
> Thanks for reaching out. Yes - the below worked successfully.
>
> I was able to create different objects in s3 like 'XXX/YYY/filename',
> 'XXX/ZZZ/filename' and able to query like
> SELECT * FROM XXX.
>
> Thanks !
>
> Best,
> Akshay
>
> From: ted.dunn...@gmail.com At: 04/21/21 17:21:42
> To: Akshay Bhasin (BLOOMBERG/ 731 LEX ) ,
> dev@drill.apache.org
> Subject: Re: Feature/Question
>
>
> Akshay,
>
> Yes. It is possible to do what you want from a few different angles.
>
> As you have noted, S3 doesn't have directories. Not really. On the other
> hand, people simulate this using naming schemes and S3 has some support for
> this.
>
> One of the simplest ways to deal with this is to create a view that
> explicitly mentions every S3 object that you have in your table. The
> contents of this view can get a bit cumbersome, but that shouldn't be a
> problem since users never need to know. You will need to set up a scheduled
> action to update this view occasionally, but that is pretty simple.
>
> The other way is to use a naming scheme with a delimiter such as /. This
> is described at
> https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html
> If you do that and have files named (for instance) foo/a.json, foo/b.json,
> foo/c.json and you query
>
> select * from s3.`foo`
>
> you should see the contents of a.json, b.json and c.json. See here for
> commentary
> 
>
> I haven't tried this, however, so I am simply going on the reports of
> others. If this works for you, please report your success back here.
>
>
>
>
>
> On Wed, Apr 21, 2021 at 11:34 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
> abhasi...@bloomberg.net> wrote:
>
>> Hi Drill Community,
>>
>> I'm Akshay and I'm using Drill for a project I'm working on.
>>
>> There is this particular use case I want to implement - I want to know if
>> it's possible.
>>
>> 1) Currently, we have a partition of file system and we create a view on
>> top of it. For example, we have below directory structure -
>>
>> /home/product/product_name/year/month/day/*parquet
>> /home/product/product_name_2/year/month/day/*parquet
>> /home/product/product_name_3/year/month/day/*parquet
>>
>> Now, we create a view over it -
>> Create view temp AS SELECT `dir0` AS prod, `dir1` as year, `dir2` as
>> month, `dir3` as day, * from dfs.`/home/product`;
>>
>> Then, we can query all the data dynamically -
>> SELECT * from temp LIMIT 5;
>>
>> 2) Now I want to replicate this behavior via s3. I want to ask if it's
>> possible - I was able to create a logical directory. But s3 inherently does
>> not support directories only objects.
>>
>> Therefore, I was curious to know if this is supported or if there is a way to do it. I was
>> unable to find any documentation on your website related to partitioning
>> data on s3.
>>
>> Thanks for your help.
>> Best,
>> Akshay
>
>
>


Re: [DISCUSS] Drill 1.19.0 release

2021-04-22 Thread Ted Dunning
Laurent,

I don't have any issues that I feel I need in the next release, but I would
like to add my encouragement to what you are doing!!



On Thu, Apr 22, 2021 at 9:33 AM Laurent Goujon  wrote:

> Hello everyone,
>
> It has been more than 6 months since the last release, and I believe this
> would be a good time to discuss the next one.
>
> As mentioned in a previous email thread, I am volunteering to be the
> release manager, and I'm looking forward  working with the whole community
> to make another great release.
>
> We have around 80 changes in master since the last release, and there are
> several changes open for review too. It would be nice if people could reply
> to this email and share issues which should be part of that release, so we
> can decide on an initial cut-off date.
>
> Thanks in advance,
>
> Laurent
>


Re: Feature/Question

2021-04-21 Thread Ted Dunning
Akshay,

Yes. It is possible to do what you want from a few different angles.

As you have noted, S3 doesn't have directories. Not really. On the other
hand, people simulate this using naming schemes and S3 has some support for
this.

One of the simplest ways to deal with this is to create a view that
explicitly mentions every S3 object that you have in your table. The
contents of this view can get a bit cumbersome, but that shouldn't be a
problem since users never need to know. You will need to set up a scheduled
action to update this view occasionally, but that is pretty simple.
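
A minimal sketch of such a view, assuming an `s3` storage plugin and two
hypothetical objects, would be:

  CREATE OR REPLACE VIEW dfs.tmp.`all_foo` AS
    SELECT * FROM s3.`foo/a.json`
    UNION ALL
    SELECT * FROM s3.`foo/b.json`;

The scheduled job then just re-runs this statement with the current object
list.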

The other way is to use a naming scheme with a delimiter such as /. This is
described at
https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html
If you do that and have files named (for instance) foo/a.json, foo/b.json,
foo/c.json and you query

select * from s3.`foo`

you should see the contents of a.json, b.json and c.json. See here for
commentary


I haven't tried this, however, so I am simply going on the reports of
others. If this works for you, please report your success back here.





On Wed, Apr 21, 2021 at 11:34 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
abhasi...@bloomberg.net> wrote:

> Hi Drill Community,
>
> I'm Akshay and I'm using Drill for a project I'm working on.
>
> There is this particular use case I want to implement - I want to know if
> it's possible.
>
> 1) Currently, we have a partition of file system and we create a view on
> top of it. For example, we have below directory structure -
>
> /home/product/product_name/year/month/day/*parquet
> /home/product/product_name_2/year/month/day/*parquet
> /home/product/product_name_3/year/month/day/*parquet
>
> Now, we create a view over it -
> Create view temp AS SELECT `dir0` AS prod, `dir1` as year, `dir2` as
> month, `dir3` as day, * from dfs.`/home/product`;
>
> Then, we can query all the data dynamically -
> SELECT * from temp LIMIT 5;
>
> 2) Now I want to replicate this behavior via s3. I want to ask if it's
> possible - I was able to create a logical directory. But s3 inherently does
> not support directories only objects.
>
> Therefore, I was curious to know if this is supported or if there is a way to do it. I was
> unable to find any documentation on your website related to partitioning
> data on s3.
>
> Thanks for your help.
> Best,
> Akshay


Re: Drill Slowness

2021-04-19 Thread Ted Dunning
(adding more)

The issue described on Stack Overflow is also very likely if you have a
complex schema.



On Mon, Apr 19, 2021 at 8:40 AM Ted Dunning  wrote:

>
>
> Dileep,
>
> As Charles suggested, the problem is probably to do with limitations on
> what parts of the query are being pushed down into Oracle. That results in
> lots of data motion that isn't necessary when the query is executed
> entirely within Oracle because the Oracle database can use indexes and such
> to avoid even looking at much of the data (and even if it does look, it has
> much faster channels than JDBC available).
>
>
>
>
> On Mon, Apr 19, 2021 at 5:12 AM Charles Givre  wrote:
>
>> Dileep,
>> You also may want to take a look at this article from SO.
>>
>>
>> https://stackoverflow.com/questions/41814970/extremely-slow-apache-drill-query-using-oracle-jdbc
>>
>> -- C
>>
>>
>>
>> > On Apr 19, 2021, at 5:45 AM, dileep kumar  wrote:
>> >
>> > Hi Team,
>> >
>> >
>> >
>> > I am a novice in Drill and getting my hands dirty on Apache Drill.
>> >
>> > I have installed Drill and am trying to execute a query(joins multiple
>> > oracle tables) against Oracle database.
>> >
>> > The same query executed in 0.34 seconds on the Oracle server, but in
>> > Drill it took 31 minutes, which is weird.
>> >
>> > I tried increasing Heap memory but still no change in performance.
>> >
>> > Can someone there help me out ?
>> >
>> >
>> >
>> > Regards
>> >
>> > Dileep Kumar
>>
>>


Re: Drill Slowness

2021-04-19 Thread Ted Dunning
Dileep,

As Charles suggested, the problem is probably to do with limitations on
what parts of the query are being pushed down into Oracle. That results in
lots of data motion that isn't necessary when the query is executed
entirely within Oracle because the Oracle database can use indexes and such
to avoid even looking at much of the data (and even if it does look, it has
much faster channels than JDBC available).
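
A quick way to see how much of a query is pushed down is to look at the
plan (the plugin and table names here are hypothetical):

  EXPLAIN PLAN FOR
  SELECT * FROM oracle.hr.`employees` WHERE `dept_id` = 10;

If the WHERE clause shows up inside the SQL of the Jdbc scan operator, it
is being pushed down; if it appears as a separate Drill Filter operator,
Drill is pulling all the rows over JDBC first.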




On Mon, Apr 19, 2021 at 5:12 AM Charles Givre  wrote:

> Dileep,
> You also may want to take a look at this article from SO.
>
>
> https://stackoverflow.com/questions/41814970/extremely-slow-apache-drill-query-using-oracle-jdbc
>
> -- C
>
>
>
> > On Apr 19, 2021, at 5:45 AM, dileep kumar  wrote:
> >
> > Hi Team,
> >
> >
> >
> > I am a novice in Drill and getting my hands dirty on Apache Drill.
> >
> > I have installed Drill and am trying to execute a query(joins multiple
> > oracle tables) against Oracle database.
> >
> > The same query executed in 0.34 seconds on the Oracle server, but in
> > Drill it took 31 minutes, which is weird.
> >
> > I tried increasing Heap memory but still no change in performance.
> >
> > Can someone there help me out ?
> >
> >
> >
> > Regards
> >
> > Dileep Kumar
>
>


Re: Issue with connecting Apache Drill to S3

2021-04-13 Thread Ted Dunning
What happens when you just run a query that references s3 data instead of
trying to run the `use s3` command?
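
For instance (the object path is hypothetical):

  SELECT * FROM s3.`some/path/file.json` LIMIT 10;

Referencing the workspace directly in the FROM clause will often surface a
more specific error than `use s3` does.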



On Tue, Apr 13, 2021 at 6:15 AM khushalkj jain 
wrote:

> Hi All,
> Need your help in fixing the below Issue.
> I am running drill locally on my MAC in embedded mode.
>
> *Query:*
>
> > use s3;
>
> *Log with error info:*
>
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> > AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400;
> Error
> > Code: 400 Bad Request; Request ID: G9TTDZNV531H5RS9; S3 Extended Request
> > ID:
> >
> ihj+EsqMcF3qlP2EYHBwuarC5mOqiQ/PvVfgmu722WY8pL5VgRU69gbl4U1B3vpNqYYjcbiejGs=)
> > Please, refer to logs for more information.
> > [Error Id: ab06e603-feea-40cc-933f-85904f731ed8 on 192.168.1.7:31010]
> >   (org.apache.drill.exec.work.foreman.ForemanException) Unexpected
> > exception during fragment initialization: Failed to create
> DrillFileSystem
> > for proxy user: doesBucketExist on c360-archival:
> > com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service:
> > Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID:
> > G9TTDZNV531H5RS9; S3 Extended Request ID:
> >
> ihj+EsqMcF3qlP2EYHBwuarC5mOqiQ/PvVfgmu722WY8pL5VgRU69gbl4U1B3vpNqYYjcbiejGs=),
> > S3 Extended Request ID:
> >
> ihj+EsqMcF3qlP2EYHBwuarC5mOqiQ/PvVfgmu722WY8pL5VgRU69gbl4U1B3vpNqYYjcbiejGs=:400
> > Bad Request: Bad Request (Service: Amazon S3; Status Code: 400; Error
> Code:
> > 400 Bad Request; Request ID: G9TTDZNV531H5RS9; S3 Extended Request ID:
> >
> ihj+EsqMcF3qlP2EYHBwuarC5mOqiQ/PvVfgmu722WY8pL5VgRU69gbl4U1B3vpNqYYjcbiejGs=)
> > org.apache.drill.exec.work.foreman.Foreman.run():301
> > java.util.concurrent.ThreadPoolExecutor.runWorker():1130
> > java.util.concurrent.ThreadPoolExecutor$Worker.run():630
> > java.lang.Thread.run():832
> >   Caused By (org.apache.drill.common.exceptions.DrillRuntimeException)
> > Failed to create DrillFileSystem for proxy user: doesBucketExist on
> > c360-archival: com.amazonaws.services.s3.model.AmazonS3Exception: Bad
> > Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad
> Request;
> > Request ID: G9TTDZNV531H5RS9; S3 Extended Request ID:
> >
> ihj+EsqMcF3qlP2EYHBwuarC5mOqiQ/PvVfgmu722WY8pL5VgRU69gbl4U1B3vpNqYYjcbiejGs=),
> > S3 Extended Request ID:
> >
> ihj+EsqMcF3qlP2EYHBwuarC5mOqiQ/PvVfgmu722WY8pL5VgRU69gbl4U1B3vpNqYYjcbiejGs=:400
> > Bad Request: Bad Request (Service: Amazon S3; Status Code: 400; Error
> Code:
> > 400 Bad Request; Request ID: G9TTDZNV531H5RS9; S3 Extended Request ID:
> >
> ihj+EsqMcF3qlP2EYHBwuarC5mOqiQ/PvVfgmu722WY8pL5VgRU69gbl4U1B3vpNqYYjcbiejGs=)
> > org.apache.drill.exec.util.ImpersonationUtil.createFileSystem():220
> > org.apache.drill.exec.util.ImpersonationUtil.createFileSystem():205
> >
> >
> org.apache.drill.exec.store.dfs.FileSystemSchemaFactory$FileSystemSchema.():84
> >
> >
> org.apache.drill.exec.store.dfs.FileSystemSchemaFactory.registerSchemas():72
> >
>  org.apache.drill.exec.store.dfs.FileSystemPlugin.registerSchemas():232
> > org.apache.calcite.jdbc.DynamicRootSchema.loadSchemaFactory():87
> > org.apache.calcite.jdbc.DynamicRootSchema.getImplicitSubSchema():72
> > org.apache.calcite.jdbc.CalciteSchema.getSubSchema():265
> >
>  org.apache.calcite.jdbc.CalciteSchema$SchemaPlusImpl.getSubSchema():684
> >
>  org.apache.drill.exec.planner.sql.SchemaUtilites.searchSchemaTree():98
> > org.apache.drill.exec.planner.sql.SchemaUtilites.findSchema():51
> > org.apache.drill.exec.rpc.user.UserSession.setDefaultSchemaPath():225
> >
> > org.apache.drill.exec.planner.sql.handlers.UseSchemaHandler.getPlan():43
> > org.apache.drill.exec.planner.sql.DrillSqlWorker.getQueryPlan():283
> >
>  org.apache.drill.exec.planner.sql.DrillSqlWorker.getPhysicalPlan():163
> > org.apache.drill.exec.planner.sql.DrillSqlWorker.convertPlan():128
> > org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan():93
> > org.apache.drill.exec.work.foreman.Foreman.runSQL():593
> > org.apache.drill.exec.work.foreman.Foreman.run():274
> > java.util.concurrent.ThreadPoolExecutor.runWorker():1130
> > java.util.concurrent.ThreadPoolExecutor$Worker.run():630
> > java.lang.Thread.run():832
> >   Caused By (org.apache.hadoop.fs.s3a.AWSBadRequestException)
> > doesBucketExist on c360-archival:
> > com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service:
> > Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID:
> > G9TTDZNV531H5RS9; S3 Extended Request ID:
> >
> ihj+EsqMcF3qlP2EYHBwuarC5mOqiQ/PvVfgmu722WY8pL5VgRU69gbl4U1B3vpNqYYjcbiejGs=),
> > S3 Extended Request ID:
> >
> ihj+EsqMcF3qlP2EYHBwuarC5mOqiQ/PvVfgmu722WY8pL5VgRU69gbl4U1B3vpNqYYjcbiejGs=:400
> > Bad Request: Bad Request (Service: Amazon S3; Status Code: 400; Error
> Code:
> > 400 Bad Request; Request ID: G9TTDZNV531H5RS9; S3 Extended Request ID:
> >
> ihj+EsqMcF3qlP2EYHBwuarC5mOqiQ/PvVfgmu722WY8pL5VgRU69gbl4U1B3vpNqYYjcbiejGs=)
> >

Re: Requesting a release

2021-04-13 Thread Ted Dunning
Laurent,

It is great to hear this!



On Tue, Apr 13, 2021 at 8:29 AM Laurent Goujon  wrote:

> It's true that our contributions to Drill have dwindled a bit, unfortunately, as
> we decided to take a different approach on some core aspects, but we are
> still using the same User protocol for JDBC/ODBC connectivity, and we are
> still contributing those changes to the Apache Drill project hopefully to
> the benefit of both projects.
>
> Laurent
>
> On Mon, Apr 12, 2021 at 3:26 PM Charles Givre  wrote:
>
> > Laurent,
> > Thanks for volunteering.  There are a few PRs and work planned in the
> next
> > month or so for the next release.  Additionally, there are some tests
> which
> > must be run outside of the regular unit test framework that require a
> > special setup.
> >
> > I have to ask as I am curious but what/why is Dremio still interested in
> > Drill?  All the code is OSS, and you could just use what's already in
> > github without it being "released".
> > Best,
> > - C
> >
> >
> >
> > > On Apr 12, 2021, at 5:45 PM, Laurent Goujon 
> wrote:
> > >
> > > Hi Ted,
> > >
> > > I was led to believe that only a PMC member could perform some of the
> > > release tasks, but if not the case, I'm happy to volunteer for the next
> > > one. Since it would be my first release, is there any document
> detailing
> > > the list of tasks to be completed?
> > >
> > > On Mon, Apr 12, 2021 at 1:55 PM Ted Dunning 
> > wrote:
> > >
> > >> Hey Ray,
> > >>
> > >> Any Drill committer should be able to act as a release manager.
> > >>
> > >> My guess is that you know several Drill committers at Dremio who might
> > be
> > >> able to help with this.
> > >>
> > >>
> > >>
> > >> On Mon, Apr 12, 2021 at 12:00 PM Ray Lum  wrote:
> > >>
> > >>> Hi Drill community,
> > >>>
> > >>> Is there a process for requesting a release of the latest code
> > currently
> > >> in
> > >>> master? I am keen on adopting some of the changes if they were in an
> > >>> official release.
> > >>>
> > >>> Thanks kindly,
> > >>> Ray
> > >>>
> > >>
> >
> >
>


Re: Requesting a release

2021-04-12 Thread Ted Dunning


Laurent, 

There are definitely some steps that require privileges, but the management of 
the process doesn't require privileges.

The key steps are a) developing consensus on the content of the release, b) 
building a release candidate and c) conducting the vote. After these, pushing 
the artifacts may require some mojo, but it is easy to get somebody to help. 
None of these other steps require more than a commit bit.

Remember that the core point of a release is the community involvement, not the 
technical aspects of packaging.

On 2021/04/12 21:45:22, Laurent Goujon  wrote: 
> Hi Ted,
> 
> I was led to believe that only a PMC member could perform some of the
> release tasks, but if not the case, I'm happy to volunteer for the next
> one. Since it would be my first release, is there any document detailing
> the list of tasks to be completed?
> 
> On Mon, Apr 12, 2021 at 1:55 PM Ted Dunning  wrote:
> 
> > Hey Ray,
> >
> > Any Drill committer should be able to act as a release manager.
> >
> > My guess is that you know several Drill committers at Dremio who might be
> > able to help with this.
> >
> >
> >
> > On Mon, Apr 12, 2021 at 12:00 PM Ray Lum  wrote:
> >
> > > Hi Drill community,
> > >
> > > Is there a process for requesting a release of the latest code currently
> > in
> > > master? I am keen on adopting some of the changes if they were in an
> > > official release.
> > >
> > > Thanks kindly,
> > > Ray
> > >
> >
> 


Re: Requesting a release

2021-04-12 Thread Ted Dunning
Hey Ray,

Any Drill committer should be able to act as a release manager.

My guess is that you know several Drill committers at Dremio who might be
able to help with this.



On Mon, Apr 12, 2021 at 12:00 PM Ray Lum  wrote:

> Hi Drill community,
>
> Is there a process for requesting a release of the latest code currently in
> master? I am keen on adopting some of the changes if they were in an
> official release.
>
> Thanks kindly,
> Ray
>


Re: Regular Video Calls?

2021-03-04 Thread Ted Dunning
Luoc,

Don't feel shy about not being able to speak well. We all have important
languages that we can't speak well. If you can mostly understand spoken
language, you can still easily participate. For one thing, there is the
chat as Curtis points out. Based on your email, it looks like you are
pretty good with written English, in any case.

For another point, there may be somebody else on the meeting who can
understand a language you are more comfortable with.

But the most important thing to remember is that a video meeting is not a
replacement for the mailing list. Any important ideas from a live meeting
need to be brought back to the mailing list for discussion and decision
making. It isn't reasonable to expect all of our worldwide participants to
be in any kind of realtime meeting. I am speaking right now at 7AM because
I got up for an inconvenient meeting (after staying up late) so I really
feel that pain.

I love your enthusiasm for the project and it would be terrible to lose
that!



On Wed, Mar 3, 2021 at 10:20 PM luoc  wrote:

> Wow!
>   That sounds good, even though I only know a little spoken English. I don't
> think you are going to block the people who only listen to the discussion.
> GMT+8
>
> > On Mar 4, 2021, at 3:13 AM, Ted Dunning  wrote:
> >
> > I am still around, but not super active lately. Real life has intruded a
> > lot over the last two years.
> >
> >
> >
> > On Wed, Mar 3, 2021 at 11:02 AM Curtis Lambert 
> > wrote:
> >
> >> Thanks Ted! I've been reading the archive history for the mailing list
> and
> >> see you on there a lot, right from the start. Glad to see you're still
> >> around and active on here!
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Mar 3, 2021 at 12:26 PM Ted Dunning 
> wrote:
> >>
> >>> Curtis,
> >>>
> >>> I think that would be a great thing. The Drill community has changed
> over
> >>> the last few years and having periodic events could help people come
> >>> together in a new way.
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, Mar 3, 2021 at 6:35 AM Curtis Lambert  >
> >>> wrote:
> >>>
> >>>> All,
> >>>>
> >>>> I'm still very new to Drill but as I'm getting spun up on things I
> >>> noticed
> >>>> there used to be google hangout meets every two weeks but they appear
> >> to
> >>>> have stopped in 2017. Looking to gather input on if they are worth
> >>> starting
> >>>> back up and what points they would cover (recognizing all decisions
> are
> >>>> made here not in the meetings). I'm willing to organize and host if
> the
> >>>> interest is there.
> >>>>
> >>>> Please weigh in on these points:
> >>>>
> >>>>   - If we had regular video calls would you attend?
> >>>>   - What is the group's preferred application for that? (Zoom/Google?)
> >>>>   - Periodicity of the calls (every two weeks, monthly, other?)
> >>>>   - Time of day? (lets use Zulu/GMT/UTC for this point to normalize)
> >>>>   - Potential topics/scope? (I think some combination of design
> >>>>   discussions and component walkthrough/learning/presenting would be
> >>> good
> >>>> for
> >>>>   expanding and invigorating the community)
> >>>>
> >>>>
> >>>>
> >>>
> >>
>
>


Re: Regular Video Calls?

2021-03-03 Thread Ted Dunning
I am still around, but not super active lately. Real life has intruded a
lot over the last two years.



On Wed, Mar 3, 2021 at 11:02 AM Curtis Lambert 
wrote:

> Thanks Ted! I've been reading the archive history for the mailing list and
> see you on there a lot, right from the start. Glad to see you're still
> around and active on here!
>
>
>
>
>
> On Wed, Mar 3, 2021 at 12:26 PM Ted Dunning  wrote:
>
> > Curtis,
> >
> > I think that would be a great thing. The Drill community has changed over
> > the last few years and having periodic events could help people come
> > together in a new way.
> >
> >
> >
> >
> > On Wed, Mar 3, 2021 at 6:35 AM Curtis Lambert 
> > wrote:
> >
> > > All,
> > >
> > > I'm still very new to Drill but as I'm getting spun up on things I
> > noticed
> > > there used to be google hangout meets every two weeks but they appear
> to
> > > have stopped in 2017. Looking to gather input on if they are worth
> > starting
> > > back up and what points they would cover (recognizing all decisions are
> > > made here not in the meetings). I'm willing to organize and host if the
> > > interest is there.
> > >
> > > Please weigh in on these points:
> > >
> > >- If we had regular video calls would you attend?
> > >- What is the group's preferred application for that? (Zoom/Google?)
> > >- Periodicity of the calls (every two weeks, monthly, other?)
> > >- Time of day? (lets use Zulu/GMT/UTC for this point to normalize)
> > >- Potential topics/scope? (I think some combination of design
> > >discussions and component walkthrough/learning/presenting would be
> > good
> > > for
> > >expanding and invigorating the community)
> > >
> > >
> > >
> >
>


Re: Regular Video Calls?

2021-03-03 Thread Ted Dunning
Curtis,

I think that would be a great thing. The Drill community has changed over
the last few years and having periodic events could help people come
together in a new way.




On Wed, Mar 3, 2021 at 6:35 AM Curtis Lambert 
wrote:

> All,
>
> I'm still very new to Drill but as I'm getting spun up on things I noticed
> there used to be google hangout meets every two weeks but they appear to
> have stopped in 2017. Looking to gather input on if they are worth starting
> back up and what points they would cover (recognizing all decisions are
> made here not in the meetings). I'm willing to organize and host if the
> interest is there.
>
> Please weigh in on these points:
>
>- If we had regular video calls would you attend?
>- What is the group's preferred application for that? (Zoom/Google?)
>- Periodicity of the calls (every two weeks, monthly, other?)
>- Time of day? (lets use Zulu/GMT/UTC for this point to normalize)
>- Potential topics/scope? (I think some combination of design
>discussions and component walkthrough/learning/presenting would be good
> for
>expanding and invigorating the community)
>
>
> Curtis Lambert, CTO
> cur...@datdistillr.com | + 706-402-0249
>


Re: [DISCUSSION] ARM-based compatibility tests

2021-01-27 Thread Ted Dunning
Cool.



On Wed, Jan 27, 2021 at 12:40 PM Ganesh Raju  wrote:

> Ted,
> These hardware would be a proper ARM based datacenter server VM instance
>
> Ganesh
>
> On Wed, Jan 27, 2021 at 12:00 PM Ted Dunning 
> wrote:
>
> > Yes. The ARM-based macs sound pretty exciting.
> >
> > My own laptop is about 5 years old so it might be time to think about
> it. I
> > have two ARMs on my desk and 4 Intel machines. The odds could even up if
> > the wind blows right.
> >
> >
> >
> > On Wed, Jan 27, 2021 at 5:39 AM luoc  wrote:
> >
> > > Hi,
> > >
> > > @Ted Dunning, I saw that Apple has released an ARM-based Mac (the CPU
> > > called M1); it could maybe drive the open source ecosystem.
> > >
> > > @Ganesh, Have you donated machines to other Apache TLP?
> > >
> > > > On Jan 27, 2021, at 8:21 PM, Vitalii Diravka  wrote:
> > > >
> > > > Hi Ganesh!
> > > >
> > > > Could you give more info how it can be used, for what period of time
> > and
> > > > under what terms of use?
> > > >
> > > > Thanks
> > > >
> > > > Kind regards
> > > > Vitalii
> > > >
> > > >
> > > > On Tue, Jan 26, 2021 at 7:42 PM Ganesh Raju 
> > > wrote:
> > > >
> > > >> We could also donate ARM machines to setup in CI, if it would make
> > > sense.
> > > >>
> > > >> Regards
> > > >> Ganesh
> > > >>
> > > >> On Tue, Jan 26, 2021 at 11:02 AM Ted Dunning  >
> > > >> wrote:
> > > >>
> > > >>> I did some minimal testing in embedded mode way back, but nothing
> > > >> serious.
> > > >>>
> > > >>> I saw no issues at all.
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Tue, Jan 26, 2021 at 2:53 AM luoc  wrote:
> > > >>>
> > > >>>> Hi all,
> > > >>>>
> > > >>>> I have some ARM-based machines (Not X86 architecture), and then want
> > > >>>> to do ARM-based compatibility tests. I know that Netty must bump to
> > > >>>> 4.1 at first (See also #9804 <https://github.com/netty/netty/pull/9804>).
> > > >>>> Do we have anything else to upgrade? thanks for your time.
> > > >>>>
> > > >>>> Kind regards
> > > >>>> luoc
> > > >>>
> > > >>
> > > >>
> > > >> --
> > > >> IRC: ganeshraju@#linaro on irc.freenode.net
> > > >>
> > >
> > >
> >
>
>
> --
> IRC: ganeshraju@#linaro on irc.freenode.net
>


Re: [DISCUSSION] ARM-based compatibility tests

2021-01-27 Thread Ted Dunning
Yes. The ARM-based macs sound pretty exciting.

My own laptop is about 5 years old so it might be time to think about it. I
have two ARMs on my desk and 4 Intel machines. The odds could even up if
the wind blows right.



On Wed, Jan 27, 2021 at 5:39 AM luoc  wrote:

> Hi,
>
> @Ted Dunning, I saw that Apple has released an ARM-based Mac (the CPU
> called M1); it could maybe drive the open source ecosystem.
>
> @Ganesh, Have you donated machines to other Apache TLP?
>
> > On Jan 27, 2021, at 8:21 PM, Vitalii Diravka  wrote:
> >
> > Hi Ganesh!
> >
> > Could you give more info how it can be used, for what period of time and
> > under what terms of use?
> >
> > Thanks
> >
> > Kind regards
> > Vitalii
> >
> >
> > On Tue, Jan 26, 2021 at 7:42 PM Ganesh Raju 
> wrote:
> >
> >> We could also donate ARM machines to setup in CI, if it would make
> sense.
> >>
> >> Regards
> >> Ganesh
> >>
> >> On Tue, Jan 26, 2021 at 11:02 AM Ted Dunning 
> >> wrote:
> >>
> >>> I did some minimal testing in embedded mode way back, but nothing
> >> serious.
> >>>
> >>> I saw no issues at all.
> >>>
> >>>
> >>>
> >>> On Tue, Jan 26, 2021 at 2:53 AM luoc  wrote:
> >>>
> >>>> Hi all,
> >>>>
> > >>>> I have some ARM-based machines (Not X86 architecture), and then want
> > >>>> to do ARM-based compatibility tests. I know that Netty must bump to
> > >>>> 4.1 at first (See also #9804 <https://github.com/netty/netty/pull/9804>).
> > >>>> Do we have anything else to upgrade? thanks for your time.
> >>>>
> >>>> Kind regards
> >>>> luoc
> >>>
> >>
> >>
> >> --
> >> IRC: ganeshraju@#linaro on irc.freenode.net
> >>
>
>


Re: merge-apache-drill-master-onto-mapr-master - Build # 1454 - Still Failing!

2021-01-27 Thread Ted Dunning
Huh.  I will look at it.

I think that there is a build system that could easily still have old email
configured. As you point out, that should be updated.


On Wed, Jan 27, 2021, 1:21 AM Niels Basjes  wrote:

> Hi,
>
> I got this email below.
> Seems like at HPE they still have something running under the Mapr name
> that is sending out emails about Apache Drill.
> Any ideas on who manages this?
>
> Perhaps Ted has an idea who to forward this to?
>
> Niels Basjes
>
>
> -- Forwarded message -
> From: 
> Date: Tue, 26 Jan 2021, 15:27
> Subject: merge-apache-drill-master-onto-mapr-master - Build # 1454 - Still
> Failing!
> To: , Drill QA , Drill Dev Team <
> drilldevt...@mapr.com>
>
>
> merge-apache-drill-master-onto-mapr-master - Build # 1454 - Still Failing:
> Check console output at
> http://10.10.100.187:8080/job/merge-apache-drill-master-onto-mapr-master/1454/
> to view the results.
>


Re: [DISCUSSION] ARM-based compatibility tests

2021-01-26 Thread Ted Dunning
I did some minimal testing in embedded mode way back, but nothing serious.

I saw no issues at all.



On Tue, Jan 26, 2021 at 2:53 AM luoc  wrote:

> Hi all,
>
> I have some ARM-based machines (Not X86 architecture), and then want to do
> ARM-based compatibility tests. I know that Netty must bump to 4.1 at first
> (See also #9804 <https://github.com/netty/netty/pull/9804>). Do we have
> anything else to upgrade? thanks for your time.
>
> Kind regards
> luoc


Re: [DISCUSSION] Roles and Privileges, Security, Secrets

2021-01-20 Thread Ted Dunning
I think that pushing too much of this kind of authentication and
authorization logic into Drill has a large complexity risk. Anything to do
with kerberos magnifies that complexity.

I also think that it is a mistake to depend on user identity if
authorization tokens are likely to need to be embedded in scripts and such.
Identity that is inherited can work that way, but identity that has to be
given to a script should use an alternative intended for workload
authorization such as SPIFFE.

Is there a reason that most or all of this couldn't be handled by storing
the configuration in files? That would allow file permissions to naturally
allow or disallow these operations.

Also, what are the specific goals here?



On Wed, Jan 20, 2021 at 3:34 PM Vitalii Diravka  wrote:

> Hi Dev and User,
>
> Drill has a very important feature - Roles and Privileges [1], but it has
> really weak functionality. There are only two roles (admin and user) and
> admin can't really give any user permissions to set query options for all
> their sessions or to allow configure storage plugin in other manner, etc.
>
> I think it is necessary to make this functionality broader: introduce a
> middle layer user-system options, the ability to change some configs of
> Storage Plugins for users, possibly permission for UDF creation etc. The
> main thing is that this functionality requires good support for management
> of users and their secrets (credentials).
>
> There is a very good tool  - Hashicorp Vault [2], which can provide Drill a
> mechanism to store secrets in a safe manner, to deliver the secrets via
> tokens mechanism to the proper users and it can be integrated with Kerberos
> and Spnego.
>
> What do you think? Can we integrate Drill with Vault or not, and what are
> the pros and cons of this decision? If it is a good decision I can start
> preparing a design for this functionality.
>
>
> [1] https://drill.apache.org/docs/roles-and-privileges/
> [2] https://www.vaultproject.io/
>
> Kind regards
> Vitalii
>
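As a concrete illustration of the Vault idea above, here is a minimal Java
sketch using the community BetterCloud vault-java-driver. The server address,
token handling, and secret path are assumptions made for the example; this is
not anything Drill currently ships.

import com.bettercloud.vault.Vault;
import com.bettercloud.vault.VaultConfig;
import com.bettercloud.vault.VaultException;

public class VaultSecretSketch {
    public static void main(String[] args) throws VaultException {
        // Assumed: a Vault server reachable at this address, and a token
        // (supplied via the environment) with read access to the path below.
        VaultConfig config = new VaultConfig()
                .address("http://127.0.0.1:8200")
                .token(System.getenv("VAULT_TOKEN"))
                .build();
        Vault vault = new Vault(config);

        // Read a hypothetical storage-plugin credential from Vault.
        String password = vault.logical()
                .read("secret/drill/storage/mysql")
                .getData()
                .get("password");

        System.out.println("Fetched a credential of length " + password.length());
    }
}

Delivering per-user tokens, and integrating them with Kerberos/SPNEGO as
Vitalii suggests, would sit on top of a call pattern like this one.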


Re: [VOTE]: James Turton for Committer

2020-11-05 Thread Ted Dunning
I think that looks like a great addition.

+1 for James.

I don't think that lazy consensus is a great idea, however. Happily, you
now have three positives.


On Wed, Nov 4, 2020 at 11:00 AM Charles Givre  wrote:

> Hello all,
> I'd like to call a vote for James Turton for committer.  James has been
> doing some serious work on the Drill website and documentation.   I'd like
> to use the lazy consensus option for this vote (
> https://www.apache.org/foundation/voting.html#LazyConsensus <
> https://www.apache.org/foundation/voting.html#LazyConsensus>). This means
> that silence implies consent.   If nobody objects by Monday, I'll assume
> lazy consensus and proceed.
>
> Thanks!
> -- C


Re: SSL auth fails after upgrading Netty version

2020-09-17 Thread Ted Dunning
Hey Alka,

I am a little confused about motivations here.

Can you say whether you get what you need from 1.18?


On Thu, Sep 17, 2020 at 6:49 AM alka kumari 
wrote:

> Hi Team,
>
> I am using Apache Drill JDBC project.
> I built Drill client 1.17 with Netty version 4.1.50.Final. There are some
> breaking changes because of Netty upgrade and I need to make changes in the
> below files:-
>
> 1.)
> common\src\main\java\org\apache\drill\common\collections\MapWithOrdinal.java
> 2.)
> drill\exec\java-exec\src\main\java\org\apache\drill\exec\record\DeadBuf.java
> 3.) drill\exec\memory\base\src\main\java\io\netty\buffer\DrillBuf.java
> 4.)
> drill\exec\memory\base\src\main\java\io\netty\buffer\MutableWrappedByteBuf.java
> 5.)
> drill\exec\memory\base\src\main\java\org\apache\drill\exec\memory\DrillByteBufAllocator.java
>
> All these files got some newly overridden methods. (Changes are on my
> local machine)
>
> *ISSUE:*
> With newly built jars  SSL connection is not working  and throwing below
> errors: (plain auth without SSL is working)
> *Connecting to the server timed out. This is sometimes due to a mismatch
> in the SSL configuration between client and server. [ Exception: Waited
> 1 milliseconds for
> org.apache.drill.shaded.guava.com.google.common.util.concurrent.SettableFuture@6ea2bc93[status=PENDING]]*
>
> *Exception: Waited 1 milliseconds for
> org.apache.drill.shaded.guava.com.google.common.util.concurrent.SettableFuture@6ea2bc93[status=PENDING]
> - This particular Exception is coming from AbstractFuture.java
> (drill-shaded-guava-23.0.jar). *Attaching the image for reference.
>
> *Can I get some suggestions on this issue and could you tell me whether
> the issue is in the drill-shaded-guava side that can probably be fixed in
> Drill client 1.18 (As drill client 1.18 drill-shaded-guava has been
> upgraded to 28.2-jre).*
>
> NOTE: I have tested the SSL configuration on the connection string and
> server-side. They are correct.
> Looking for some help.
>
> Thanks &  Regards,
> Alka Kumari
>


Re: Calcite Question for you

2020-06-24 Thread Ted Dunning
Charles,

Do you think that suggesting using a non-materialized view would help in
this case?

On Wed, Jun 24, 2020 at 2:31 PM Charles Givre  wrote:

> Hi Vova,
> I have a Calcite question for you.  I’m helping someone debug an issue
> they’re having connecting Drill to an OracleDB and they’re running into the
> issue you reported here:
> https://issues.apache.org/jira/browse/CALCITE-3533 <
> https://issues.apache.org/jira/browse/CALCITE-3533>
>
> I’m wondering if there’s a way that we could fix this on the Drill side
> since it doesn’t look like Calcite is going to fix this.  What do you think?
> Best,
> — C


Re: compile issue with MapR repo is being worked

2020-05-04 Thread Ted Dunning
Forwarded.



On Mon, May 4, 2020 at 6:15 AM Charles Givre  wrote:

> HI Ted, Vova,
> My PR is still blocked by the MapR repos.  After reverting back to the
> HTTP repo (which does seem to be working) we're now getting the following
> error:
>
> [ERROR] Failed to execute goal on project drill-format-mapr: Could not
> resolve dependencies for project
> org.apache.drill.contrib:drill-format-mapr:jar:1.18.0-SNAPSHOT: Could not
> transfer artifact com.mapr.hadoop:maprfs:jar:6.1.0-mapr from/to
> mapr-releases (http://repository.mapr.com/maven/): GET request of:
> com/mapr/hadoop/maprfs/6.1.0-mapr/maprfs-6.1.0-mapr.jar from mapr-releases
> failed: Premature end of Content-Length delimited message body (expected:
> 67,884,262; received: 47,333,376) -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please read
> the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the command
> [ERROR]   mvn  -rf :drill-format-mapr
> ##[error]Process completed with exit code 1.
>
> Thanks for your help on quickly addressing this issue.
> -- C
>
> > On May 4, 2020, at 2:48 AM, Vova Vysotskyi  wrote:
> >
> > Hi Ted,
> >
> > Thanks for your help! It looks like for http protocol this issue was
> > resolved.
> >
> > Kind regards,
> > Volodymyr Vysotskyi
> >
> >
> > On Mon, May 4, 2020 at 4:19 AM Charles Givre  wrote:
> >
> >> Hi Ted,
> >> Thanks for your help.  You can view the logs here:
> >> https://github.com/apache/drill/pull/2067 in the CI stuff.
> >> -- C
> >>
> >>
> >>
> >>
> >>> On May 3, 2020, at 9:16 PM, Ted Dunning  wrote:
> >>>
> >>> I will pass the word.
> >>>
> >>> Do you have logs?
> >>>
> >>>
> >>> On Sun, May 3, 2020 at 4:15 PM Charles Givre  wrote:
> >>>
> >>>> Hi Ted,
> >>>> Thanks for looking into this so quickly.  Unfortunately, I re-ran the
> CI
> >>>> jobs from github and it is still producing the same errors.
> >>>> Best,
> >>>> --C
> >>>>
> >>>>> On May 3, 2020, at 5:58 PM, Ted Dunning 
> wrote:
> >>>>>
> >>>>> It appears that the certificate issue is resolved.
> >>>>>
> >>>>> Can somebody verify this by doing a compilation?
> >>>>>
> >>>>> I have to add that based on the number of off-line and on-list pings
> I
> >>>> got
> >>>>> about this issue I can say that there were quite a few people
> compiling
> >>>>> Drill on a Sunday morning. That bodes well, I think, for community
> >>>> health.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Sun, May 3, 2020 at 11:27 AM Ted Dunning 
> >>>> wrote:
> >>>>>
> >>>>>>
> >>>>>> I just got word back that the team is looking at the issue.
> >>>>>>
> >>>>>> Not surprisingly, their first look indicates that the issue isn't
> what
> >>>> it
> >>>>>> appears to be (i.e. not a bad cert)
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>


Re: compile issue with MapR repo is being worked

2020-05-03 Thread Ted Dunning
I will pass the word.

Do you have logs?


On Sun, May 3, 2020 at 4:15 PM Charles Givre  wrote:

> Hi Ted,
> Thanks for looking into this so quickly.  Unfortunately, I re-ran the CI
> jobs from github and it is still producing the same errors.
> Best,
> --C
>
> > On May 3, 2020, at 5:58 PM, Ted Dunning  wrote:
> >
> > It appears that the certificate issue is resolved.
> >
> > Can somebody verify this by doing a compilation?
> >
> > I have to add that based on the number of off-line and on-list pings I
> got
> > about this issue I can say that there were quite a few people compiling
> > Drill on a Sunday morning. That bodes well, I think, for community
> health.
> >
> >
> >
> > On Sun, May 3, 2020 at 11:27 AM Ted Dunning 
> wrote:
> >
> >>
> >> I just got word back that the team is looking at the issue.
> >>
> >> Not surprisingly, their first look indicates that the issue isn't what
> it
> >> appears to be (i.e. not a bad cert)
> >>
> >>
> >>
>
>


Re: Drill with No-SQL [was: Cannot Build Drill "exec/Java Execution Engine"]

2020-05-03 Thread Ted Dunning
I didn't mention Presto on purpose. It is a fine tool, but the community is
plagued lately by a fork. That can be expected to substantially inhibit
adoption and I think that is just what I have seen. It used to be that
people asked about Presto every other time I was on a call and I haven't
heard even one such question in over a year. The community may recover from
this, but it is hard to say whether they can regain their momentum.

In case anybody wants to sample the confusion, here are the two "official"
homes on github:

https://github.com/prestodb/presto
https://github.com/prestosql/presto

The worst part is that neither fork seems to dominate the other. With the
Hudson/Jenkins fork, at least, Hudson basically died while Jenkins continued
with full momentum. Here, both sides seem to be splitting things much too
evenly.



On Sun, May 3, 2020 at 2:42 PM Paul Rogers 
wrote:

> Hi Tug,
>
> Glad to hear from you again. Ted's summary is pretty good; here's a bit
> more detail.
>
>
> Presto is another alternative which seems to have gained the most traction
> outside of the Cloud ecosystem on the one hand, and the
> Cloudera/HortonWorks ecosystem on the other. Presto does, however, demand
> that you have a schema, which is often an obstacle for many applications.
>
> Most folks I've talked to who tried to use Spark for this use case came
> away disappointed. Unlike Drill (or Presto or Impala), Spark wants to start
> new Java processes for each query. Makes great sense for large, complex
> map/reduce jobs, but is a non-starter for small, interactive queries.
>
> Hive also is trying to be an "uber query layer" and has integrations with
> multiple systems. But, Hive's complexity makes Drill look downright simple
> by comparison. Hive also needs an up-front schema.
>
>
> I've had the opportunity to integrate Drill with two different noSQL
> engines. Getting started is easy, especially if a REST or similar API is
> available. Filter push-down is the next step as otherwise Drill will simply
> suck all data from your DB as it it were a file. We've added some structure
> in the new HTTP reader to make it a bit easier than it used to be to create
> this kind of filter push-down. (The other kind of filter push-down is for
> partition pruning used for files, which you probably won't need.)
>
> Aside from the current MapR repo issues, Drill tends to be much easier to
> build than other systems. Pretty much set up Java and the correct Maven and
> you're good to go. If you run unit tests, there is one additional library
> to install, but the tests themselves tell you you exactly what is needed
> when they fail the first time (which I how I learned about it.)
>
>
> After that, performance will point the way. For example, does your DB have
> indexes? If so, then you can leverage the work originally done for MapR-DB
> to convey index information to Calcite so it can pick the best execution
> plan. There are specialized operators for index key lookup as well.
>
> All this will get you the basic one-table scan which is often all that
> no-SQL DBs ever need. (Any structure usually appears within each document,
> rather than as joined table as in the RDBMS world.) However, if your DB
> does need joins, you will need something like Calcite to work out the
> tradeoffs of the various join+filter-push plans possible, especially if
> your DB supports multiple indexes. There is no escaping the plan-time
> complexity of these cases. Calcite is big and complex, but it does give you
> the tools needed to solve these problems.
>
> If your DB is to be used to power dashboards (summaries of logs, time
> series, click streams, sales or whatever), you'll soon find you need to
> provide a caching/aggregation layer to avoid banging on your DB each time
> the dashboard refreshes. (Imagine a 1-week dashboard, updated every minute,
> where only the last hour has new data.) Drill becomes very handy as a way
> of combining data from a mostly-static caching layer (data for the last 6
> days, say) with your live DB (for the last one day, say.)
>
> If you provide a "writer" as well as a "reader", you can use Drill to load
> your DB as well as query it.
>
>
> Happy to share whatever else I might have learned if you can describe your
> goals in a bit more detail.
>
> Thanks,
> - Paul
>
>
>
> On Sunday, May 3, 2020, 11:25:11 AM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  The compile problem is a problem with the MapR repo (I think). I have
> reported it to the folks who can fix it.
>
> Regarding the generic question, I think that Drill is very much a good
> choice for putting a SQL layer on a noSQL database.
>
> It is definitely the case that the community is much broader than it used to be.

Re: compile issue with MapR repo is being worked

2020-05-03 Thread Ted Dunning
It appears that the certificate issue is resolved.

Can somebody verify this by doing a compilation?

I have to add that based on the number of off-line and on-list pings I got
about this issue I can say that there were quite a few people compiling
Drill on a Sunday morning. That bodes well, I think, for community health.



On Sun, May 3, 2020 at 11:27 AM Ted Dunning  wrote:

>
> I just got word back that the team is looking at the issue.
>
> Not surprisingly, their first look indicates that the issue isn't what it
> appears to be (i.e. not a bad cert)
>
>
>


compile issue with MapR repo is being worked

2020-05-03 Thread Ted Dunning
I just got word back that the team is looking at the issue.

Not surprisingly, their first look indicates that the issue isn't what it
appears to be (i.e. not a bad cert)


Re: Cannot Build Drill "exec/Java Execution Engine"

2020-05-03 Thread Ted Dunning
The compile problem is a problem with the MapR repo (I think). I have
reported it to the folks who can fix it.

Regarding the generic question, I think that Drill is very much a good
choice for putting a SQL layer on a noSQL database.

It is definitely the case that the community is much broader than it used
to be. A number of companies now use Drill in their products which is
one of the best ways to build long-term community.

There are alternatives, of course. All have trade-offs (because we live in
the world):

- Calcite itself (what Drill uses as a SQL parser and optimizer) can be
used, but you have to provide an execution framework and you wind up with
something that only works for your engine and is unlikely to support
parallel operations. Calcite is used by lots of projects, though, so it is
has a very broad base of support.

- Spark SQL is fairly easy to extend (from what I hear from friends) but
the optimizer doesn't deal well with complicated tradeoffs (precisely
because it is fairly simple). You also wind up with the baggage of spark
which could be good or bad. You would get some parallelism, though. I don't
think that Spark SQL handles complex objects, however.

- Postgres has a long history of having odd things grafted onto it. I know
little about this other than seeing the results. Extending Postgres would
not likely give you any parallelism, but there might be a way to support
complex objects through Postgres JSON object support.




On Sun, May 3, 2020 at 11:09 AM Tugdual Grall  wrote:

> Hello
>
> It has been a long time since I used Drill!
>
> I wanted to build it to start to work on a new datasource.
>
> But when I run "mvn clean install", I hit the exception below.
>
> => Can somebody help?
>
> => This brings me to a generic question: if I want to expose a NoSQL
> database using SQL/JDBC/ODBC for analytics purposes, is Drill the best
> option? Or should I look at something else?
>
>
> Thanks!
>
> 
> [INFO] exec/Java Execution Engine . FAILURE [
>  0.676 s]
>
> [ERROR] Failed to execute goal on project drill-java-exec: Could not
> resolve dependencies
> for project org.apache.drill.exec:drill-java-exec:jar:1.18.0-SNAPSHOT:
> Failed to collect dependencies at org.kohsuke:libpam4j:jar:1.8-rev2: Failed
> to read artifact descriptor for org.kohsuke:libpam4j:jar:1.8-rev2: Could
> not transfer artifact org.kohsuke:libpam4j:pom:1.8-rev2 from/to
> mapr-releases (http://repository.mapr.com/maven/): Transfer failed for
>
> http://repository.mapr.com/maven/org/kohsuke/libpam4j/1.8-rev2/libpam4j-1.8-rev2.pom
> 500 Proxy Error -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
>
> http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]   mvn  -rf :drill-java-exec
>


Re: More CI Issues

2020-05-03 Thread Ted Dunning
I have asked the MapR team to look into the certificate issue.



On Sun, May 3, 2020 at 8:03 AM Charles Givre  wrote:

> Hello all,
> In a recent PR that had nothing to do with this, I saw that the CI is now
> failing due to the following error.  I did some digging and found that the
> URL http://repository.mapr.com/maven/ 
> is no longer accessible.  However with SSL
> https://repository.mapr.com/maven/ 
> IS available but with bad certificates.
>
> I'm not a maven with maven, but is there some way to easily fix this so
> that Drill isn't trying to pull from non-existent or broken MapR repos?
> Best,
> -- C
>
>
> [ERROR] Failed to execute goal on project drill-java-exec: Could not
> resolve dependencies for project
> org.apache.drill.exec:drill-java-exec:jar:1.18.0-SNAPSHOT: Failed to
> collect dependencies at org.kohsuke:libpam4j:jar:1.8-rev2: Failed to read
> artifact descriptor for org.kohsuke:libpam4j:jar:1.8-rev2: Could not
> transfer artifact org.kohsuke:libpam4j:pom:1.8-rev2 from/to mapr-releases (
> http://repository.mapr.com/maven/): Transfer failed for
> http://repository.mapr.com/maven/org/kohsuke/libpam4j/1.8-rev2/libpam4j-1.8-rev2.pom
> 500 Proxy Error -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please read
> the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the command
> [ERROR]   mvn  -rf :drill-java-exec
> ##[error]Process completed with exit code 1.


Re: Timestamp Issue

2020-02-06 Thread Ted Dunning
That is really frustrating because that timestamp is literally in an ISO
8601 format.

https://en.wikipedia.org/wiki/ISO_8601

It would be nice if these formats just worked by default.
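For the archives: the usual SQL-level workaround is to double the single
quotes inside the string literal rather than backslash-escape them. A sketch
of what that might look like with Drill's TO_TIMESTAMP() (note the Joda
tokens HH, mm, and ss, since hh is a 12-hour field and MM is the month):

-- Doubling the quotes escapes the literal T inside the format string.
-- Joda tokens: yyyy year, MM month, dd day, HH hour (0-23), mm minute, ss second.
SELECT TO_TIMESTAMP('1998-07-14T04:00:00', 'yyyy-MM-dd''T''HH:mm:ss')
FROM (VALUES(1));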




On Thu, Feb 6, 2020 at 5:05 AM Charles Givre  wrote:

> Hi Drill Devs
> I'm having a small issue interpreting timestamps from data.  The data in
> question is in both CSV and parquet format, and has dates encoded as
> strings in the following format:
>
> 1998-07-14T04:00:00
>
> The issue I'm encountering is dealing with the literal T.  The JODA
> instructions state that you can escape a literal with a single quote.
> IE:
>
> yyyy-MM-dd'T'hh:MM:00
>
> However, the issue here is that since Drill does not allow double quotes,
> all these need to be escaped.
>
> yyyy-MM-dd\'T\'hh:MM:00
>
> But... this just doesn't seem to work.  I'm using the TO_TIMESTAMP()
> function.  Any suggestions?
> Thanks,
> -- C


Re: [DISCUSS]: Thoughts

2020-01-30 Thread Ted Dunning
Igor,

Good documentation and first 5-minute experience are very important, but
not because a long-term contributor will see it and commit their spare time
for the next five years on that basis. It is more about preventing early
attrition of contributors who might find the project very exciting due to
silly factors. That can easily happen if the documentation is bad because
it increases the frustration a potential contributor feels early on. If
they can't try the software and get something interesting, then we are
likely to lose the battle for attention span.

And frankly, it isn't just the developer that we need to attract and
retain. A user who never contributes a line of code is part of the
community and can easily be a net positive if they only report problems and
occasionally tell people what they are doing.



On Thu, Jan 30, 2020 at 7:00 AM Igor Guzenko 
wrote:

> Hello Charles,
>
> Thank you very much for starting this important discussion. These are all
> important things, but at the moment I don't have a clear vision of where
> we could start. The item which is most interesting to me is the
> second one, but I've never been involved in building an open-source
> community and don't even know where to start. I'm not sure that just making
> good documentation and a good first impression will attract developers with
> strong motivation to contribute.  So I'm very excited to learn about
> projects which managed to build such a community, maybe we really could
> find some new fresh ideas about how to attract new community members.
>
> Thanks,
> Igor
>
> On Thu, Jan 30, 2020 at 4:18 PM Charles Givre  wrote:
>
> > Hello all,
> > I mentioned in the Drill hangout last week that I had spoken with one of
> > the original mentors for the Drill project (Isabel Drost-Fromm) and asked
> > her advice about the future of Drill.  To paraphrase what she told me:
> >
> > 1.  There are two ways for open source projects to succeed.   The first
> > and riskier approach is with a single corporate sponsor.  The obvious risks
> > are that since the corporate sponsor is footing the bill, they will
> > prioritize their own needs over, and sometimes against, community needs.
> > (This is not unique to Drill). The slower but less risky approach is to
> > build a community around a project, join forces and slowly drive it
> > forward.  She pointed out that some of the Apache foundation's longest
> > running projects were run in this way.
> >
> > 2.  We should focus our efforts on community building:  She suggested a
> > lot of what she described as "would be obvious in retrospect" such as
> > making sure the documentation is really solid, and offering a good user
> > experience from the beginning.  She said we should use the resources of the
> > Apache foundation to help publicize new releases etc.  Also we should make
> > it easy to become a committer.  IMHO, I would add that we really should
> > seek out additional code reviewers, as we don't have enough and PRs take a
> > long time to get approved.
> >
> > 3.  Do a lot of what a vendor would do:  Update the website and
> > documentation to reflect things like: who is using Drill, who is offering
> > professional support for Drill etc.
> >
> > 4.  Define a mission:  We should work to define a mission for Drill, i.e.
> > why does/should it exist and what business problem does it solve?  IMHO it
> > solves a very large one, but more people need to know about it.  That's why
> > I'm not giving up on it yet.
> >
> >
> > @Isabel, I hope I captured the essence of what you were telling me here.
> >
> > Thanks everyone,
> > --C
> >
> >
> >
> >
>


Re: Drill Build Issues

2019-10-31 Thread Ted Dunning
I still don't understand how that came to be.

It needs to be made more independent.



On Thu, Oct 31, 2019 at 12:31 PM Charles Givre  wrote:

> It came back on this morning.  However it certainly suggests that we
> should remove anything that is dependent on MapR infrastructure.
>
> Sent from my iPhone
>
> > On Oct 31, 2019, at 15:25, Jinfeng Ni  wrote:
> >
> > Can some MapR folks help see what happened to the maven repo hosted at
> > http://repository.mapr.com/maven/?  Will MapR continue to host such
> > repo?
> >
> >
> >> On Thu, Oct 31, 2019 at 5:20 AM Charles Givre  wrote:
> >>
> >> Hello all,
> >> I'm having some issues building Drill from source.  I'm getting an
> error something like "Could not transfer artifact from/to (
> http://repository.mapr.com/maven ).
> When I attempt to go to that URL I see that it is down.   I'm not a maven
> expert, but I attempted to remove this from the main pom.xml, but that was
> clearly not the answer as then Drill wouldn't build at all.
> >>
> >> Thanks!
> >> -- C
> >>
> >>
> >> Here is the full error message:
> >> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process
> (process-resource-bundles) on project distribution: Failed to resolve
> dependencies for one or more projects in the reactor. Reason: Unable to get
> dependency information for cisd:jhdf5:jar:14.12.6: Failed to retrieve POM
> for cisd:jhdf5:jar:14.12.6: Could not transfer artifact
> cisd:jhdf5:pom:14.12.6 from/to mapr-releases (
> http://repository.mapr.com/maven/): Failed to transfer file
> http://repository.mapr.com/maven/cisd/jhdf5/14.12.6/jhdf5-14.12.6.pom
> with status code 503
> >> [ERROR]   cisd:jhdf5:jar:14.12.6
> >> [ERROR]
> >> [ERROR] from the specified remote repositories:
> >> [ERROR]   conjars (http://conjars.org/repo, releases=true,
> snapshots=false),
> >> [ERROR]   mapr-releases (http://repository.mapr.com/maven/,
> releases=true, snapshots=false),
> >> [ERROR]   sonatype-nexus-snapshots (
> https://oss.sonatype.org/content/repositories/snapshots, releases=false,
> snapshots=true),
> >> [ERROR]   jitpack.io (https://jitpack.io, releases=true,
> snapshots=true),
> >> [ERROR]   apache.snapshots (https://repository.apache.org/snapshots,
> releases=false, snapshots=true),
> >> [ERROR]   central (https://repo.maven.apache.org/maven2,
> releases=true, snapshots=false)
> >> [ERROR] Path to dependency:
> >> [ERROR] 1) org.apache.drill:distribution:pom:1.17.0-SNAPSHOT
> >> [ERROR] 2)
> org.apache.drill.contrib:drill-format-hdf5:jar:1.17.0-SNAPSHOT
> >> [ERROR]
> >> [ERROR]
> >> [ERROR] -> [Help 1]
> >> [ERROR]
> >> [ERROR] To see the full stack trace of the errors, re-run Maven with
> the -e switch.
> >> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> >> [ERROR]
> >> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> >> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> >> [ERROR]
> >> [ERROR] After correcting the problems, you can resume the build with
> the command
> >> [ERROR]   mvn  -rf :distribution
> >>
>


Re: [Discuss] Minor Release

2019-09-27 Thread Ted Dunning
Yes.



On Fri, Sep 27, 2019 at 11:09 AM Charles Givre  wrote:

> Hello all
> There was a recent email to the user group about a blocking issue with
> sqlline.  The issue was resolved in the latest version of sqlline however
> it was preventing a user from executing queries.  In a situation like this
> where a simple upgrade of a library fixes a major issue, would it make
> sense to release a minor upgrade only including the updated library?
>
> Sent from my iPhone


Re: [DISCUSS]: PCAP Reader Improvements

2019-09-22 Thread Ted Dunning
Another thought is to have an alternative (potential) map field for each
possible protocol.

Thus, you would have a map for the DNS protocol and a map for the ICMP and
so on. This would allow each map to have a fixed format.
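To make the per-protocol map idea concrete, here is a rough sketch of what
such a fixed schema could look like with Drill's SchemaBuilder. The column
names and types are illustrative guesses, not the plugin's actual schema.

import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.record.metadata.SchemaBuilder;
import org.apache.drill.exec.record.metadata.TupleMetadata;

public class PcapSchemaSketch {
  // Common layer 2/3 columns plus one map per protocol; only the map that
  // matches the packet's protocol would be populated for a given row.
  public static TupleMetadata buildSchema() {
    return new SchemaBuilder()
        .addNullable("packet_timestamp", MinorType.TIMESTAMP)
        .addNullable("src_ip", MinorType.VARCHAR)
        .addNullable("dst_ip", MinorType.VARCHAR)
        .addMap("dns")                                // fixed DNS fields
            .addNullable("query", MinorType.VARCHAR)
            .addNullable("response_code", MinorType.INT)
            .resumeSchema()
        .addMap("icmp")                               // fixed ICMP fields
            .addNullable("type", MinorType.INT)
            .addNullable("code", MinorType.INT)
            .resumeSchema()
        .buildSchema();
  }
}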



On Sun, Sep 22, 2019 at 9:46 AM Charles Givre  wrote:

> Hi Ted,
> EVF = Enhanced Vector Framework. Complete tutorial here:
> https://github.com/paul-rogers/drill/wiki/Developer%27s-Guide-to-the-Enhanced-Vector-Framework#basics-tutorial
> Basically, what I was thinking was that we can use the EVF to define the
> schema for known columns (IE level 1 & 2 headers).  EVF handles pushdown
> projection so we could eliminate a lot of that logic in the plugin.  Then
> EVF also allows dynamic schema discovery, so we could create a map called
> packet_data or whatever, and that would be populated with whatever fields
> exist in the packet.  We would need to write or otherwise obtain protocol
> dissectors for the different protocols but I'm going to start with DNS
> since I need that for work.   I'm pretty sure that the EVF allows for
> variant maps so if you have a DNS packet and an ICMP packet, you'd get
> different fields in the map.
> -- C
>
>
>
>
> > On Sep 22, 2019, at 11:30 AM, Ted Dunning  wrote:
> >
> > This sounds amazing.
> >
> > Some questions.
> >
> > What is EVF?
> >
> > How can you deal with the problem of variant maps?
> >
> > On Sun, Sep 22, 2019, 7:55 AM Charles Givre  wrote:
> >
> >> Hello all,
> >> I'm contemplating some improvements to Drill's PCAP reader.
> Specifically,
> >> I'd like for Drill to actually be able to parse some of the actual
> packet
> >> data.  I was thinking of using KaiTai structs as a means to do so as
> they
> >> already have parsers for common packets.  An example of this is the DNS
> >> parser (https://formats.kaitai.io/dns_packet/java.html)
> >>
> >> I was thinking of doing the following:
> >> 1.  Converting the PCAP plugin to use the EVF framework.
> >> 2.  Including a config option to turn the parsing on/off
> >> 3.  Having the appropriate parser read and parse the data and store it
> >> into a Drill map.
> >>
> >> Does anyone have any comments or thoughts on the matter?
> >> Thanks,
> >> -- C
> >>
> >>
>
>


Re: [DISCUSS]: PCAP Reader Improvements

2019-09-22 Thread Ted Dunning
This sounds amazing.

Some questions.

What is EVF?

How can you deal with the problem of variant maps?

On Sun, Sep 22, 2019, 7:55 AM Charles Givre  wrote:

> Hello all,
> I'm contemplating some improvements to Drill's PCAP reader.  Specifically,
> I'd like for Drill to actually be able to parse some of the actual packet
> data.  I was thinking of using KaiTai structs as a means to do so as they
> already have parsers for common packets.  An example of this is the DNS
> parser (https://formats.kaitai.io/dns_packet/java.html)
>
> I was thinking of doing the following:
> 1.  Converting the PCAP plugin to use the EVF framework.
> 2.  Including a config option to turn the parsing on/off
> 3.  Having the appropriate parser read and parse the data and store it
> into a Drill map.
>
> Does anyone have any comments or thoughts on the matter?
> Thanks,
> -- C
>
>


Re: Anybody at Apachecon Vegas

2019-09-11 Thread Ted Dunning
Ah... bummer.

Ellen just said that you would be here. We will pass the word to correct
that.

Definitely do take care of the priority stuff. Hope it turns out well.




On Wed, Sep 11, 2019 at 9:17 AM Charles Givre  wrote:

> I was planning on being there but had a family medical situation so I
> wasn't able to attend :-(. Next time!
>
> > On Sep 11, 2019, at 12:14 PM, Aman Sinha  wrote:
> >
> > Yes, I am here today.  See you guys soon.
> >
> > -Aman
> >
> > On Tue, Sep 10, 2019 at 9:59 PM Ted Dunning 
> wrote:
> >
> >> I am here. So is Ellen.  I think Aman as well. Come to the Drill track
> >> tomorrow. Both Ellen and I have talks.
> >>
> >>
> >>
> >> On Tue, Sep 10, 2019 at 2:16 PM Naresh Bhat 
> >> wrote:
> >>
> >>> Hi Guys,
> >>>
> >>> Anybody attending Apachecon at Vegas ?
> >>>
> >>> Regards
> >>> -Naresh
> >>>
> >>
>
>


Re: Anybody at Apachecon Vegas

2019-09-10 Thread Ted Dunning
I am here. So is Ellen.  I think Aman as well. Come to the Drill track
tomorrow. Both Ellen and I have talks.



On Tue, Sep 10, 2019 at 2:16 PM Naresh Bhat  wrote:

> Hi Guys,
>
> Anybody attending Apachecon at Vegas ?
>
> Regards
> -Naresh
>


Re: MapR nexus server

2019-08-13 Thread Ted Dunning
We will have a plan in place to avoid disruption of project builds.



On Tue, Aug 13, 2019 at 10:07 AM Catherine Lyman  wrote:

> Here is a message from Kevin who runs the MapR DevOps team...
>
> Kevin Cheng
> Mon, Aug 12, 1:23 PM (20 hours ago)
> to me
> We will have it for next 3 months for sure. The long term plan is still
> TBD.
>
> On Mon, Aug 12, 2019 at 11:39 AM Julian Hyde  wrote:
>
> > Hi Drill devs,
> >
> > I’d like to raise a concern from the Calcite dev team. Of course Drill
> > depends on Calcite, but Calcite also depends on Drill: we use
> > drill-fmpp-plugin[1] in our build. This plugin looks for resources in
> > http://repository.mapr.com/maven/ 
> > (because of its dependency on drill-root[2]).
> >
> > The concern is whether repository.mapr.com will be available in the long
> > term. If that repository were to go away, a lot of projects’ builds might
> > start breaking.
> >
> > Are there plans to address this concern?
> >
> > Julian
> >
> > [1] https://github.com/apache/drill/blob/master/tools/fmpp/pom.xml
> >
> > [2] https://github.com/apache/drill/blob/master/pom.xml
>
>
>
> --
>
>
> *Catherine C. Lyman*
>
> *We make the 'F' stand for 'Fabulous' in RTFM*
>
> VP, Technical Documentation
> 4555 Great America Pkwy, Santa Clara, CA 95054
> cly...@mapr.com
> 650.208.2865
>


Re: complex data structure aggregators?

2019-08-12 Thread Ted Dunning
Charles,

That might work. The t-digest will give us a median estimate.



On Mon, Aug 12, 2019 at 4:33 PM Charles Givre  wrote:

> Hi Ted,
> You might want to take a look at this repo:
> https://github.com/cgivre/drill-stats-function/blob/master/src/main/java/org/apache/drill/contrib/function/DrillStatsFunctions.java
> This was an experiment to see if I could write a function to calculate a
> median.  I found a streaming algorithm to do so, but it required the use of
> two stacks.  This was more of a "can I do this" type challenge than a "will
> this really work well" but I did get it to work.  In any event, the way I
> did it was to use the @Workspace and use an ObjectHolder.  Maybe this will
> help you out.
> -- C
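For readers following along, here is a bare-bones sketch of the shape such an
aggregate UDF takes, using the ObjectHolder workspace approach Charles
mentions above. The function name, the use of the t-digest library, and the
holder wiring are illustrative assumptions; Drill source-generates UDF bodies,
so real code keeps all logic inside the annotated methods and fully qualifies
helper classes.

import org.apache.drill.exec.expr.DrillAggFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.annotations.Workspace;
import org.apache.drill.exec.expr.holders.Float8Holder;
import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
import org.apache.drill.exec.expr.holders.ObjectHolder;

@FunctionTemplate(name = "approx_median",  // hypothetical function name
    scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
public class ApproxMedianFunction implements DrillAggFunc {

  @Param Float8Holder in;           // one input value per row
  @Workspace ObjectHolder digest;   // heap object carrying aggregate state
  @Output NullableFloat8Holder out;

  public void setup() {
    // The ObjectHolder can wrap any Java object, e.g. a t-digest; note that
    // such state is not spilled to disk if the hash aggregate spills.
    digest = new ObjectHolder();
    digest.obj = new com.tdunning.math.stats.MergingDigest(100);
  }

  public void add() {
    ((com.tdunning.math.stats.MergingDigest) digest.obj).add(in.value);
  }

  public void output() {
    out.isSet = 1;
    out.value = ((com.tdunning.math.stats.MergingDigest) digest.obj).quantile(0.5);
  }

  public void reset() {
    digest.obj = new com.tdunning.math.stats.MergingDigest(100);
  }
}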
>
>
> > On Aug 12, 2019, at 6:03 PM, Paul Rogers 
> wrote:
> >
> > Hi Ted,
> >
> > You are now at the point that you'll have to experiment. Drill provides
> an annotation for aggregate state:  @Workspace. The value must be declared
> as a "holder". You'll have to check if VarBinaryHolder is allowed, and, if
> so, how you allocate memory and remember the offset into the array. (My
> guess is that this may not work.)
> > @Workspace does allow you to specify a holder for a Java object, but
> such objects won't be spilled to disk when, say, the hash aggregate spills.
> This means your aggregate will work fine at small scale, then mysteriously
> fail once moved into production. Fun.
> >
> > Unless aggregate UDFs are special, they can return a VarChar or
> VarBinary result. The book explains how to do this for VarChar, some poking
> around in the Drill source should identify how to do so for VarBinary.
> (There are crufty details about allocating space, copying over data, etc.)
> >
> > FWIW: There is a pile of information on UDF internals on my GitHub Wiki.
> [1] Aggregate UDFS are covered in [2]. Once we learn the answers to your
> specific questions, we can add the info to the Wiki.
> >
> > Thanks,
> > - Paul
> >
> > [1]
> https://github.com/paul-rogers/drill/wiki/UDFs-Background-Information
> >
> >
> > [2] https://github.com/paul-rogers/drill/wiki/Aggregate-UDFs
> >
> >
> >
> >
> >
> >
> >On Monday, August 12, 2019, 01:19:33 PM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
> >
> > I am trying to figure out how to build an approximate percentile
> estimator.
> >
> > I have a fancy data structure that will do this. It can live in bounded
> > memory with no allocation. I can add numbers to the digest easily enough.
> > And the required results can be extracted from the structure.
> >
> > What I would need to know:
> >
> > - how to use a fixed array of bytes as the state of an aggregating UDF
> >
> > - how to pass in an argument to an aggregator OR (better) how to use the
> > binary result of an aggregator in another function.
> >
> > On Mon, Aug 12, 2019 at 11:25 AM Charles Givre  wrote:
> >
> >> Ted,
> >> Can we ask what it is you are trying to build a UDF for?
> >> --C
> >>
> >>> On Aug 12, 2019, at 2:23 PM, Paul Rogers 
> >> wrote:
> >>>
> >>> Hi Ted,
> >>>
> >>> Thanks for the link; I suspected there was some trick for stddev. The
> >> point still stands that, if the algorithm requires multiple passes over
> the
> >> data (ML, say), can't be done in Drill.
> >>>
> >>> Each UDF must return exactly one value. It can return a map if you want
> >> multiple values (though someone would have to check that projection
> works
> >> to convert these to scalar top-level values). AFAIK, a UDF can produce a
> >> binary buffer as output (type VarBinary). But, an aggregate UDF cannot
> >> accumulate a VarChar or VarBinary because Drill cannot insert values
> into
> >> an existing variable-length vector.
> >>>
> >>> UDFs need your knack for finding a workaround to get your job done;
> they
> >> have pretty strong limitations on the surface.
> >>>
> >>> Thanks,
> >>> - Paul
> >>>
> >>>
> >>>
> >>> On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <
> >> ted.dunn...@gmail.com> wrote:
> >>>
> >>> Is it possible for a UDF to produce multiple scalar results? Can it
> >> produce
> >>> a binary result?

Re: complex data structure aggregators?

2019-08-12 Thread Ted Dunning
I am trying to figure out how to build an approximate percentile estimator.

I have a fancy data structure that will do this. It can live in bounded
memory with no allocation. I can add numbers to the digest easily enough.
And the required results can be extracted from the structure.

What I would need to know:

- how to use a fixed array of bytes as the state of an aggregating UDF

- how to pass in an argument to an aggregator OR (better) how to use the
binary result of an aggregator in another function.
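
For concreteness, here is roughly the shape I have in mind, borrowing the
@Workspace ObjectHolder trick from Charles's repo. This is an untested
sketch: the function name and compression value are invented, and (per the
caveats in this thread) the Object state will not survive spilling:

package org.apache.drill.contrib.function;

import org.apache.drill.exec.expr.DrillAggFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.annotations.Workspace;
import org.apache.drill.exec.expr.holders.Float8Holder;
import org.apache.drill.exec.expr.holders.ObjectHolder;

@FunctionTemplate(name = "approx_median",
    scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE,
    nulls = FunctionTemplate.NullHandling.INTERNAL)
public class ApproxMedianFunction implements DrillAggFunc {

  @Param Float8Holder in;         // the column being aggregated
  @Workspace ObjectHolder work;   // carries the digest between add() calls
  @Output Float8Holder out;

  public void setup() {
    work = new ObjectHolder();
    work.obj = com.tdunning.math.stats.TDigest.createDigest(100);
  }

  public void add() {
    // Drill copies method bodies into generated code, so classes used
    // inside them must be fully qualified.
    ((com.tdunning.math.stats.TDigest) work.obj).add(in.value);
  }

  public void output() {
    out.value = ((com.tdunning.math.stats.TDigest) work.obj).quantile(0.5);
  }

  public void reset() {
    work.obj = com.tdunning.math.stats.TDigest.createDigest(100);
  }
}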

On Mon, Aug 12, 2019 at 11:25 AM Charles Givre  wrote:

> Ted,
> Can we ask what it is you are trying to build a UDF for?
> --C
>
> > On Aug 12, 2019, at 2:23 PM, Paul Rogers 
> wrote:
> >
> > Hi Ted,
> >
> > Thanks for the link; I suspected there was some trick for stddev. The
> point still stands that, if the algorithm requires multiple passes over the
> data (ML, say), can't be done in Drill.
> >
> > Each UDF must return exactly one value. It can return a map if you want
> multiple values (though someone would have to check that projection works
> to convert these to scalar top-level values). AFAIK, a UDF can produce a
> binary buffer as output (type VarBinary). But, an aggregate UDF cannot
> accumulate a VarChar or VarBinary because Drill cannot insert values into
> an existing variable-length vector.
> >
> > UDFs need your knack for finding a workaround to get your job done; they
> have pretty strong limitations on the surface.
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
> >
> > Is it possible for a UDF to produce multiple scalar results? Can it
> produce
> > a binary result?
> >
> > Also, as a nit, standard deviation doesn't require buffering all the
> data.
> > It just requires that you have three accumulators, one for count, one for
> > mean and one for mean squared deviation.  There is a slightly tricky
> > algorithm called Welford's algorithm
> > <
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
> >
> > which
> > allows good numerical stability while computing this on-line.
> >
> > On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers 
> > wrote:
> >
> >> Hi Ted,
> >>
> >> Last I checked (when we wrote the book chapter on the subject),
> aggregate
> >> state is limited to scalars and Drill-defined types. There is no
> support
> >> to spill aggregate state, so that state will be lost if spilling is
> >> required to handle large aggregate batches. The current solution works
> for
> >> simple cases such as totals and averages.
> >>
> >> Aggregate UDFs share no state, so it is not possible for one function to
> >> use state accumulated by another. If, for example, you want sum, average
> >> and standard deviation, you'll have to accumulate the total three times,
> >> average twice, and so on. Note that the std dev function will require
> >> buffering all data in one's own array (without any spilling or other
> >> support), to allow computing the (X-bar - X)^2 part of the calculation.
> >>
> >> A UDF can emit a byte array (have to check if this is true of aggregate
> >> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
> >> VarChar.
> >>
> >> All this is from memory and so is only approximately accurate. YMMV.
> >>
> >> Thanks,
> >> - Paul
> >>
> >>
> >>
> >> On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
> >> ted.dunn...@gmail.com> wrote:
> >>
> >>   What is the current state of building aggregators that have complex
> state
> >> via UDFs?
> >>
> >> Is it possible to define multi-level aggregators in a UDF?
> >>
> >> Can the output of a UDF be a byte array?
> >>
> >>
> >> (these are three different questions)
> >>
>
>


Re: complex data structure aggregators?

2019-08-12 Thread Ted Dunning
Can UDFs accumulate a fixed length binary value?



On Mon, Aug 12, 2019 at 11:23 AM Paul Rogers 
wrote:

> Hi Ted,
>
> Thanks for the link; I suspected there was some trick for stddev. The
> point still stands that, if the algorithm requires multiple passes over the
> data (ML, say), can't be done in Drill.
>
> Each UDF must return exactly one value. It can return a map if you want
> multiple values (though someone would have to check that projection works
> to convert these to scalar top-level values). AFAIK, a UDF can produce a
> binary buffer as output (type VarBinary). But, an aggregate UDF cannot
> accumulate a VarChar or VarBinary because Drill cannot insert values into
> an existing variable-length vector.
>
> UDFs need your knack for finding a workaround to get your job done; they
> have pretty strong limitations on the surface.
>
> Thanks,
> - Paul
>
>
>
> On Monday, August 12, 2019, 10:59:56 AM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  Is it possible for a UDF to produce multiple scalar results? Can it
> produce
> a binary result?
>
> Also, as a nit, standard deviation doesn't require buffering all the data.
> It just requires that you have three accumulators, one for count, one for
> mean and one for mean squared deviation.  There is a slightly tricky
> algorithm called Welford's algorithm
> <
> https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
> >
> which
> allows good numerical stability while computing this on-line.
>
> On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers 
> wrote:
>
> > Hi Ted,
> >
> > Last I checked (when we wrote the book chapter on the subject), aggregate
> > state is limited to scalars and Drill-defined types. There is no support
> > to spill aggregate state, so that state will be lost if spilling is
> > required to handle large aggregate batches. The current solution works
> for
> > simple cases such as totals and averages.
> >
> > Aggregate UDFs share no state, so it is not possible for one function to
> > use state accumulated by another. If, for example, you want sum, average
> > and standard deviation, you'll have to accumulate the total three times,
> > average twice, and so on. Note that the std dev function will require
> > buffering all data in one's own array (without any spilling or other
> > support), to allow computing the (X-bar - X)^2 part of the calculation.
> >
> > A UDF can emit a byte array (have to check if this is true of aggregate
> > UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
> > VarChar.
> >
> > All this is from memory and so is only approximately accurate. YMMV.
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
> > ted.dunn...@gmail.com> wrote:
> >
> >  What is the current state of building aggregators that have complex
> state
> > via UDFs?
> >
> > Is it possible to define multi-level aggregators in a UDF?
> >
> > Can the output of a UDF be a byte array?
> >
> >
> > (these are three different questions)
> >
>


Re: complex data structure aggregators?

2019-08-12 Thread Ted Dunning
Is it possible for a UDF to produce multiple scalar results? Can it produce
a binary result?

Also, as a nit, standard deviation doesn't require buffering all the data.
It just requires that you have three accumulators, one for count, one for
mean and one for mean squared deviation.  There is a slightly tricky
algorithm called Welford's algorithm
<https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm>
which
allows good numerical stability while computing this on-line.
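
In plain Java the whole thing is just three accumulators and one update
method (a sketch; this computes the population variance, so divide by
count - 1 instead if you want the sample variance):

public class Welford {
  private long count = 0;
  private double mean = 0.0;
  private double m2 = 0.0;   // running sum of squared deviations

  public void add(double x) {
    count++;
    double delta = x - mean;
    mean += delta / count;
    m2 += delta * (x - mean);   // note: uses the already-updated mean
  }

  public double variance() {
    return count > 0 ? m2 / count : Double.NaN;
  }

  public double stddev() {
    return Math.sqrt(variance());
  }
}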

On Mon, Aug 12, 2019 at 9:01 AM Paul Rogers 
wrote:

> Hi Ted,
>
> Last I checked (when we wrote the book chapter on the subject), aggregate
> state is limited to scalars and Drill-defined types. There is no support
> to spill aggregate state, so that state will be lost if spilling is
> required to handle large aggregate batches. The current solution works for
> simple cases such as totals and averages.
>
> Aggregate UDFs share no state, so it is not possible for one function to
> use state accumulated by another. If, for example, you want sum, average
> and standard deviation, you'll have to accumulate the total three times,
> average twice, and so on. Note that the std dev function will require
> buffering all data in one's own array (without any spilling or other
> support), to allow computing the (X-bar - X)^2 part of the calculation.
>
> A UDF can emit a byte array (have to check if this is true of aggregate
> UDFs). A VarChar is simply a special kind of array, and UDFs can emit a
> VarChar.
>
> All this is from memory and so is only approximately accurate. YMMV.
>
> Thanks,
> - Paul
>
>
>
> On Monday, August 12, 2019, 07:35:47 AM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  What is the current state of building aggregators that have complex state
> via UDFs?
>
> Is it possible to define multi-level aggregators in a UDF?
>
> Can the output of a UDF be a byte array?
>
>
> (these are three different questions)
>


complex data structure aggregators?

2019-08-12 Thread Ted Dunning
What is the current state of building aggregators that have complex state
via UDFs?

Is it possible to define multi-level aggregators in a UDF?

Can the output of a UDF be a byte array?


(these are three different questions)


Re: [DISCUSS]: Drill after MapR

2019-08-08 Thread Ted Dunning
Assuming that HPE won't support continued development is not warranted.
There is risk in all change, but it isn't correct to assume that support
will be lost.

That said, the community should be able to release independently of any
single company, assuming that Apache has enough resources.


On Thu, Aug 8, 2019 at 9:49 AM Charles Givre  wrote:

> Hello all,
> Now that MapR is officially part of HPE, I wanted to ask everyone their
> thoughts about how we can continue to release Drill without the MapR owned
> infrastructure.  Assuming that HPE is not likely to continue supporting
> Drill (or maybe they will, but I'm guessing not) what infrastructure does
> the community need to take over?
> Can we start compiling a list and formulating a plan for this?
> Thanks,
> -- C


Re: Apache Drill Hangout - July 9, 2019

2019-07-10 Thread Ted Dunning
It won't be possible to find a time that works for Kiev, Germany, the US
(east and west) and Asia.

Traditionally, Drill has mostly had contributors from EU and US so that
made it possible to find an overlapping time.

I would suggest that Hangout times be varied so that some hit the
traditional EU+US time, some hit EU+ASIA and some hit US+ASIA.

Bohdan's suggestion that the choice be somewhat topic-specific is a good
one. If the major contributor for a topic is in ASIA and the major advisor
on that topic is in EU, it makes little sense to prioritize US timezone.

I should also point out that the Drill project has a tremendous record of
publicizing these hangouts in advance and bringing any important insights
from these meetings back to the mailing list. I have almost never been able
to participate in real-time, but these reports back have helped me feel
like I am still involved.


On Wed, Jul 10, 2019 at 12:57 AM Bohdan Kazydub 
wrote:

> Hi Weijie,
>
> It'd be nice to hear about your recent work but it looks like the regular
> Hangout time is not convenient for you.
> Maybe you could give a talk on the next Hangout session?
> If you're still willing to do so, please reply to this email with a
> suggestion of a time that works for you, so that the Apache Drill
> community can decide how to proceed (i.e. we find a convenient time
> that works for all interested in the topic).
>
> Kind regards,
> Bohdan Kazydub
>
> On Tue, Jul 9, 2019 at 2:37 AM weijie tong 
> wrote:
>
> > I could give a short talk about my recent work on parallel HashJoin and
> > some other things.
> >
> > On Mon, Jul 8, 2019 at 7:28 PM Bohdan Kazydub 
> > wrote:
> >
> > > Hi Drillers,
> > >
> > > We will have our bi-weekly hangout tomorrow, July 9th, at 10 AM PST
> > > (link: https://meet.google.com/yki-iqdf-tai ).
> > >
> > > If there are any topics you would like to discuss during the hangout
> > please
> > > respond to this email.
> > >
> > > Kind regards,
> > > Bohdan Kazydub
> > >
> >
>


Re: Superset and Drill

2019-06-02 Thread Ted Dunning
Nice.

I started on this ages ago and got stalled.

So very nice that others had more stickem than me and actually followed
through on this.



On Sun, Jun 2, 2019 at 7:34 AM Charles Givre  wrote:

> Hello Everyone,
> I wanted to send this note to the Drill aliases but as of this weekend,
> the Drill integration with Superset is complete and merged. If you haven't
> seen Superset before, it's an open source BI/Data exploration tool that
> works with SQL-like databases.  It is implemented in Python and uses
> SQLAlchemy to connect to a variety of databases.  It also has a really
> powerful SQLLab feature which can be used to run ad-hoc queries.
>
> Here's a link to a tutorial about it:
> http://thedataist.com/visualize-anything-with-superset-and-drill/
>
> Big thanks to everyone who assisted including John Omernik for his work on
> the SQLAlchemy Drill dialect, and Ville Brofeldt for fixing some quirks in
> Superset.
> -- C
>
>


Re: [DISCUSSION] DRILL-7097 Rename MapVector to StructVector

2019-05-31 Thread Ted Dunning
Would it be possible to call the new structure a Dict (following Python's
inspiration)?

That would avoid the large disruption of renaming Map*.
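
Whatever name wins, the staged migration Paul describes below boils down to
a deprecated alias. A toy sketch with invented names, not the real
MinorType:

public enum TypeName {
  STRUCT,   // new name for today's tuple-like "map" vector
  DICT,     // the true key/value map proposed in this thread
  @Deprecated
  MAP;      // kept for a few releases; callers should treat it as STRUCT

  public TypeName canonical() {
    return this == MAP ? STRUCT : this;
  }
}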



On Fri, May 31, 2019 at 10:10 AM Paul Rogers 
wrote:

> Hi Igor,
>
> Thank you for finally addressing a long-running irritation: that the Drill
> Map type is not a map, it is a tuple.
>
> Perhaps you can divide the discussion into three parts.
>
> 1. Renaming classes, enums and other items internal to the Drill source
> code.
>
> 2. Renaming classes that are part of a public or ad-hoc API.
>
> 3. Renaming items visible to users.
>
> Changing items in part 1 causes a one-time disruption to anyone who has a
> working branch. However, a rebase onto master would easily resolve any
> issues. So, changes in this group are pretty safe.
>
>
> The PR also seems to change symbols visible to the anyone who has code in
> a repo separate from Drill, but that builds against Drill. All UDFs and
> plugins that use the former map classes must change. This means that those
> contributions can support Drill either before your PR or after it, not both; the
> maintainer would need two separate branches to support both versions of
> Drill.
>
> Such breaking of (implied) API compatibility is often considered a "bad
> thing." We may not want to complicate the lives of those who have
> graciously created Drill extensions and integrations.
>
> Finally, if we change anything visible from SqlLine, we break running
> applications, which we almost certainly do not want to do. See the changes
> to Types.java as an example.
>
> Can you make the change in a way that all your changes fall only into
> group 1, provide a gradual migration for group 2, and do not change
> anything in group 3?
>
> For example, the MinorType enum is a de facto public API and must retain
> MAP with its current meaning, at least for some number of releases. You
> could add a STRUCT enum and mark MAP deprecated, and encourage third-party
> code to migrate. But we must still support MAP for some period of time to
> provide time for the migration. Then, add the new "map" as, say KVMAP,
> TRUEMAP, KVPAIRS, HIVEMAP, MAP2 or whatever. (Awkward, yes, but necessary.)
> In the future, when the old MAP enum value is retired, it can be repurposed
> as an alias for KVMAP (or whatever), and the KVMAP enum marked as
> deprecated, to be removed after several more releases.
>
>
> Similarly the SQL "MAP" type keyword cannot change, nor can the name of
> any SQL function (UDF) that use the "map" term. These changes will break
> SQL created by users which generally does not end well. Again, you can add
> a new alias, and encourage use of that alias.
>
> One could certainly argue that making a breaking change will impact a
> limited number of people, and that the benefit justifies the cost. I'll
> leave that debate to others, focusing here on the mechanics.
>
>
> Thanks,
> - Paul
>
>
>
> On Friday, May 31, 2019, 12:06:35 AM PDT, Igor Guzenko <
> ihor.huzenko@gmail.com> wrote:
>
>  Hello Drillers,
>
> I'm working on the renaming of Map vector[1] and related stuff to make
> space for new canonical Map vector [2] [3]. I believe this
> renaming has a big impact on Drill and related clients' code
> (ODBC/JDBC).
>
> So I'd like to be sure that this renaming is really necessary and
> everybody agrees with the changes. Please check the draft PR [4] and
> reply on the email.
>
> An alternative solution is to simply leave the current map vector as is and
> name the newly created Map vector (+readers, writers, etc.) differently.
>
> [1] https://issues.apache.org/jira/browse/DRILL-7097
> [2] https://issues.apache.org/jira/browse/DRILL-7096
> [3]
> https://docs.google.com/presentation/d/1FG4swOrkFIRL7qjiP7PSOPy8a1vnxs5Z9PM3ZfRPRYo/edit#slide=id.p
> [4] https://github.com/apache/drill/pull/1803
>
> Thanks, Igor Guzenko
>


Re: adding insert

2019-05-28 Thread Ted Dunning
Yes. CTAS should be a similar problem to unsafe inserts.

We have a few people interested in the work. What we need now are pointers
to where to find the details.

1. How can we enable the syntax?

2. What operators are really necessary?

3. How should writers inject insert optimizer rules to allow insert or
update operator pushdown?
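
On (1), stock Calcite does parse INSERT out of the box; a throwaway check
against the Calcite API looks like the sketch below. Note that Drill's
grammar is a fork, so enabling the syntax in Drill itself may still mean
touching Drill's parser configuration:

import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParseException;
import org.apache.calcite.sql.parser.SqlParser;

public class ParseCheck {
  public static void main(String[] args) throws SqlParseException {
    SqlParser parser = SqlParser.create(
        "INSERT INTO dfs.tmp.t (a, b) VALUES (1, 'x')");
    SqlNode ast = parser.parseStmt();
    System.out.println(ast.getKind());   // prints INSERT
  }
}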



On Mon, May 27, 2019 at 9:42 PM Paul Rogers 
wrote:

> Hi Ted,
>
> Drill can do a CTAS today, which uses a writer provided by the format
> plugin. One would think this same structure could work for an INSERT
> operation, with a writer provided by the storage plugin. The devil, of
> course, is always in the details. And in finding resources to do the work...
>
> Thanks,
> - Paul
>
>
>
> On Monday, May 27, 2019, 5:28:27 PM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  I have in mind the ability to push rows to an underlying DB without any
> transactional support.
>
>
>
>
>


Re: adding insert

2019-05-27 Thread Ted Dunning
And I should point out that Drill already has the problem of data that
changes. It just ignores the problem. If somebody appends to one CSV or
JSON file or another, some changes might get picked up, some might be seen
mid-change (possibly causing a data syntax error), and if DB rows are
inserted then Drill will give strange results.

The Drill policy is and always has been "that's tough".

I am proposing to extend that policy by letting Drill join the party of
tools that do updates.

In particular, I want to send row updates or row inserts to MapR DB.

I have watched the Hive/ORC transactional insert train wreck for some time.
I think that the only viable lessons from that are that 1) doing transactions
on top of a non-database is hard and 2) having non-database people do it
makes it even harder.

My own feeling is that, until some more serious work is done on this,
the right solution is to get some simple capabilities in place. For
instance, if we have insert-only semantics, tracking insertion transactions
using a job_id (or window_id) field works a treat, especially if you hide
the probe for pending or aborted inserts using a view. This actually works,
works well, and is incredibly simple. The only thing wrong is that I have
to bring in a separate tool like Spark or Python to do the insertions. With
sleazyInsert, I could do it all with Drill plus a tiny bit of scripting
glue.
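
For the archives, the view trick is roughly the following (all table and
column names invented):

-- Writers tag every row with their job_id and "commit" by inserting
-- that id into completed_jobs. Readers query only the view, so rows
-- from pending or aborted jobs never show up.
CREATE VIEW dfs.tmp.`clean_events` AS
SELECT e.*
FROM dfs.tmp.`events` e
JOIN dfs.tmp.`completed_jobs` c ON e.job_id = c.job_id;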





On Mon, May 27, 2019 at 5:27 PM Ted Dunning  wrote:

>
> I have in mind the ability to push rows to an underlying DB without any
> transactional support.
>
>
>
> On Mon, May 27, 2019 at 2:16 PM Paul Rogers 
> wrote:
>
>> Hi Ted,
>>
>> From item 3, it sounds like you are focusing on using Drill to front a DB
>> system, rather than proposing to use Drill to update files in a distributed
>> file system (DFS).
>>
>>
>> Turns out that, for the DFS case, the former HortonWorks put quite a bit
>> into working out viable insert/update semantics in Hive with the Hive ACID
>> support. [1], [2] This was a huge amount of work done in conjunction with
>> various partners, and is on its third version as Hive learns the semantics
>> and how to get ACID to perform well under load. Adding ACID support to
>> Drill would be a "non-trivial" exercise (unless Drill could actually borrow
>> Hive's code, but even that might not be simple.)
>>
>>
>> Drill is far simpler than Hive because Drill has long exploited the fact
>> that data is read-only. Once data can change, we must revisit various
>> aspects to account for that fact. Since change can occur concurrently with
>> queries (and other changes), some kind of concurrency control is needed.
>> Hive has worked out a way to ensure that only completed transactions are
>> included in a query by using delta files. Hive delta files can include
>> inserts, updates and deletes.
>>
>> If insert is all that is needed, then there may be simpler solutions:
>> just track which files are newly added. If the underlying file system is
>> atomic, then even this can be simplified down to just noticing that a file
>> exists when planning a query. If the file is visible before it is complete,
>> then some form of mechanism is needed to detect in-progress files. Of
>> course, Drill must already handle this case for files created outside of
>> Drill, so it may "just work" for the DFS case.
>>
>>
>> And, if the goal is simply to push insert into a DB, then the DB itself
>> can handle transactions and concurrency. Generally most DBs manage
>> transactions as part of a session. To ensure Drill does a consistent insert,
>> Drill would need to push the update through a single client (single minor
>> fragment). A distributed insert (using multiple minor fragments each
>> inserting a subset of rows) would require two-phase commit, or would have
>> to forgo consistency. (The CAP problem.) Further, Drill would have to
>> handle insert failures (deadlock detection, duplicate keys, etc.) reported
>> by the target DB and return that error to the Drill client (hopefully in a
>> form other than a long Java stack trace...)
>>
>> All this said, I suspect you have in mind a specific use case that is far
>> simpler than the general case. Can you explain a bit more what you have in
>> mind?
>>
>> Thanks,
>> - Paul
>>
>> [1]
>> https://hortonworks.com/tutorial/using-hive-acid-transactions-to-insert-update-and-delete-data/
>> [2]
>> https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive_3_internals.html
>>
>>
>>
>>
>>
>> On Monday, May 27, 2019, 1:15:36 PM PDT, Ted Dunning <
>> ted.dunn..

Re: adding insert

2019-05-27 Thread Ted Dunning
I have in mind the ability to push rows to an underlying DB without any
transactional support.



On Mon, May 27, 2019 at 2:16 PM Paul Rogers 
wrote:

> Hi Ted,
>
> From item 3, it sounds like you are focusing on using Drill to front a DB
> system, rather than proposing to use Drill to update files in a distributed
> file system (DFS).
>
>
> Turns out that, for the DFS case, the former HortonWorks put quite a bit
> into working out viable insert/update semantics in Hive with the Hive ACID
> support. [1], [2] This was a huge amount of work done in conjunction with
> various partners, and is on its third version as Hive learns the semantics
> and how to get ACID to perform well under load. Adding ACID support to
> Drill would be a "non-trivial" exercise (unless Drill could actually borrow
> Hive's code, but even that might not be simple.)
>
>
> Drill is far simpler than Hive because Drill has long exploited the fact
> that data is read-only. Once data can change, we must revisit various
> aspects to account for that fact. Since change can occur concurrently with
> queries (and other changes), some kind of concurrency control is needed.
> Hive has worked out a way to ensure that only completed transactions are
> included in a query by using delta files. Hive delta files can include
> inserts, updates and deletes.
>
> If insert is all that is needed, then there may be simpler solutions: just
> track which files are newly added. If the underlying file system is atomic,
> then even this can be simplified down to just noticing that a file exist
> when planning a query. If the file is visible before it is complete, then
> some form of mechanism is needed to detect in-progress files. Of course,
> Drill must already handle this case for files created outside of Drill, so
> it may "just work" for the DFS case.
>
>
> And, if the goal is simply to push insert into a DB, then the DB itself
> can handle transactions and concurrency. Generally most DBs manage
> transactions as part of a session. To ensure Drill does a consistent insert,
> Drill would need to push the update through a single client (single minor
> fragment). A distributed insert (using multiple minor fragments each
> inserting a subset of rows) would require two-phase commit, or would have
> to forgo consistency. (The CAP problem.) Further, Drill would have to
> handle insert failures (deadlock detection, duplicate keys, etc.) reported
> by the target DB and return that error to the Drill client (hopefully in a
> form other than a long Java stack trace...)
>
> All this said, I suspect you have in mind a specific use case that is far
> simpler than the general case. Can you explain a bit more what you have in
> mind?
>
> Thanks,
> - Paul
>
> [1]
> https://hortonworks.com/tutorial/using-hive-acid-transactions-to-insert-update-and-delete-data/
> [2]
> https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/using-hiveql/content/hive_3_internals.html
>
>
>
>
>
> On Monday, May 27, 2019, 1:15:36 PM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  I would like to start a discussion about how to add insert capabilities to
> Drill.
>
> It seems that the basic outline is:
>
> 1) making sure Calcite will parse it (almost certain)
> 2) defining an upsert operator in the logical plan
> 3) pushing rules into Drill from the DB driver to allow Drill to push down the
> upsert into DB
>
> Are these generally correct?
>
> Can anybody point me to analogous operations?
>


adding insert

2019-05-27 Thread Ted Dunning
I would like to start a discussion about how to add insert capabilities to
Drill.

It seems that the basic outline is:

1) making sure Calcite will parse it (almost certain)
2) defining an upsert operator in the logical plan
3) pushing rules into Drill from the DB driver to allow Drill to push down the
upsert into DB

Are these generally correct?

Can anybody point me to analogous operations?

