No Drill hangout next Tuesday 19th Sept
Drillers, due to developers attending the hackathon, we won't be having the Drill hangout next Tuesday. The next one will be on Tuesday, Oct 3rd. See you then! -Aman
[jira] [Created] (DRILL-5797) Use more often the new parquet reader
Damien Profeta created DRILL-5797: - Summary: Use more often the new parquet reader Key: DRILL-5797 URL: https://issues.apache.org/jira/browse/DRILL-5797 Project: Apache Drill Issue Type: Improvement Components: Storage - Parquet Reporter: Damien Profeta The choice between the regular parquet reader and the optimized one is based on what types of columns are in the file. But what matters is the set of columns actually read by the query, not every column in the file. We can slightly increase the number of cases where the optimized reader is used by checking whether the projected columns are simple or not. This is an optimization while waiting for the fast parquet reader to handle complex structures. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
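The proposed check can be sketched as follows. This is a stand-alone model with hypothetical names, not Drill's actual reader-selection code: the decision looks only at the projected columns, so a file containing complex columns can still use the fast reader when the query never touches them.

```java
import java.util.List;
import java.util.Map;

public class ReaderChoice {
    // Hypothetical column schema: true means the column is a complex type
    // (map, list, union); false means a simple scalar.
    public static boolean useFastReader(Map<String, Boolean> fileColumns,
                                        List<String> projectedColumns) {
        // Decide based on the projected columns only, not every column
        // present in the file.
        for (String col : projectedColumns) {
            Boolean complex = fileColumns.get(col);
            if (complex != null && complex) {
                return false; // a projected column is complex: regular reader
            }
        }
        return true; // all projected columns are simple: optimized reader
    }

    public static void main(String[] args) {
        Map<String, Boolean> cols = Map.of("id", false, "name", false, "tags", true);
        // The query projects only simple columns, so the fast reader is
        // usable even though the file contains a complex column ("tags").
        System.out.println(useFastReader(cols, List.of("id", "name"))); // true
        System.out.println(useFastReader(cols, List.of("id", "tags"))); // false
    }
}
```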
[jira] [Created] (DRILL-5796) Filter pruning for multi rowgroup parquet file
Damien Profeta created DRILL-5796: - Summary: Filter pruning for multi rowgroup parquet file Key: DRILL-5796 URL: https://issues.apache.org/jira/browse/DRILL-5796 Project: Apache Drill Issue Type: Improvement Components: Storage - Parquet Reporter: Damien Profeta Today, filter pruning uses the file name as the partitioning key. This means a partition can be pruned only if the whole file belongs to that partition. With parquet, pruning could happen at rowgroup granularity when the rowgroups partition the dataset, making the unit of work the rowgroup rather than the file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (DRILL-5795) Filter pushdown for parquet handles multi rowgroup file
Damien Profeta created DRILL-5795: - Summary: Filter pushdown for parquet handles multi rowgroup file Key: DRILL-5795 URL: https://issues.apache.org/jira/browse/DRILL-5795 Project: Apache Drill Issue Type: Improvement Components: Storage - Parquet Reporter: Damien Profeta DRILL-1950 implemented filter pushdown for parquet files, but only for the case of one rowgroup per parquet file. With multiple rowgroups per file, Drill detects that a rowgroup can be pruned but then tells the drillbit to read the whole file, which leads to performance issues. Having multiple rowgroups per file helps handle partitioned datasets while still reading only the relevant subset of the data, without ending up with more files than really needed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
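The rowgroup-level pruning this issue asks for can be modeled in isolation (hypothetical types, not parquet-mr or Drill APIs) to show why the unit of work assigned to a scan should be a rowgroup's byte range rather than the whole file:

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupPruning {
    // Hypothetical stand-in for parquet rowgroup metadata: min/max
    // statistics for one column, plus the rowgroup's byte range in the file.
    public record RowGroup(long min, long max, long startOffset, long length) {}

    // Keep only the rowgroups whose [min, max] range can satisfy the
    // predicate "col = value"; the others are pruned without being read.
    public static List<RowGroup> prune(List<RowGroup> groups, long value) {
        List<RowGroup> kept = new ArrayList<>();
        for (RowGroup rg : groups) {
            if (value >= rg.min() && value <= rg.max()) {
                kept.add(rg);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<RowGroup> groups = List.of(
            new RowGroup(0, 99, 0, 4096),        // values 0..99
            new RowGroup(100, 199, 4096, 4096),  // values 100..199
            new RowGroup(200, 299, 8192, 4096)); // values 200..299
        // Only the middle rowgroup can contain the value 150; the scan
        // should be assigned that rowgroup's byte range (offset 4096,
        // length 4096), not the whole file.
        System.out.println(prune(groups, 150).size()); // prints 1
    }
}
```

The same statistics-based check is what the pushdown already computes; the missing piece described in the issue is propagating the surviving rowgroups' byte ranges to the drillbit instead of the full file.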
[GitHub] drill pull request #944: DRILL-5425: Support HTTP Kerberos auth using SPNEGO
GitHub user sindhurirayavaram opened a pull request: https://github.com/apache/drill/pull/944 DRILL-5425: Support HTTP Kerberos auth using SPNEGO SPNEGO extends Kerberos authentication to the Drill Web UI. Things to be added: - Unit tests - Showing the login option depending on the configured mechanisms You can merge this pull request into a Git repository by running: $ git pull https://github.com/sindhurirayavaram/drill Spnego Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/944.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #944 commit c770a2e7b0e75c4f18b954d8ada36469f00c8355 Author: mapr Date: 2017-09-11T23:56:22Z Added comments commit 6015e98f08c1961bbbc9911946522f9a076e87e0 Author: mapr Date: 2017-09-13T19:48:28Z Added Test Cases commit 5ff12ba1fe38fd728b0c33a686c665547dd21c50 Author: Sindhuri Date: 2017-09-15T22:21:53Z Formatted code ---
Re: Drill 2.0 (design) hackathon
Hi Pritesh, What time do you think you’d want me to present? Also, should I make some slides? Best, — C > On Sep 15, 2017, at 13:23, Pritesh Maker wrote: > > Hi All > > We are looking forward to hosting the hackathon on Monday. Just a few updates > on the logistics and agenda: > > • We are expecting over 25 people attending the event – you can see the > attendee list at the Eventbrite site - > https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285 > > > • Breakfast will be served starting at 8:30AM – we would like to begin > promptly at 9AM > > • The agenda has been updated to reflect the speakers (see the update in the > sheet - > https://docs.google.com/spreadsheets/d/1PEpgmBNAaPcu9UhWmZ8yPYtXbUGqOAYwH87alWkpCic/edit#gid=0 > ) > o Key Note & Introduction – Ted Dunning, Parth Chandra and Aman Sinha > o Community Contributions – Anil Kumar, John Omernik, Charles Givre and Ted > Dunning > o Two tracks for technical design discussions – some topics have initial > thoughts and some will have open brainstorming discussions > o Once the discussions are concluded, we will have summaries presented and > notes shared with the community > > • We will have a WebEx for the first two sessions. For the two tracks, we > will either continue the WebEx or have Hangout links (will publish them to > the google sheet) > "JOIN WEBEX MEETING > https://mapr.webex.com/mapr/j.php?MTID=m9d39036e3953cce59ea81250c70c6c76 > Meeting number (access code): 806 111 950 > Meeting password: ApacheDrill" > > • For the attendees in person, we have made bookings for a dinner in the > evening - https://www.yelp.com/biz/chili-garden-restaurant-milpitas > > Looking forward to a fantastic day for the Apache Drill community! 
> > Thanks, > Pritesh > > > > On 9/5/17, 10:47 PM, "Aman Sinha" wrote: > >Here is the Eventbrite event for registration: > > > https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285 > >Please register so we can plan for food and drinks appropriately. > >The link also contains a google doc link for the preliminary agenda and a >'Topics' tab with volunteer sign-up column. Please add your name to the >area(s) of interest. > >Thanks and look forward to seeing you all ! > >-Aman > >On Wed, Aug 30, 2017 at 9:44 AM, Paul Rogers wrote: > >> A partial list of Drill’s public APIs: >> >> IMHO, highest priority for Drill 2.0. >> >> >> * JDBC/ODBC drivers >> * Client (for JDBC/ODBC) + ODBC & JDBC >> * Client (for full Drill async, columnar) >> * Storage plugin >> * Format plugin >> * System/session options >> * Queueing (e.g. ZK-based queues) >> * Rest API >> * Resource Planning (e.g. max query memory per node) >> * Metadata access, storage (e.g. file system locations vs. a metastore) >> * Metadata files formats (Parquet, views, etc.) >> >> Lower priority for future releases: >> >> >> * Query Planning (e.g. Calcite rules) >> * Config options >> * SQL syntax, especially Drill extensions >> * UDF >> * Management (e.g. JMX, Rest API calls, etc.) >> * Drill File System (HDFS) >> * Web UI >> * Shell scripts >> >> There are certainly more. Please suggest those that are missing. I’ve >> taken a rough cut at which APIs need forward/backward compatibility first, >> in part based on those that are the “most public” and most likely to >> change. Others are important, but we can’t do them all at once. >> >> Thanks, >> >> - Paul >> >> On Aug 29, 2017, at 6:00 PM, Aman Sinha mansi...@apache.org>> wrote: >> >> Hi Paul, >> certainly makes sense to have the API compatibility discussions during this >> hackathon. The 2.0 release may be a good checkpoint to introduce breaking >> changes necessitating changes to the ODBC/JDBC drivers and other external >> applications. 
As part of this exercise (not during the hackathon but as a >> follow-up action), we also should clearly identify the "public" interfaces. >> >> >> I will add this to the agenda. >> >> thanks, >> -Aman >> >> On Tue, Aug 29, 2017 at 2:08 PM, Paul Rogers prog...@mapr.com>> wrote: >> >> Thanks Aman for organizing the Hackathon! >> >> The list included many good ideas for Drill 2.0. Some of those require >> changes to Drill’s “public” interfaces (file format, client protocol, SQL >> behavior, etc.) >> >> At present, Drill has no good mechanism to handle backward/forward >> compatibility at the API level. Protobuf versioning certainly helps, but >> can’t completely solve semantic changes (where a field changes meaning, or >> a non-Protobuf data chunk changes format.) As just one concrete example, >> changing to Arrow will break pre-Arrow ODBC/JDBC drivers because class >> names and data formats will change. >> >> Perhaps we can prioritize, for the proposed 2.0 release, a
Re: Code Generation Question
We had a quick discussion. There is some doubt that Java can correctly optimize code that uses subclasses. The problem is that, for one query, the JIT wants to optimize the code one way; for another, the JIT wants to optimize it a different way. By having copies of the byte codes, the JIT can optimize the code differently for different queries. Of course, there is quite a bit of code that is common. Should we copy that code for each operator also? Only testing will reveal the best path forward. For now, there is enough FUD (fear, uncertainty and doubt) that we should leave things as they are for production. (Feel free to use plain Java in development.) Thanks, - Paul > On Sep 15, 2017, at 1:21 PM, Boaz Ben-Zvi wrote: > > Hi Tim, > > The latest Pull Request for the Hash Aggr operator (#938) does turn the > “plain java” on for the mainline code, as these new template code changes (in > the Hash Table) caused the “byte twiddling” to break in some subtle way. > This is the first attempt; and as it (hopefully) will work well we’ll > continue with other operators. > >Thanks, > >Boaz > > On 9/15/17, 11:34 AM, "Paul Rogers" wrote: > >Hi Tim, > >This question has come up multiple times. The “plain Java” approach is very handy > for developing code with code generation. It also seems to be faster, smaller > and simpler than the byte-code-merge mechanism. (However, rewriting byte > codes has the benefit of sounding much more sophisticated than simply > invoking the Java compiler!) > >I suspect that there is a healthy concern that there may be subtle > problems with letting Java compile code without our help in twiddling with > the byte codes. > >The way to address this concern is to run a full set of functional and > performance tests with the “plain Java” mechanism turned on. 
But, no one has > had the time to do that… > >That said, feel free to turn “plain Java" on during development; we now > have sufficient experience to show that Java does, at least in development, > produce code at least as good as what we produce via our byte-code merge > mechanisms. With the added benefit that you can debug the code. (By contrast, > when using the byte-code merge approach, there is no matching source code for > the debugger to step through… I believe that folks have, instead, used print > statements to visualize the execution flow.) > >Thanks, > >- Paul > >> On Sep 14, 2017, at 1:41 PM, Timothy Farkas wrote: >> >> Hi All, >> >> As I've been looking at the TopN operator and code generation, I've been >> wondering why we have 2 forms of code generation: >> >> >> * One is the method of stitching compiled methods into a template class >> with ASM. >> * The other simply creates a class that extends the TemplateClass and >> compiles it without using custom ASM techniques. This is the PlainJava >> technique. >> >> With my high level understanding, it seems like using the PlainJava approach >> would be the simplest, and would also probably be the most performant since >> we inherit all the java compiler optimizations. Is there a specific reason >> why we still use our custom ASM technique? Would it be safe to start >> retiring the old ASM technique in favor of PlainJava? >> >> Thanks, >> Tim > > >
Re: Code Generation Question
Hi Tim, The latest Pull Request for the Hash Aggr operator (#938) does turn the “plain java” on for the mainline code, as these new template code changes (in the Hash Table) caused the “byte twiddling” to break in some subtle way. This is the first attempt; and as it (hopefully) will work well we’ll continue with other operators. Thanks, Boaz On 9/15/17, 11:34 AM, "Paul Rogers" wrote: Hi Tim, This question has come up multiple times. The “plain Java” approach is very handy for developing code with code generation. It also seems to be faster, smaller and simpler than the byte-code-merge mechanism. (However, rewriting byte codes has the benefit of sounding much more sophisticated than simply invoking the Java compiler!) I suspect that there is a healthy concern that there may be subtle problems with letting Java compile code without our help in twiddling with the byte codes. The way to address this concern is to run a full set of functional and performance tests with the “plain Java” mechanism turned on. But, no one has had the time to do that… That said, feel free to turn “plain Java” on during development; we now have sufficient experience to show that Java does, at least in development, produce code at least as good as what we produce via our byte-code merge mechanisms. With the added benefit that you can debug the code. (By contrast, when using the byte-code merge approach, there is no matching source code for the debugger to step through… I believe that folks have, instead, used print statements to visualize the execution flow.) Thanks, - Paul > On Sep 14, 2017, at 1:41 PM, Timothy Farkas wrote: > > Hi All, > > As I've been looking at the TopN operator and code generation, I've been wondering why we have 2 forms of code generation: > > > * One is the method of stitching compiled methods into a template class with ASM. > * The other simply creates a class that extends the TemplateClass and compiles it without using custom ASM techniques. This is the PlainJava technique. 
> > With my high level understanding, it seems like using the PlainJava approach would be the simplest, and would also probably be the most performant since we inherit all the java compiler optimizations. Is there a specific reason why we still use our custom ASM technique? Would it be safe to start retiring the old ASM technique in favor of PlainJava? > > Thanks, > Tim
Re: Code Generation Question
Hi Tim, This question has come up multiple times. The “plain Java” approach is very handy for developing code with code generation. It also seems to be faster, smaller and simpler than the byte-code-merge mechanism. (However, rewriting byte codes has the benefit of sounding much more sophisticated than simply invoking the Java compiler!) I suspect that there is a healthy concern that there may be subtle problems with letting Java compile code without our help in twiddling with the byte codes. The way to address this concern is to run a full set of functional and performance tests with the “plain Java” mechanism turned on. But, no one has had the time to do that… That said, feel free to turn “plain Java” on during development; we now have sufficient experience to show that Java does, at least in development, produce code at least as good as what we produce via our byte-code merge mechanisms. With the added benefit that you can debug the code. (By contrast, when using the byte-code merge approach, there is no matching source code for the debugger to step through… I believe that folks have, instead, used print statements to visualize the execution flow.) Thanks, - Paul > On Sep 14, 2017, at 1:41 PM, Timothy Farkas wrote: > > Hi All, > > As I've been looking at the TopN operator and code generation, I've been > wondering why we have 2 forms of code generation: > > > * One is the method of stitching compiled methods into a template class > with ASM. > * The other simply creates a class that extends the TemplateClass and > compiles it without using custom ASM techniques. This is the PlainJava > technique. > > With my high level understanding, it seems like using the PlainJava approach > would be the simplest, and would also probably be the most performant since > we inherit all the java compiler optimizations. Is there a specific reason > why we still use our custom ASM technique? Would it be safe to start retiring > the old ASM technique in favor of PlainJava? > > Thanks, > Tim
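For context, the “plain Java” path being discussed can be sketched as a stand-alone demo — a simplified model, not Drill's actual code-generation classes — that generates a subclass of a template, compiles it with the regular Java compiler via javax.tools, and loads it, with no byte-code rewriting involved:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class PlainJavaDemo {
    // Template "operator" that generated code extends, standing in for
    // Drill's template classes.
    public static abstract class Template {
        public abstract int eval(int x);
    }

    // Generate a subclass as Java source, compile it with the system
    // compiler (requires a JDK, not a bare JRE), and load an instance.
    public static Template compile(String className, String body) throws Exception {
        String source = "public class " + className
            + " extends PlainJavaDemo.Template {\n"
            + "  public int eval(int x) { " + body + " }\n"
            + "}\n";
        Path dir = Files.createTempDirectory("codegen");
        Path file = dir.resolve(className + ".java");
        Files.writeString(file, source);
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        // Compile against the current classpath so Template is visible.
        int rc = javac.run(null, null, null,
            "-cp", System.getProperty("java.class.path"),
            "-d", dir.toString(), file.toString());
        if (rc != 0) throw new IllegalStateException("compilation failed");
        try (URLClassLoader loader = new URLClassLoader(
                 new URL[] { dir.toUri().toURL() },
                 PlainJavaDemo.class.getClassLoader())) {
            return (Template) loader.loadClass(className)
                .getDeclaredConstructor().newInstance();
        }
    }

    public static void main(String[] args) throws Exception {
        Template t = compile("GeneratedDoubler", "return x * 2;");
        System.out.println(t.eval(21)); // prints 42
    }
}
```

Because the generated class is ordinary source compiled by javac, a debugger can step through it — which is the development-time advantage Paul describes above.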
[jira] [Resolved] (DRILL-5724) Scan on a local directory containing multiple text files (one or more empty) throws FileNotFoundException
[ https://issues.apache.org/jira/browse/DRILL-5724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prasad Nagaraj Subramanya resolved DRILL-5724. -- Resolution: Cannot Reproduce > Scan on a local directory containing multiple text files (one or more empty) > throws FileNotFoundException > - > > Key: DRILL-5724 > URL: https://issues.apache.org/jira/browse/DRILL-5724 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Text & CSV >Affects Versions: 1.11.0 >Reporter: Prasad Nagaraj Subramanya > > 1) Create a directory having multiple text files (one or more empty) > 2) Do a scan on the directory > {code} > select * from lfs.`/home/user/dir1`; > {code} > The query throws the below error- > {code} > Error: SYSTEM ERROR: FileNotFoundException: File > file:///home/user/dir1/ does not exist > Setup failed for CompliantTextRecordReader > Fragment 1:2 > {code} > Issue reproducible with - csv, tsv and psv files -- This message was sent by Atlassian JIRA (v6.4.14#64029)
Re: Drill 2.0 (design) hackathon
Hi All, We are looking forward to hosting the hackathon on Monday. Just a few updates on the logistics and agenda: • We are expecting over 25 people attending the event – you can see the attendee list at the Eventbrite site - https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285 • Breakfast will be served starting at 8:30AM – we would like to begin promptly at 9AM • The agenda has been updated to reflect the speakers (see the update in the sheet - https://docs.google.com/spreadsheets/d/1PEpgmBNAaPcu9UhWmZ8yPYtXbUGqOAYwH87alWkpCic/edit#gid=0 ) o Key Note & Introduction – Ted Dunning, Parth Chandra and Aman Sinha o Community Contributions – Anil Kumar, John Omernik, Charles Givre and Ted Dunning o Two tracks for technical design discussions – some topics have initial thoughts and some will have open brainstorming discussions o Once the discussions are concluded, we will have summaries presented and notes shared with the community • We will have a WebEx for the first two sessions. For the two tracks, we will either continue the WebEx or have Hangout links (will publish them to the google sheet) "JOIN WEBEX MEETING https://mapr.webex.com/mapr/j.php?MTID=m9d39036e3953cce59ea81250c70c6c76 Meeting number (access code): 806 111 950 Meeting password: ApacheDrill" • For the attendees in person, we have made bookings for a dinner in the evening - https://www.yelp.com/biz/chili-garden-restaurant-milpitas Looking forward to a fantastic day for the Apache Drill community! Thanks, Pritesh On 9/5/17, 10:47 PM, "Aman Sinha" wrote: Here is the Eventbrite event for registration: https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285 Please register so we can plan for food and drinks appropriately. The link also contains a google doc link for the preliminary agenda and a 'Topics' tab with volunteer sign-up column. Please add your name to the area(s) of interest. Thanks and look forward to seeing you all! 
-Aman On Wed, Aug 30, 2017 at 9:44 AM, Paul Rogers wrote: > A partial list of Drill’s public APIs: > > IMHO, highest priority for Drill 2.0. > > > * JDBC/ODBC drivers > * Client (for JDBC/ODBC) + ODBC & JDBC > * Client (for full Drill async, columnar) > * Storage plugin > * Format plugin > * System/session options > * Queueing (e.g. ZK-based queues) > * Rest API > * Resource Planning (e.g. max query memory per node) > * Metadata access, storage (e.g. file system locations vs. a metastore) > * Metadata files formats (Parquet, views, etc.) > > Lower priority for future releases: > > > * Query Planning (e.g. Calcite rules) > * Config options > * SQL syntax, especially Drill extensions > * UDF > * Management (e.g. JMX, Rest API calls, etc.) > * Drill File System (HDFS) > * Web UI > * Shell scripts > > There are certainly more. Please suggest those that are missing. I’ve > taken a rough cut at which APIs need forward/backward compatibility first, > in part based on those that are the “most public” and most likely to > change. Others are important, but we can’t do them all at once. > > Thanks, > > - Paul > > On Aug 29, 2017, at 6:00 PM, Aman Sinha > wrote: > > Hi Paul, > certainly makes sense to have the API compatibility discussions during this > hackathon. The 2.0 release may be a good checkpoint to introduce breaking > changes necessitating changes to the ODBC/JDBC drivers and other external > applications. As part of this exercise (not during the hackathon but as a > follow-up action), we also should clearly identify the "public" interfaces. > > > I will add this to the agenda. > > thanks, > -Aman > > On Tue, Aug 29, 2017 at 2:08 PM, Paul Rogers > wrote: > > Thanks Aman for organizing the Hackathon! > > The list included many good ideas for Drill 2.0. Some of those require > changes to Drill’s “public” interfaces (file format, client protocol, SQL > behavior, etc.) 
> > At present, Drill has no good mechanism to handle backward/forward > compatibility at the API level. Protobuf versioning certainly helps, but > can’t completely solve semantic changes (where a field changes meaning, or > a non-Protobuf data chunk changes format.) As just one concrete example, > changing to Arrow will break pre-Arrow ODBC/JDBC drivers because class > names and data formats will change. > > Perhaps we can prioritize, for the proposed 2.0 release, a one-time set of > breaking changes that introduce a
[GitHub] drill issue #889: DRILL-5691: enhance scalar sub queries checking for the ca...
Github user weijietong commented on the issue: https://github.com/apache/drill/pull/889 @arina-ielchiieva @amansinha100 any further advice? ---
Resolving object type using ObjectVector
Hi Team, I am trying to implement a storage plugin for my database (it supports the Postgres JDBC driver and can store compound objects). I am able to parse primary objects in my RecordReader, but for object types I create a Copier of ObjectVector type and have overridden its copy method:

private class JavaObjectCopier extends Copier {
  public JavaObjectCopier(int columnIndex, ResultSet result, Mutator mutator) {
    super(columnIndex, result, mutator);
  }

  @Override
  void copy(int index) throws SQLException {
    // this object is of type java.util.HashMap
    Object object = result.getObject(columnIndex);
  }
}

The mutator.setSafe method accepts a long value or an object holder, so I am trying to set my object into the holder's obj field and pass the holder to the method. I also tried calling the set method directly, and other brute-force ways, but it throws an exception:

> Caused by: java.lang.UnsupportedOperationException: ObjectVector does not > support this > at > org.apache.drill.exec.vector.ObjectVector.makeTransferPair(ObjectVector.java:159) > ~[vector-1.11.0.jar:1.11.0] > at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch. > setupNewSchema(ProjectRecordBatch.java:441) ~[drill-java-exec-1.11.0.jar: > 1.11.0]

Can anybody please point out if I am missing anything? Thanks, Charuta