No Drill hangout next Tuesday 19th Sept

2017-09-15 Thread Aman Sinha
Drillers,
Due to developers attending the hackathon, we won't be having the Drill
hangout next Tuesday. The next one will be on Tuesday, Oct 3rd.

See you then!
-Aman


[jira] [Created] (DRILL-5797) Use more often the new parquet reader

2017-09-15 Thread Damien Profeta (JIRA)
Damien Profeta created DRILL-5797:
-

 Summary: Use more often the new parquet reader
 Key: DRILL-5797
 URL: https://issues.apache.org/jira/browse/DRILL-5797
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - Parquet
Reporter: Damien Profeta


The choice between the regular parquet reader and the optimized one is based on 
the types of the columns present in the file. But what matters is the set of 
columns actually read by the query. We can increase the number of cases where 
the optimized reader is used by checking whether the projected columns are 
simple or not.
This is an optimization while waiting for the fast parquet reader to handle 
complex structures.
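A rough sketch of the idea (the names below are hypothetical, not Drill's actual reader-selection code): only the columns the query projects need to be simple, regardless of what else the file contains.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of reader selection based on projected columns only.
class ReaderChoice {
    enum ColumnKind { SIMPLE, COMPLEX }   // COMPLEX: maps, arrays, nested groups

    // The optimized reader can be used when every *projected* column is
    // simple, even if the file also contains complex columns the query
    // never touches.
    static boolean canUseOptimizedReader(Map<String, ColumnKind> fileSchema,
                                         List<String> projectedColumns) {
        return projectedColumns.stream()
                .allMatch(c -> fileSchema.get(c) == ColumnKind.SIMPLE);
    }
}
```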



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (DRILL-5796) Filter pruning for multi rowgroup parquet file

2017-09-15 Thread Damien Profeta (JIRA)
Damien Profeta created DRILL-5796:
-

 Summary: Filter pruning for multi rowgroup parquet file
 Key: DRILL-5796
 URL: https://issues.apache.org/jira/browse/DRILL-5796
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - Parquet
Reporter: Damien Profeta


Today, filter pruning uses the file name as the partitioning key. This means you 
can remove a partition only if the whole file belongs to the same partition. With 
parquet, you can also prune when the rowgroups partition your dataset, by making 
the unit of work the rowgroup instead of the file.





[jira] [Created] (DRILL-5795) Filter pushdown for parquet handles multi rowgroup file

2017-09-15 Thread Damien Profeta (JIRA)
Damien Profeta created DRILL-5795:
-

 Summary: Filter pushdown for parquet handles multi rowgroup file
 Key: DRILL-5795
 URL: https://issues.apache.org/jira/browse/DRILL-5795
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - Parquet
Reporter: Damien Profeta


DRILL-1950 implemented filter pushdown for parquet files, but only in the case 
of one rowgroup per parquet file. In the case of multiple rowgroups per file, it 
detects that a rowgroup can be pruned but then tells the drillbit to read the 
whole file, which leads to a performance issue.

Having multiple rowgroups per file helps to handle a partitioned dataset and 
still read only the relevant subset of the data, without ending up with more 
files than really needed.
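As a sketch of what rowgroup-level pruning means (the types below are illustrative stand-ins, not Drill's or parquet's actual metadata classes): each rowgroup carries min/max statistics in the file footer, so a pushed-down filter can discard individual rowgroups instead of only whole files.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for per-rowgroup footer statistics.
class RowGroupPruning {
    static class RowGroupStats {
        final long min, max;              // min/max of the filter column
        RowGroupStats(long min, long max) { this.min = min; this.max = max; }
    }

    // Keep only the rowgroups whose [min, max] range can satisfy a
    // predicate "col = value"; the rest are pruned, so the drillbit
    // reads a subset of the file rather than the whole file.
    static List<RowGroupStats> pruneForEquals(List<RowGroupStats> groups, long value) {
        List<RowGroupStats> kept = new ArrayList<>();
        for (RowGroupStats g : groups) {
            if (value >= g.min && value <= g.max) {
                kept.add(g);
            }
        }
        return kept;
    }
}
```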





[GitHub] drill pull request #944: DRILL-5425: Support HTTP Kerberos auth using SPNEGO

2017-09-15 Thread sindhurirayavaram
GitHub user sindhurirayavaram opened a pull request:

https://github.com/apache/drill/pull/944

DRILL-5425: Support HTTP Kerberos auth using SPNEGO

SPNEGO extends Kerberos authentication to the Drill Web UI. 

Things to be added:
- Unit tests
- Showing the login option depending on the configured mechanisms


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sindhurirayavaram/drill Spnego

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/944.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #944


commit c770a2e7b0e75c4f18b954d8ada36469f00c8355
Author: mapr 
Date:   2017-09-11T23:56:22Z

Added comments

commit 6015e98f08c1961bbbc9911946522f9a076e87e0
Author: mapr 
Date:   2017-09-13T19:48:28Z

Added Test Cases

commit 5ff12ba1fe38fd728b0c33a686c665547dd21c50
Author: Sindhuri 
Date:   2017-09-15T22:21:53Z

Formatted code




---


Re: Drill 2.0 (design) hackathon

2017-09-15 Thread Charles Givre
Hi Pritesh, 
What time do you think you’d want me to present?  Also, should I make some 
slides?  
Best,
— C

> On Sep 15, 2017, at 13:23, Pritesh Maker  wrote:
> 
> Hi All
> 
> We are looking forward to hosting the hackathon on Monday. Just a few updates 
> on the logistics and agenda
> 
> • We are expecting over 25 people attending the event – you can see the 
> attendee list at the Eventbrite site -  
> https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285
>  
> 
> • Breakfast will be served starting at 8:30AM – we would like to begin 
> promptly at 9AM 
> 
> • The agenda has been updated to reflect the speakers (see the update in the 
> sheet - 
> https://docs.google.com/spreadsheets/d/1PEpgmBNAaPcu9UhWmZ8yPYtXbUGqOAYwH87alWkpCic/edit#gid=0
>  )
> o Key Note & Introduction – Ted Dunning, Parth Chandra and Aman Sinha 
> o Community Contributions – Anil Kumar, John Omernik, Charles Givre and Ted 
> Dunning 
> o Two tracks for technical design discussions – some topics have initial 
> thoughts written up and some will have open brainstorming discussions
> o Once the discussions are concluded, we will have summaries presented and 
> notes shared with the community
> 
> • We will have a WebEx for the first two sessions. For the two tracks, we 
> will either continue the WebEx or have Hangout links (will publish them to 
> the google sheet)
> "JOIN WEBEX MEETING
> https://mapr.webex.com/mapr/j.php?MTID=m9d39036e3953cce59ea81250c70c6c76
> Meeting number (access code): 806 111 950
> Meeting password: ApacheDrill"
> 
> • For the attendees in person, we have made bookings for a dinner in the 
> evening - https://www.yelp.com/biz/chili-garden-restaurant-milpitas 
> 
> Looking forward to a fantastic day for the Apache Drill community!
> 
> Thanks,
> Pritesh
> 
> 
> 
> On 9/5/17, 10:47 PM, "Aman Sinha"  wrote:
> 
>Here is the Eventbrite event for registration:
> 
>
> https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285
> 
>Please register so we can plan for food and drinks appropriately.
> 
>The link also contains a google doc link for the preliminary agenda and a
>'Topics' tab with volunteer sign-up column.  Please add your name to the
>area(s) of interest.
> 
>Thanks and look forward to seeing you all !
> 
>-Aman
> 
>On Wed, Aug 30, 2017 at 9:44 AM, Paul Rogers  wrote:
> 
>> A partial list of Drill’s public APIs:
>> 
>> IMHO, highest priority for Drill 2.0.
>> 
>> 
>>  *   JDBC/ODBC drivers
>>  *   Client (for JDBC/ODBC) + ODBC & JDBC
>>  *   Client (for full Drill async, columnar)
>>  *   Storage plugin
>>  *   Format plugin
>>  *   System/session options
>>  *   Queueing (e.g. ZK-based queues)
>>  *   Rest API
>>  *   Resource Planning (e.g. max query memory per node)
>>  *   Metadata access, storage (e.g. file system locations vs. a metastore)
>>  *   Metadata files formats (Parquet, views, etc.)
>> 
>> Lower priority for future releases:
>> 
>> 
>>  *   Query Planning (e.g. Calcite rules)
>>  *   Config options
>>  *   SQL syntax, especially Drill extensions
>>  *   UDF
>>  *   Management (e.g. JMX, Rest API calls, etc.)
>>  *   Drill File System (HDFS)
>>  *   Web UI
>>  *   Shell scripts
>> 
>> There are certainly more. Please suggest those that are missing. I’ve
>> taken a rough cut at which APIs need forward/backward compatibility first,
>> in part based on those that are the “most public” and most likely to
>> change. Others are important, but we can’t do them all at once.
>> 
>> Thanks,
>> 
>> - Paul
>> 
>> On Aug 29, 2017, at 6:00 PM, Aman Sinha <mansi...@apache.org> wrote:
>> 
>> Hi Paul,
>> certainly makes sense to have the API compatibility discussions during this
>> hackathon.  The 2.0 release may be a good checkpoint to introduce breaking
>> changes necessitating changes to the ODBC/JDBC drivers and other external
>> applications. As part of this exercise (not during the hackathon but as a
>> follow-up action), we also should clearly identify the "public" interfaces.
>> 
>> 
>> I will add this to the agenda.
>> 
>> thanks,
>> -Aman
>> 
>> On Tue, Aug 29, 2017 at 2:08 PM, Paul Rogers <prog...@mapr.com> wrote:
>> 
>> Thanks Aman for organizing the Hackathon!
>> 
>> The list included many good ideas for Drill 2.0. Some of those require
>> changes to Drill’s “public” interfaces (file format, client protocol, SQL
>> behavior, etc.)
>> 
>> At present, Drill has no good mechanism to handle backward/forward
>> compatibility at the API level. Protobuf versioning certainly helps, but
>> can’t completely solve semantic changes (where a field changes meaning, or
>> a non-Protobuf data chunk changes format.) As just one concrete example,
>> changing to Arrow will break pre-Arrow ODBC/JDBC drivers because class
>> names and data formats will change.
>> 
>> Perhaps we can prioritize, for the proposed 2.0 release, a 

Re: Code Generation Question

2017-09-15 Thread Paul Rogers
We had a quick discussion. There is some doubt that Java can correctly optimize 
code that uses subclasses. The problem is that, for one query, the JIT wants to 
optimize the code one way; for another query, it wants to optimize it a 
different way. By having separate copies of the byte codes, the JIT can optimize 
the code differently for different queries.

Of course, there is quite a bit of code that is common. Should we copy that 
code for each operator also?

Only testing will reveal the best path forward. For now, there is enough FUD 
(fear, uncertainty and doubt) that we should leave things as they are for 
production. (Feel free to use plain Java in development.)
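A toy illustration of the structure under discussion (class names made up, not Drill's actual templates): in the "plain Java" approach the generated per-query code is an ordinary subclass of the template, compiled by the Java compiler; because each query gets its own class, the JIT can still profile and optimize each query's code independently.

```java
// Toy sketch of the template/subclass structure used by "plain Java"
// code generation; the names are invented for illustration.
class PlainJavaSketch {
    // The template holds the common driver logic shared by all queries.
    static abstract class CopierTemplate {
        int copied;
        final void copyAll(int[] src, int[] dst) {
            for (int i = 0; i < src.length; i++) {
                doCopy(src, dst, i);      // per-query generated hook
                copied++;
            }
        }
        protected abstract void doCopy(int[] src, int[] dst, int i);
    }

    // In Drill this subclass would be generated and compiled at runtime,
    // one class per query, giving the JIT a separate profile per query.
    static class GeneratedCopier extends CopierTemplate {
        @Override protected void doCopy(int[] src, int[] dst, int i) {
            dst[i] = src[i];
        }
    }
}
```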

Thanks,

- Paul

> On Sep 15, 2017, at 1:21 PM, Boaz Ben-Zvi  wrote:
> 
> Hi Tim,
> 
> The latest Pull Request for the Hash Aggr operator (#938) does turn the 
> “plain java” on for the mainline code, as these new template code changes (in 
> the Hash Table) caused the “byte twiddling” to break in some subtle way.
> This is the first attempt; if it (hopefully) works well, we'll 
> continue with other operators.
> 
>Thanks,
> 
>Boaz 
> 
> On 9/15/17, 11:34 AM, "Paul Rogers"  wrote:
> 
>Hi Tim,
> 
>This question has come up multiple times. The “plain Java” is very handy 
> for developing code with code generation. It also seems to be faster, smaller 
> and simpler than the byte-code-merge mechanism. (However, rewriting byte 
> codes has the benefit of sounding much more sophisticated than simply 
> invoking the Java compiler!)
> 
>I suspect that there is a healthy concern that there may be subtle 
> problems with letting Java compile code without our help in twiddling with 
> the byte codes.
> 
>The way to address this concern is to run a full set of functional and 
> performance tests with the “plain Java” mechanism turned on. But no one has 
> had the time to do that…
> 
>That said, feel free to turn “plain Java" on during development; we now 
> have sufficient experience to show that Java does, at least in development, 
> produce code at least as good as what we produce via our byte-code merge 
> mechanisms. With the added benefit that you can debug the code. (By contrast, 
> when using the byte-code merge approach, there is no matching source code for 
> the debugger to step through… I believe that folks have, instead, used print 
> statements to visualize the execution flow.)
> 
>Thanks,
> 
>- Paul
> 
>> On Sep 14, 2017, at 1:41 PM, Timothy Farkas  wrote:
>> 
>> Hi All,
>> 
>> As I've been looking at the TopN operator and code generation, I've been 
>> wondering why we have 2 forms of code generation:
>> 
>> 
>> *   One is the method of stitching compiled methods into a template class 
>> with ASM.
>> *   The other simply creates a class that extends the TemplateClass and 
>> compiles it without using custom ASM techniques. This is the PlainJava 
>> technique.
>> 
>> With my high level understanding, it seems like using the PlainJava approach 
>> would be the simplest, and would also probably be the most performant since 
>> we inherit all the java compiler optimizations. Is there a specific reason 
>> why we still use our custom ASM technique? Would it be safe to start 
>> retiring the old ASM technique in favor of PlainJava?
>> 
>> Thanks,
>> Tim
> 
> 
> 



Re: Code Generation Question

2017-09-15 Thread Boaz Ben-Zvi
 Hi Tim,

 The latest Pull Request for the Hash Aggr operator (#938) does turn the 
“plain java” on for the mainline code, as these new template code changes (in 
the Hash Table) caused the “byte twiddling” to break in some subtle way.
This is the first attempt; if it (hopefully) works well, we'll continue 
with other operators.

Thanks,

Boaz 

On 9/15/17, 11:34 AM, "Paul Rogers"  wrote:

Hi Tim,

This question has come up multiple times. The “plain Java” is very handy 
for developing code with code generation. It also seems to be faster, smaller 
and simpler than the byte-code-merge mechanism. (However, rewriting byte codes 
has the benefit of sounding much more sophisticated than simply invoking the 
Java compiler!)

I suspect that there is a healthy concern that there may be subtle problems 
with letting Java compile code without our help in twiddling with the byte 
codes.

The way to address this concern is to run a full set of functional and 
performance tests with the “plain Java” mechanism turned on. But no one has had 
the time to do that…

That said, feel free to turn “plain Java" on during development; we now 
have sufficient experience to show that Java does, at least in development, 
produce code at least as good as what we produce via our byte-code merge 
mechanisms. With the added benefit that you can debug the code. (By contrast, 
when using the byte-code merge approach, there is no matching source code for 
the debugger to step through… I believe that folks have, instead, used print 
statements to visualize the execution flow.)

Thanks,

- Paul

> On Sep 14, 2017, at 1:41 PM, Timothy Farkas  wrote:
> 
> Hi All,
> 
> As I've been looking at the TopN operator and code generation, I've been 
wondering why we have 2 forms of code generation:
> 
> 
>  *   One is the method of stitching compiled methods into a template 
class with ASM.
>  *   The other simply creates a class that extends the TemplateClass and 
compiles it without using custom ASM techniques. This is the PlainJava 
technique.
> 
> With my high level understanding, it seems like using the PlainJava 
approach would be the simplest, and would also probably be the most performant 
since we inherit all the java compiler optimizations. Is there a specific 
reason why we still use our custom ASM technique? Would it be safe to start 
retiring the old ASM technique in favor of PlainJava?
> 
> Thanks,
> Tim





Re: Code Generation Question

2017-09-15 Thread Paul Rogers
Hi Tim,

This question has come up multiple times. The “plain Java” is very handy for 
developing code with code generation. It also seems to be faster, smaller and 
simpler than the byte-code-merge mechanism. (However, rewriting byte codes 
has the benefit of sounding much more sophisticated than simply invoking the 
Java compiler!)

I suspect that there is a healthy concern that there may be subtle problems 
with letting Java compile code without our help in twiddling with the byte 
codes.

The way to address this concern is to run a full set of functional and 
performance tests with the “plain Java” mechanism turned on. But no one has had 
the time to do that…

That said, feel free to turn “plain Java" on during development; we now have 
sufficient experience to show that Java does, at least in development, produce 
code at least as good as what we produce via our byte-code merge mechanisms. 
With the added benefit that you can debug the code. (By contrast, when using 
the byte-code merge approach, there is no matching source code for the debugger 
to step through… I believe that folks have, instead, used print statements to 
visualize the execution flow.)

Thanks,

- Paul

> On Sep 14, 2017, at 1:41 PM, Timothy Farkas  wrote:
> 
> Hi All,
> 
> As I've been looking at the TopN operator and code generation, I've been 
> wondering why we have 2 forms of code generation:
> 
> 
>  *   One is the method of stitching compiled methods into a template class 
> with ASM.
>  *   The other simply creates a class that extends the TemplateClass and 
> compiles it without using custom ASM techniques. This is the PlainJava 
> technique.
> 
> With my high level understanding, it seems like using the PlainJava approach 
> would be the simplest, and would also probably be the most performant since 
> we inherit all the java compiler optimizations. Is there a specific reason 
> why we still use our custom ASM technique? Would it be safe to start retiring 
> the old ASM technique in favor of PlainJava?
> 
> Thanks,
> Tim



[jira] [Resolved] (DRILL-5724) Scan on a local directory containing multiple text files (one or more empty) throws FileNotFoundException

2017-09-15 Thread Prasad Nagaraj Subramanya (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasad Nagaraj Subramanya resolved DRILL-5724.
--
Resolution: Cannot Reproduce

> Scan on a local directory containing multiple text files (one or more empty) 
> throws FileNotFoundException
> -
>
> Key: DRILL-5724
> URL: https://issues.apache.org/jira/browse/DRILL-5724
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 1.11.0
>Reporter: Prasad Nagaraj Subramanya
>
> 1) Create a directory having multiple text files (one or more empty)
> 2) Do a scan on the directory
> {code}
> select * from lfs.`/home/user/dir1`;
> {code}
> The query throws the below error-
> {code}
> Error: SYSTEM ERROR: FileNotFoundException: File 
> file:///home/user/dir1/ does not exist
> Setup failed for CompliantTextRecordReader
> Fragment 1:2
> {code}
> Issue reproducible with - csv, tsv and psv files





Re: Drill 2.0 (design) hackathon

2017-09-15 Thread Pritesh Maker
Hi All

We are looking forward to hosting the hackathon on Monday. Just a few updates 
on the logistics and agenda

• We are expecting over 25 people attending the event – you can see the 
attendee list at the Eventbrite site -  
https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285
 

• Breakfast will be served starting at 8:30AM – we would like to begin promptly 
at 9AM 

• The agenda has been updated to reflect the speakers (see the update in the 
sheet - 
https://docs.google.com/spreadsheets/d/1PEpgmBNAaPcu9UhWmZ8yPYtXbUGqOAYwH87alWkpCic/edit#gid=0
 )
o Key Note & Introduction – Ted Dunning, Parth Chandra and Aman Sinha 
o Community Contributions – Anil Kumar, John Omernik, Charles Givre and Ted 
Dunning 
o Two tracks for technical design discussions – some topics have initial 
thoughts written up and some will have open brainstorming discussions
o Once the discussions are concluded, we will have summaries presented and 
notes shared with the community

• We will have a WebEx for the first two sessions. For the two tracks, we will 
either continue the WebEx or have Hangout links (will publish them to the 
google sheet)
"JOIN WEBEX MEETING
https://mapr.webex.com/mapr/j.php?MTID=m9d39036e3953cce59ea81250c70c6c76
Meeting number (access code): 806 111 950
Meeting password: ApacheDrill"

• For the attendees in person, we have made bookings for a dinner in the 
evening - https://www.yelp.com/biz/chili-garden-restaurant-milpitas 

Looking forward to a fantastic day for the Apache Drill community!

Thanks,
Pritesh



On 9/5/17, 10:47 PM, "Aman Sinha"  wrote:

Here is the Eventbrite event for registration:


https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285

Please register so we can plan for food and drinks appropriately.

The link also contains a google doc link for the preliminary agenda and a
'Topics' tab with volunteer sign-up column.  Please add your name to the
area(s) of interest.

Thanks and look forward to seeing you all !

-Aman

On Wed, Aug 30, 2017 at 9:44 AM, Paul Rogers  wrote:

> A partial list of Drill’s public APIs:
>
> IMHO, highest priority for Drill 2.0.
>
>
>   *   JDBC/ODBC drivers
>   *   Client (for JDBC/ODBC) + ODBC & JDBC
>   *   Client (for full Drill async, columnar)
>   *   Storage plugin
>   *   Format plugin
>   *   System/session options
>   *   Queueing (e.g. ZK-based queues)
>   *   Rest API
>   *   Resource Planning (e.g. max query memory per node)
>   *   Metadata access, storage (e.g. file system locations vs. a 
metastore)
>   *   Metadata files formats (Parquet, views, etc.)
>
> Lower priority for future releases:
>
>
>   *   Query Planning (e.g. Calcite rules)
>   *   Config options
>   *   SQL syntax, especially Drill extensions
>   *   UDF
>   *   Management (e.g. JMX, Rest API calls, etc.)
>   *   Drill File System (HDFS)
>   *   Web UI
>   *   Shell scripts
>
> There are certainly more. Please suggest those that are missing. I’ve
> taken a rough cut at which APIs need forward/backward compatibility first,
> in part based on those that are the “most public” and most likely to
> change. Others are important, but we can’t do them all at once.
>
> Thanks,
>
> - Paul
>
> On Aug 29, 2017, at 6:00 PM, Aman Sinha wrote:
>
> Hi Paul,
> certainly makes sense to have the API compatibility discussions during 
this
> hackathon.  The 2.0 release may be a good checkpoint to introduce breaking
> changes necessitating changes to the ODBC/JDBC drivers and other external
> applications. As part of this exercise (not during the hackathon but as a
> follow-up action), we also should clearly identify the "public" 
interfaces.
>
>
> I will add this to the agenda.
>
> thanks,
> -Aman
>
> On Tue, Aug 29, 2017 at 2:08 PM, Paul Rogers wrote:
>
> Thanks Aman for organizing the Hackathon!
>
> The list included many good ideas for Drill 2.0. Some of those require
> changes to Drill’s “public” interfaces (file format, client protocol, SQL
> behavior, etc.)
>
> At present, Drill has no good mechanism to handle backward/forward
> compatibility at the API level. Protobuf versioning certainly helps, but
> can’t completely solve semantic changes (where a field changes meaning, or
> a non-Protobuf data chunk changes format.) As just one concrete example,
> changing to Arrow will break pre-Arrow ODBC/JDBC drivers because class
> names and data formats will change.
>
> Perhaps we can prioritize, for the proposed 2.0 release, a one-time set of
> breaking changes that introduce a 

[GitHub] drill issue #889: DRILL-5691: enhance scalar sub queries checking for the ca...

2017-09-15 Thread weijietong
Github user weijietong commented on the issue:

https://github.com/apache/drill/pull/889
  
@arina-ielchiieva @amansinha100 any further advice ?


---


Resolving object type using ObjectVector

2017-09-15 Thread Charuta Rajopadhye
Hi Team,

I am trying to implement a storage plugin for my database (it supports the
Postgres JDBC driver and can store compound objects).
I am able to parse primitive objects in my RecordReader, but for object types
I create a Copier of ObjectVector type and have overridden its copy method:

private class JavaObjectCopier extends Copier {

    public JavaObjectCopier(int columnIndex, ResultSet result, Mutator mutator) {
        super(columnIndex, result, mutator);
    }

    @Override
    void copy(int index) throws SQLException {
        // this object is of type java.util.HashMap
        Object object = result.getObject(columnIndex);
    }
}

The mutator.setSafe method accepts a long value or an object holder, so I am
trying to set my object into the holder's obj field and passing the holder to
the method.

I also tried calling the set method directly, and other brute-force ways, but
it throws an exception:

> Caused by: java.lang.UnsupportedOperationException: ObjectVector does not
> support this
> at 
> org.apache.drill.exec.vector.ObjectVector.makeTransferPair(ObjectVector.java:159)
> ~[vector-1.11.0.jar:1.11.0]
> at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.
> setupNewSchema(ProjectRecordBatch.java:441) ~[drill-java-exec-1.11.0.jar:
> 1.11.0]



Can anybody please point out if I am missing anything?
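For context, a minimal self-contained sketch of the holder pattern described above; ObjectHolder and Mutator here are stand-ins written for illustration, not Drill's actual classes, and this sketch does not by itself resolve the makeTransferPair exception.

```java
// Stand-in sketch of the "set the object into the holder's obj field,
// pass the holder to setSafe" pattern; not Drill's real classes.
class HolderSketch {
    static class ObjectHolder {           // stand-in for Drill's object holder
        public Object obj;
    }

    static class Mutator {                // stand-in for the vector mutator
        private final Object[] values = new Object[16];
        void setSafe(int index, ObjectHolder holder) {
            values[index] = holder.obj;
        }
        Object get(int index) { return values[index]; }
    }

    static void copyObject(Mutator mutator, int index, Object value) {
        ObjectHolder holder = new ObjectHolder();
        holder.obj = value;               // e.g. the java.util.HashMap from the ResultSet
        mutator.setSafe(index, holder);
    }
}
```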


Thanks,

Charuta