Re: zeroVectors() interface for value vectors

2015-08-27 Thread Aman Sinha
Based on the discussion, it looks like doing what the first_value/last_value
window functions need from the value vectors without violating the
recommended state transitions requires a bit more thought, so that we don't
introduce a regression. Since testing is blocked on these for 1.2, can Hakim
proceed with his current fix? We could create a JIRA to revisit it post
1.2...

Aman

On Wed, Aug 26, 2015 at 3:07 PM, Julien Le Dem  wrote:

> I can take a look at the Vectors and add asserts to enforce that the
> contract is respected.
>
> On Wed, Aug 26, 2015 at 2:52 PM, Steven Phillips 
> wrote:
>
> > One possible exception to the access pattern occurs when vectors wrap
> > other vectors, specifically the offset vectors in Variable Length and
> > Repeated vectors. These vectors are accessed and mutated multiple
> > times. If we are going to implement strict enforcement, we need to
> > consider that case.
> >
> > On Tue, Aug 25, 2015 at 7:15 PM, Jacques Nadeau 
> > wrote:
> >
> > > Yes, my recommendation is to correct the usage in StreamingAggBatch.
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Tue, Aug 25, 2015 at 4:52 PM, Abdel Hakim Deneche <
> > > adene...@maprtech.com> wrote:
> > >
> > > > I think zeroVector() is mainly used to fill the vector with zeros,
> > > > which is fine if you call it while the vector is in the "mutate"
> > > > state, but StreamingAggBatch actually calls it after setting the
> > > > value count of the value vector, which is against the paradigm.
> > > >
> > > > On Tue, Aug 25, 2015 at 3:51 PM, Jacques Nadeau 
> > > > wrote:
> > > >
> > > > > In all but one situation, this is an internal concern (making
> > > > > sure to zero out the memory). For fixed width vectors, there is
> > > > > an assumption that an initial allocation is clean memory (e.g.
> > > > > all zeros in the case of an int vector). So this should be
> > > > > pulled off the public vector interface. The one place where it
> > > > > is being used today is StreamingAggBatch, and I think we should
> > > > > fix that to follow the state paradigm described above.
> > > > >
> > > > > --
> > > > > Jacques Nadeau
> > > > > CTO and Co-Founder, Dremio
> > > > >
> > > > > On Tue, Aug 25, 2015 at 3:41 PM, Abdel Hakim Deneche <
> > > > > adene...@maprtech.com> wrote:
> > > > >
> > > > > > Another question: the FixedWidthVector interface defines a
> > > > > > zeroVector() method that will "Zero out the underlying buffer
> > > > > > backing this vector", according to its javadoc.
> > > > > >
> > > > > > Where does this method fit in the value vector states
> > > > > > described earlier? It doesn't clear the vector, yet it doesn't
> > > > > > reset everything to the after-allocate state.
> > > > > >
> > > > > > On Tue, Aug 25, 2015 at 10:46 AM, Abdel Hakim Deneche <
> > > > > > adene...@maprtech.com> wrote:
> > > > > >
> > > > > > > One more question about the transition from allocate ->
> > > > > > > mutate. For fixed width vectors and BitVector you can
> > > > > > > actually call setSafe() without calling allocateNew() first
> > > > > > > and it will work. Should it throw an exception instead?
> > > > > > > Not calling allocateNew() has side effects that could cause
> > > > > > > setSafe() to throw an OversizedAllocationException if you
> > > > > > > call setSafe() then clear() multiple times.
> > > > > > >
> > > > > > > On Tue, Aug 25, 2015 at 10:01 AM, Chris Westin <
> > > > > > > chriswesti...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Maybe we should start by putting these rules in a comment
> > > > > > > > in the value vector base interfaces? The lack of such
> > > > > > > > information is why there are deviations and other
> > > > > > > > expectations.
> > > > > > > >
> > > > > > > > On Tue, Aug 25, 2015 at 8:22 AM, Jacques Nadeau <
> > > > > > > > jacq...@dremio.com> wrote:
> > > > > > > >
> > > > > > > > > There are a few unspoken "rules" around vectors:
> > > > > > > > >
> > > > > > > > > - values need to be written in order (e.g. index 0, 1,
> > > > > > > > > 2, 5)
> > > > > > > > > - nullable vectors start with all values as null before
> > > > > > > > > writing anything
> > > > > > > > > - for variable width types, the offset vector should be
> > > > > > > > > all zeros before writing
> > > > > > > > > - you must call setValueCount before a vector can be
> > > > > > > > > read
> > > > > > > > > - you should never write to a vector once it has been
> > > > > > > > > read.
> > > > > > > > >
> > > > > > > > > The ultimate goal is to get to the point where the
> > > > > > > > > interfaces guarantee this order of operations:
> > > > > > > > >
> > > > > > > > > allocate > mutate > setValueCount > access > clear (or
> > > > > > > > > allocate to start the process over, xxx).  An
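
A minimal sketch of the assert-based enforcement Julien is offering to add
(a hypothetical class, not Drill's actual ValueVector interfaces):

{code}
// Hypothetical sketch of the contract as a state machine -- not Drill's
// actual ValueVector API.
public class StatefulIntVector {
  enum State { UNALLOCATED, MUTATE, ACCESS }

  private State state = State.UNALLOCATED;
  private int[] data;       // stands in for the backing buffer
  private int valueCount;

  public void allocateNew(int capacity) {
    data = new int[capacity];          // fresh allocation is all zeros
    state = State.MUTATE;
  }

  public void setSafe(int index, int value) {
    assert state == State.MUTATE : "writing before allocate or after read";
    data[index] = value;
  }

  public void setValueCount(int count) {
    assert state == State.MUTATE;
    valueCount = count;
    state = State.ACCESS;              // vector becomes read-only
  }

  public int get(int index) {
    assert state == State.ACCESS : "setValueCount must be called before reading";
    assert index < valueCount : "reading past the declared value count";
    return data[index];
  }

  public void clear() {
    data = null;
    state = State.UNALLOCATED;         // allocateNew() starts the cycle over
  }
}
{code}

Run with -ea, a misuse like the one described above (mutating a vector after
its value count has been set) would fail fast instead of silently breaking
the contract.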

How to get started with a new format conversion and representation

2015-08-27 Thread Edmon Begoli
This might be more of a question for Parquet folks here than Drill-ers, but
nevertheless:

I would like to be able to convert EDI HL7 v.2 messages into Parquet
representation, and make them amenable to Drill querying.
(Here is a sample claim message 837p in HL7 representation (page 8):
http://www.vitahealth.org/Modules/ShowDocument2.aspx?documentid=545 )

This is a lengthy topic which I could discuss in detail, but for now I
would just like to know where and how to get started.

Thank you,
Edmon
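
One minimal way to get started, for what it's worth: parse each pipe-delimited
segment and write it out with parquet-avro, which Drill can then query
directly. This is a sketch only, assuming the parquet-avro module; the schema,
field mapping, and sample segments are made up for illustration (a real 837p
mapping would need a much richer, likely nested, schema):

{code}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;

public class Hl7ToParquet {
  // Toy schema: one output record per HL7 segment.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Segment\",\"fields\":["
          + "{\"name\":\"segmentId\",\"type\":\"string\"},"
          + "{\"name\":\"field1\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

  public static void main(String[] args) throws Exception {
    // Stand-ins for parsed message lines; HL7/EDI segments are pipe-delimited.
    String[] segments = { "CLM|12345|500", "NM1|IL|1|DOE|JOHN" };

    try (AvroParquetWriter<GenericRecord> writer =
             new AvroParquetWriter<>(new Path("claims.parquet"), SCHEMA)) {
      for (String line : segments) {
        String[] fields = line.split("\\|");
        GenericRecord rec = new GenericData.Record(SCHEMA);
        rec.put("segmentId", fields[0]);
        rec.put("field1", fields.length > 1 ? fields[1] : null);
        writer.write(rec);
      }
    }
  }
}
{code}

The resulting file should then be queryable from Drill as-is, e.g.
select * from dfs.`claims.parquet`.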


[jira] [Created] (DRILL-3719) Adding negative sign in front of EXTRACT triggers Assertion Error

2015-08-27 Thread Sean Hsuan-Yi Chu (JIRA)
Sean Hsuan-Yi Chu created DRILL-3719:


 Summary: Adding negative sign in front of EXTRACT triggers 
Assertion Error
 Key: DRILL-3719
 URL: https://issues.apache.org/jira/browse/DRILL-3719
 Project: Apache Drill
  Issue Type: Bug
  Components: Functions - Drill
Reporter: Sean Hsuan-Yi Chu
Assignee: Sean Hsuan-Yi Chu
 Fix For: 1.2.0


A simple repro:
When we type
{code}
select -EXTRACT(DAY FROM birth_date) from cp.`employee.json`;
{code}
we probably mean 
{code}
select -1 * EXTRACT(DAY FROM birth_date) from cp.`employee.json`;
{code}

However, the first one triggers an assertion error:
Error: SYSTEM ERROR: AssertionError: todo: implement syntax 
PREFIX(-(EXTRACT(FLAG(DAY), $0)))



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-3718) quotes in .tsv trigger exception

2015-08-27 Thread Sean Hsuan-Yi Chu (JIRA)
Sean Hsuan-Yi Chu created DRILL-3718:


 Summary: quotes in .tsv trigger exception 
 Key: DRILL-3718
 URL: https://issues.apache.org/jira/browse/DRILL-3718
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Text & CSV
Reporter: Sean Hsuan-Yi Chu
Assignee: Sean Hsuan-Yi Chu


Given a simple tsv file as below
{code}
"a" a
a   a
a
{code}

After getting the first quote, the TextReader just keeps going down the 
entire file, as opposed to stopping at the second ".
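
For illustration, the expected behavior looks roughly like the sketch below 
(not Drill's actual TextReader; parseField() is a made-up helper):

{code}
// Illustrative only -- not Drill's actual TextReader. Shows the expected
// contract: a quoted field ends at the matching closing quote.
public class QuoteParseSketch {
  static String parseField(String row, int start) {
    if (row.charAt(start) == '"') {
      int close = row.indexOf('"', start + 1); // stop at the second quote
      if (close < 0) {
        throw new IllegalStateException("unterminated quoted field");
      }
      return row.substring(start + 1, close);
    }
    int tab = row.indexOf('\t', start);        // .tsv fields are tab-delimited
    return tab < 0 ? row.substring(start) : row.substring(start, tab);
  }

  public static void main(String[] args) {
    System.out.println(parseField("\"a\"\ta", 0)); // prints: a
  }
}
{code}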



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-3717) Move functions to their own module and preprocess them at build time to speedup startup

2015-08-27 Thread Julien Le Dem (JIRA)
Julien Le Dem created DRILL-3717:


 Summary: Move functions to their own module and preprocess them at 
build time to speedup startup
 Key: DRILL-3717
 URL: https://issues.apache.org/jira/browse/DRILL-3717
 Project: Apache Drill
  Issue Type: Improvement
  Components: Execution - Codegen
Reporter: Julien Le Dem
Assignee: Chris Westin


All the functions included in the Drill distribution are scanned on the 
classpath and then parsed with Janino at startup.
This slows down startup and unit tests.
If they were in their own module it would be possible to preprocess them at 
build time and get a much better startup time, including in tests.

The classpath scanning library used (Reflections) already has a mechanism to 
persist its results. 
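
A sketch of how that persistence mechanism could be wired in (the package 
name and output path below are illustrative, not Drill's actual layout):

{code}
import org.reflections.Reflections;

public class PrecomputedScan {
  public static void main(String[] args) {
    // Build time: scan the function classes once and persist the metadata
    // into a resource that ships inside the jar.
    Reflections scanned = new Reflections("org.apache.drill.exec.expr.fn");
    scanned.save("target/classes/META-INF/reflections/drill-reflections.xml");

    // Startup / test time: Reflections.collect() merges every
    // META-INF/reflections/*-reflections.xml found on the classpath,
    // avoiding the expensive rescan and reparse.
    Reflections precomputed = Reflections.collect();
    System.out.println(precomputed.getStore());
  }
}
{code}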




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [What is the purpose of ExternalSort's MAX_SORT_BYTES

2015-08-27 Thread Khurram Faraaz
Thanks for confirming.

On Thu, Aug 27, 2015 at 2:04 PM, Abdel Hakim Deneche 
wrote:

> the change is part of the fix for DRILL-3555
> 
>
> Thx
>
> On Thu, Aug 27, 2015 at 2:00 PM, Khurram Faraaz 
> wrote:
>
> > Hakim, can you please log a JIRA for this change, so we have a record
> > of why this change was made. Thanks.
> >
> > On Thu, Aug 27, 2015 at 10:30 AM, Abdel Hakim Deneche <
> > adene...@maprtech.com
> > > wrote:
> >
> > > Thanks Steven
> > >
> > > On Thu, Aug 27, 2015 at 10:19 AM, Steven Phillips 
> > > wrote:
> > >
> > > > I think it probably isn't needed anymore. I believe it is a holdover
> > from
> > > > before spilling was implemented. It doesn't seem to serve any purpose
> > > now.
> > > >
> > > > On Thu, Aug 27, 2015 at 9:17 AM, Abdel Hakim Deneche <
> > > > adene...@maprtech.com>
> > > > wrote:
> > > >
> > > > > anyone ?
> > > > >
> > > > > On Tue, Aug 25, 2015 at 2:56 PM, Abdel Hakim Deneche <
> > > > > adene...@maprtech.com>
> > > > > wrote:
> > > > >
> > > > > > When running a window function query on large datasets,
> > > > > > increasing planner.memory.max_query_memory_per_node can actually
> > help
> > > > the
> > > > > > query not run out of memory. But in some cases this can cause
> some
> > > > issues
> > > > > > (see DRILL-3555 <
> https://issues.apache.org/jira/browse/DRILL-3555
> > >)
> > > > > >
> > > > > > This seems to be caused by a hardcoded limit in ExternalSort
> called
> > > > > > MAX_SORT_BYTES. What is the purpose of this limit ?
> > > > > >


Re: [What is the purpose of ExternalSort's MAX_SORT_BYTES

2015-08-27 Thread Abdel Hakim Deneche
the change is part of the fix for DRILL-3555


Thx

On Thu, Aug 27, 2015 at 2:00 PM, Khurram Faraaz 
wrote:

> Hakim, can you please log a JIRA for this change, so we have a record of
> why this change was made. Thanks.
>
> On Thu, Aug 27, 2015 at 10:30 AM, Abdel Hakim Deneche <
> adene...@maprtech.com
> > wrote:
>
> > Thanks Steven
> >
> > On Thu, Aug 27, 2015 at 10:19 AM, Steven Phillips 
> > wrote:
> >
> > > I think it probably isn't needed anymore. I believe it is a holdover
> from
> > > before spilling was implemented. It doesn't seem to serve any purpose
> > now.
> > >
> > > On Thu, Aug 27, 2015 at 9:17 AM, Abdel Hakim Deneche <
> > > adene...@maprtech.com>
> > > wrote:
> > >
> > > > anyone ?
> > > >
> > > > On Tue, Aug 25, 2015 at 2:56 PM, Abdel Hakim Deneche <
> > > > adene...@maprtech.com>
> > > > wrote:
> > > >
> > > > > When running a window function query on large datasets,
> > > > > increasing planner.memory.max_query_memory_per_node can actually
> help
> > > the
> > > > > query not run out of memory. But in some cases this can cause some
> > > issues
> > > > > (see DRILL-3555  >)
> > > > >
> > > > > This seems to be caused by a hardcoded limit in ExternalSort called
> > > > > MAX_SORT_BYTES. What is the purpose of this limit ?
> > > > >


Re: [What is the purpose of ExternalSort's MAX_SORT_BYTES

2015-08-27 Thread Khurram Faraaz
Hakim, can you please log a JIRA for this change, so we have a record of why
this change was made. Thanks.

On Thu, Aug 27, 2015 at 10:30 AM, Abdel Hakim Deneche  wrote:

> Thanks Steven
>
> On Thu, Aug 27, 2015 at 10:19 AM, Steven Phillips 
> wrote:
>
> > I think it probably isn't needed anymore. I believe it is a holdover from
> > before spilling was implemented. It doesn't seem to serve any purpose
> now.
> >
> > On Thu, Aug 27, 2015 at 9:17 AM, Abdel Hakim Deneche <
> > adene...@maprtech.com>
> > wrote:
> >
> > > anyone ?
> > >
> > > On Tue, Aug 25, 2015 at 2:56 PM, Abdel Hakim Deneche <
> > > adene...@maprtech.com>
> > > wrote:
> > >
> > > > When running a window function query on large datasets,
> > > > increasing planner.memory.max_query_memory_per_node can actually help
> > the
> > > > query not run out of memory. But in some cases this can cause some
> > issues
> > > > (see DRILL-3555 )
> > > >
> > > > This seems to be caused by a hardcoded limit in ExternalSort called
> > > > MAX_SORT_BYTES. What is the purpose of this limit ?
> > > >


[GitHub] drill pull request: DRILL-3555: Changing defaults for planner.memo...

2015-08-27 Thread adeneche
GitHub user adeneche opened a pull request:

https://github.com/apache/drill/pull/137

DRILL-3555: Changing defaults for planner.memory.max_query_memory_per…

…_node causes queries with window function to fail

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/adeneche/incubator-drill DRILL-3555

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/137.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #137


commit bd6f3749c6358933e61034ae8f211028c14c8cc6
Author: adeneche 
Date:   2015-08-27T19:57:23Z

DRILL-3555: Changing defaults for planner.memory.max_query_memory_per_node 
causes queries with window function to fail




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Review Request 37854: DRILL-2190, DRILL-2313: Problem resolved in Calcite. Added unit tests

2015-08-27 Thread Sean Hsuan-Yi Chu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37854/
---

Review request for drill, Aman Sinha and Jinfeng Ni.


Bugs: DRILL-2190 and DRILL-2313
https://issues.apache.org/jira/browse/DRILL-2190
https://issues.apache.org/jira/browse/DRILL-2313


Repository: drill-git


Description
---

Problem resolved in Calcite. Added unit tests


Diffs
-

  exec/java-exec/src/test/java/org/apache/drill/TestExampleQueries.java 6b74ecf 

Diff: https://reviews.apache.org/r/37854/diff/


Testing
---

unit


Thanks,

Sean Hsuan-Yi Chu



[jira] [Resolved] (DRILL-3716) Drill should push filter past aggregate in order to improve query performance.

2015-08-27 Thread Jinfeng Ni (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinfeng Ni resolved DRILL-3716.
---
Resolution: Fixed

> Drill should push filter past aggregate in order to improve query performance.
> --
>
> Key: DRILL-3716
> URL: https://issues.apache.org/jira/browse/DRILL-3716
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Query Planning & Optimization
>Reporter: Jinfeng Ni
>Assignee: Jinfeng Ni
> Fix For: 1.2.0
>
>
> For a query that has a filter on top of an aggregation, Drill currently 
> does not push the filter past the aggregation. As a result, we may miss 
> some optimization opportunities. For instance, such a filter could 
> potentially be pushed into the scan if it qualifies for partition pruning.
> For the following query:
> {code}
> select n_regionkey, cnt from 
>  (select n_regionkey, count(*) cnt 
>   from (select n.n_nationkey, n.n_regionkey, n.n_name 
>from cp.`tpch/nation.parquet` n 
>   left join 
>cp.`tpch/region.parquet` r 
> on n.n_regionkey = r.r_regionkey) 
>group by n_regionkey) 
> where n_regionkey = 2;
> {code}
> The current plan shows a filter (00-04) on top of the aggregation (00-05). 
> A better plan would have the filter pushed past the aggregation. 
> The root cause of this problem is that Drill's ruleset does not include 
> FilterAggregateTransposeRule from the Calcite library.
> {code}
> 00-01  Project(n_regionkey=[$0], cnt=[$1])
> 00-02Project(n_regionkey=[$0], cnt=[$1])
> 00-03  SelectionVectorRemover
> 00-04Filter(condition=[=($0, 2)])
> 00-05  StreamAgg(group=[{0}], cnt=[COUNT()])
> 00-06Project(n_regionkey=[$0])
> 00-07  MergeJoin(condition=[=($0, $1)], joinType=[left])
> 00-09SelectionVectorRemover
> 00-11  Sort(sort0=[$0], dir0=[ASC])
> 00-13Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]], 
> selectionRoot=classpath:/tpch/nation.parquet, numFiles=1, 
> columns=[`n_regionkey`]]])
> 00-08SelectionVectorRemover
> 00-10  Sort(sort0=[$0], dir0=[ASC])
> 00-12Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=classpath:/tpch/region.parquet]], 
> selectionRoot=classpath:/tpch/region.parquet, numFiles=1, 
> columns=[`r_regionkey`]]])
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [What is the purpose of ExternalSort's MAX_SORT_BYTES

2015-08-27 Thread Abdel Hakim Deneche
Thanks Steven

On Thu, Aug 27, 2015 at 10:19 AM, Steven Phillips  wrote:

> I think it probably isn't needed anymore. I believe it is a holdover from
> before spilling was implemented. It doesn't seem to serve any purpose now.
>
> On Thu, Aug 27, 2015 at 9:17 AM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> wrote:
>
> > anyone ?
> >
> > On Tue, Aug 25, 2015 at 2:56 PM, Abdel Hakim Deneche <
> > adene...@maprtech.com>
> > wrote:
> >
> > > When running a window function query on large datasets,
> > > increasing planner.memory.max_query_memory_per_node can actually help
> the
> > > query not run out of memory. But in some cases this can cause some
> issues
> > > (see DRILL-3555 )
> > >
> > > This seems to be caused by a hardcoded limit in ExternalSort called
> > > MAX_SORT_BYTES. What is the purpose of this limit ?


Re: [What is the purpose of ExternalSort's MAX_SORT_BYTES

2015-08-27 Thread Steven Phillips
I think it probably isn't needed anymore. I believe it is a holdover from
before spilling was implemented. It doesn't seem to serve any purpose now.

On Thu, Aug 27, 2015 at 9:17 AM, Abdel Hakim Deneche 
wrote:

> anyone ?
>
> On Tue, Aug 25, 2015 at 2:56 PM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> wrote:
>
> > When running a window function query on large datasets,
> > increasing planner.memory.max_query_memory_per_node can actually help the
> > query not run out of memory. But in some cases this can cause some issues
> > (see DRILL-3555 )
> >
> > This seems to be caused by a hardcoded limit in ExternalSort called
> > MAX_SORT_BYTES. What is the purpose of this limit ?


[jira] [Created] (DRILL-3716) Drill should push filter past aggregate in order to improve query performance.

2015-08-27 Thread Jinfeng Ni (JIRA)
Jinfeng Ni created DRILL-3716:
-

 Summary: Drill should push filter past aggregate in order to 
improve query performance.
 Key: DRILL-3716
 URL: https://issues.apache.org/jira/browse/DRILL-3716
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning & Optimization
Reporter: Jinfeng Ni
Assignee: Jinfeng Ni
 Fix For: 1.2.0


For a query that has a filter on top of an aggregation, Drill currently does 
not push the filter past the aggregation. As a result, we may miss some 
optimization opportunities. For instance, such a filter could potentially be 
pushed into the scan if it qualifies for partition pruning.

For the following query:

{code}
select n_regionkey, cnt from 
 (select n_regionkey, count(*) cnt 
  from (select n.n_nationkey, n.n_regionkey, n.n_name 
   from cp.`tpch/nation.parquet` n 
  left join 
   cp.`tpch/region.parquet` r 
on n.n_regionkey = r.r_regionkey) 
   group by n_regionkey) 
where n_regionkey = 2;
{code}

The current plan shows a filter (00-04) on top of the aggregation (00-05). A 
better plan would have the filter pushed past the aggregation. 

The root cause of this problem is that Drill's ruleset does not include 
FilterAggregateTransposeRule from the Calcite library.

{code}
00-01  Project(n_regionkey=[$0], cnt=[$1])
00-02Project(n_regionkey=[$0], cnt=[$1])
00-03  SelectionVectorRemover
00-04Filter(condition=[=($0, 2)])
00-05  StreamAgg(group=[{0}], cnt=[COUNT()])
00-06Project(n_regionkey=[$0])
00-07  MergeJoin(condition=[=($0, $1)], joinType=[left])
00-09SelectionVectorRemover
00-11  Sort(sort0=[$0], dir0=[ASC])
00-13Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=classpath:/tpch/nation.parquet]], 
selectionRoot=classpath:/tpch/nation.parquet, numFiles=1, 
columns=[`n_regionkey`]]])
00-08SelectionVectorRemover
00-10  Sort(sort0=[$0], dir0=[ASC])
00-12Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=classpath:/tpch/region.parquet]], 
selectionRoot=classpath:/tpch/region.parquet, numFiles=1, 
columns=[`r_regionkey`]]])
{code}
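
For reference, a sketch of applying the missing rule with a standalone 
Calcite HepPlanner (illustrative only; where Drill actually registers its 
rulesets may differ):

{code}
import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.FilterAggregateTransposeRule;

public class PushFilterPastAggregate {
  // Rewrites Filter(Aggregate(...)) into Aggregate(Filter(...)) when the
  // filter only references grouping keys (n_regionkey in the query above),
  // which is what lets the filter eventually reach the scan.
  public static RelNode optimize(RelNode root) {
    HepProgramBuilder program = new HepProgramBuilder();
    program.addRuleInstance(FilterAggregateTransposeRule.INSTANCE);
    HepPlanner planner = new HepPlanner(program.build());
    planner.setRoot(root);
    return planner.findBestExp();
  }
}
{code}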





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [What is the purpose of ExternalSort's MAX_SORT_BYTES

2015-08-27 Thread Abdel Hakim Deneche
anyone?

On Tue, Aug 25, 2015 at 2:56 PM, Abdel Hakim Deneche 
wrote:

> When running a window function query on large datasets,
> increasing planner.memory.max_query_memory_per_node can actually help the
> query not run out of memory. But in some cases this can cause some issues
> (see DRILL-3555 )
>
> This seems to be caused by a hardcoded limit in ExternalSort called
> MAX_SORT_BYTES. What is the purpose of this limit ?