Re: drill and hbase

2016-05-04 Thread Steven Phillips
No one has yet implemented an HBase writer in Drill. Without that, it is
not possible to write into an HBase table.

I don't know if anyone currently plans to work on this. If this something
you are interested in taking on, I can point you in the right direction.
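
In the meantime, a common workaround is to CTAS out of HBase into a
writable file-system workspace instead. A rough sketch (dfs.tmp and the
column aliases are illustrative; adjust to your storage configuration):

USE dfs.tmp;  -- any workspace with "writable": true

ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE events_part
PARTITION BY (`type`) AS
SELECT e.generic.user_id AS user_id,
       e.generic.`type`  AS `type`
FROM hbase.events AS e;

Note that the PARTITION BY column must appear in the SELECT list, and
HBase cell values are byte arrays, so you will likely need CONVERT_FROM
casts before writing.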

On Wed, May 4, 2016 at 6:36 AM, Plamen Paskov  wrote:

> Hi folks,
> I'm trying to use Apache Drill + HBase for the following scenario. I have
> to create an events analytics system, which is basically an API that will
> accept events and store them inside an HBase table; after that I have to
> run funnel queries over the data. I will need to support different types of
> events, each with its own subset of parameters. I generated a test table
> with 75M rows and imported it into HBase. Now I'm trying to partition the
> events table by event_type with this command:
>
> USE hbase;
>
> CREATE TABLE events_part (user_id, type, timestamp, browser,
> browser_version) PARTITION BY (type) AS SELECT e.generic.user_id,
> e.generic.type FROM events AS e;
>
> but i receive this error: *PARSE ERROR: Unable to create or drop
> tables/views. Schema [hbase] is immutable.*
>
> I read that for an HBase schema it's not possible to define workspaces and
> mark them as writable. How can I work around this situation? I need to
> partition the data by event type because I'm expecting a lot of information
> to be stored in the table and will query the data with *WHERE event_type =
> 'some_event_type'*.
>
> Thanks in advance !
>


median, quantile

2016-04-13 Thread Steven Phillips
I submitted a pull request a little while ago that introduces (approximate)
median and quantile functions using the tdigest library.

https://github.com/apache/drill/pull/456

It would be great if I could get some feedback on this. Specifically, is it
ok to call these functions median and quantile, given that they are not
exact?
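
For anyone who wants to try it out, usage would look roughly like this
(assuming the functions end up registered as median(col) and
quantile(fraction, col), which is exactly the naming question above;
cp.`employee.json` ships with Drill):

SELECT median(salary)         AS median_salary,
       quantile(0.95, salary) AS p95_salary
FROM cp.`employee.json`;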


[jira] [Created] (DRILL-4566) Add TDigest functions for computing median and quantile

2016-03-30 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-4566:
--

 Summary: Add TDigest functions for computing median and quantile
 Key: DRILL-4566
 URL: https://issues.apache.org/jira/browse/DRILL-4566
 Project: Apache Drill
  Issue Type: New Feature
Reporter: Steven Phillips
Assignee: Steven Phillips


The tdigest library can be used by Drill to compute approximate medians and 
percentiles without using too much memory or spilling to disk, which would 
otherwise be required to compute them exactly.
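
For reference, a rough sketch of the underlying library usage (the Drill
function wiring is omitted; this assumes the com.tdunning:t-digest artifact
and its TDigest.createDigest factory, and the compression value of 100 is
illustrative):

{code}
import com.tdunning.math.stats.TDigest;

public class TDigestSketch {
  public static void main(String[] args) {
    // compression controls the accuracy vs. size trade-off
    TDigest digest = TDigest.createDigest(100);
    for (int i = 0; i < 1_000_000; i++) {
      digest.add(Math.random());  // bounded memory, no spill to disk
    }
    System.out.println("approx median = " + digest.quantile(0.5));
    System.out.println("approx p95    = " + digest.quantile(0.95));
  }
}
{code}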



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-4562) NPE when evaluating expression on nested union type

2016-03-30 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-4562:
--

 Summary: NPE when evaluating expression on nested union type
 Key: DRILL-4562
 URL: https://issues.apache.org/jira/browse/DRILL-4562
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips
Assignee: Steven Phillips


A simple reproduction:
{code}
select typeof(t.a.b) c from `f.json` t
{code}
where f.json contains:
{code}
{a : { b : 1 }}
{a : { b: "hello" }}
{a : { b: { c : 2} }}
{code}
Fails with following:
{code}
(java.lang.NullPointerException) null

org.apache.drill.exec.vector.complex.FieldIdUtil.getFieldIdIfMatchesUnion():40
org.apache.drill.exec.vector.complex.FieldIdUtil.getFieldIdIfMatches():141
org.apache.drill.exec.vector.complex.FieldIdUtil.getFieldId():207
org.apache.drill.exec.record.SimpleVectorWrapper.getFieldIdIfMatches():101
org.apache.drill.exec.record.VectorContainer.getValueVectorId():269
org.apache.drill.exec.physical.impl.ScanBatch.getValueVectorId():325

org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.getValueVectorId():182

org.apache.drill.exec.expr.ExpressionTreeMaterializer$MaterializeVisitor.visitSchemaPath():628

org.apache.drill.exec.expr.ExpressionTreeMaterializer$MaterializeVisitor.visitSchemaPath():217
org.apache.drill.common.expression.SchemaPath.accept():152

org.apache.drill.exec.expr.ExpressionTreeMaterializer$MaterializeVisitor.visitFunctionCall():274

org.apache.drill.exec.expr.ExpressionTreeMaterializer$MaterializeVisitor.visitFunctionCall():217
org.apache.drill.common.expression.FunctionCall.accept():60
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Failure Behavior

2016-03-28 Thread Steven Phillips
If a fragment has already begun execution and sent some data to downstream
fragments, there is no way to simply restart the failed fragment, because
we would also have to restart any downstream fragments that consumed that
output, and so on up the tree, as well as restart any leaf fragments that
fed into any of those fragments. This is because we don't store
intermediate results to disk.

The case where I think it would even be possible would be if a node died
before sending any data downstream. But I think the only way to be sure of
this would be to poll all of the downstream fragments and verify that no
data from the failed fragment was ever received. I think this would add a
lot of complication and overhead to Drill.

On Sat, Mar 26, 2016 at 10:03 AM, John Omernik  wrote:

> Thanks for the responses. So, even if the drillbit that died wasn't the
> foreman, the query would fail? Interesting... Is there any mechanism for
> reassigning fragments, to *try harder* so to speak? Does this also apply
> if something on a node caused a fragment to fail: could it be tried
> somewhere else? I am not trying to recreate MapReduce in Drill (although I
> am sorta asking about similar features), but in a distributed environment,
> what is the cost to allow the foreman to time out a fragment and try again
> elsewhere? Say there was a heartbeat sent back from the bits running a
> fragment, and if the heartbeat and lack of results exceeded 10 seconds,
> have the foreman try again somewhere else (up to X times, configured by a
> setting). I am just curious, for my own knowledge, what makes that hard in
> a system like Drill.
>
> On Sat, Mar 26, 2016 at 10:47 AM, Abdel Hakim Deneche <
> adene...@maprtech.com
> > wrote:
>
> > The only way the query could succeed is if all fragments that were running
> > on the now-dead node had already finished. Other than that, the query fails.
> >
> > On Sat, Mar 26, 2016 at 4:45 PM, Neeraja Rentachintala <
> > nrentachint...@maprtech.com> wrote:
> >
> > > As far as I know, there is no failure handling in Drill. The query
> dies.
> > >
> > > On Sat, Mar 26, 2016 at 7:52 AM, John Omernik 
> wrote:
> > >
> > > > With distributed Drill, what is the expected/desired bit failure
> > > behavior.
> > > > I.e. if you are running, and certain fragments end up on a node with
> a
> > > bit
> > > > in a flaky state (or a bit that suddenly dies).  What is the desired
> > and
> > > > actual behavior of the query? I am guessing that if the bit was
> > foreman,
> > > > the query dies, I guess that's unavoidable, but if it's just a
> worker,
> > > does
> > > > the foreman detect this and reschedule the fragment or does the query
> > die
> > > > any way?
> > > >
> > > > John
> > > >
> > >
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   
> >
> >
> > Now Available - Free Hadoop On-Demand Training
> > <
> >
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> > >
> >
>


Re: Embedded Hazelcast for a distributed Drill cluster

2016-03-28 Thread Steven Phillips
We actually removed the concept of a distributed cache from Drill
altogether, so currently nothing is replacing Hazelcast.

The distributed cache was used for storing the initialization data for
intermediate fragments; only leaf fragments were sent via the RPC layer.
But the distributed cache added a lot of complexity and didn't provide much
benefit, so we decided to get rid of it and simply send all PlanFragments
via the RPC layer.

On Mon, Mar 28, 2016 at 12:56 PM, Pradeeban Kathiravelu <
kk.pradee...@gmail.com> wrote:

> Thanks Neeraja for your quick response.
>
> May I know what replaced Hazelcast in that case?
>
> I mean, how does Drill distributed mode currently offer the multicast and
> subnet scenarios, or are these scenarios not valid anymore?
>
> Regards,
> Pradeeban.
>
> On Mon, Mar 28, 2016 at 3:52 PM, Neeraja Rentachintala <
> nrentachint...@maprtech.com> wrote:
>
> > Its not currently used as far as I know.
> > Drill used this at some point, but we removed it due to issues in
> > multicast/subnet scenarios in Drill distributed mode.
> >
> > -Neeraja
> >
> > On Mon, Mar 28, 2016 at 12:49 PM, Pradeeban Kathiravelu <
> > kk.pradee...@gmail.com> wrote:
> >
> > > Hi,
> > > [1] states that Drill uses Hazelcast as an embedded distributed cache
> to
> > > distribute and store metadata and locality information.
> > >
> > > However, when I cloned the git repository and looked into the code, it
> > does
> > > not look like Hazelcast is used, except for some unused variable
> > > definitions and pom definitions.
> > >
> > > I also found a resolved bug report on Hazelcast cluster membership [2].
> > >
> > > May I know whether Hazelcast is currently used by Drill, and what does
> > > exactly Drill achieve by using it? Relevant pointers to existing
> > > discussions (if this was already discussed) or code location (if this
> was
> > > indeed implemented in Drill) are also appreciated.
> > >
> > > [1]
> > >
> > >
> >
> http://www.slideshare.net/Hadoop_Summit/understanding-the-value-and-architecture-of-apache-drill
> > > [2] https://issues.apache.org/jira/browse/DRILL-489
> > >
> > > Thank you.
> > > Regards,
> > > Pradeeban.
> > > --
> > > Pradeeban Kathiravelu.
> > > PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed
> Computing,
> > > INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa,
> > > Portugal.
> > > Biomedical Informatics Software Engineer, Emory University School of
> > > Medicine.
> > >
> > > Blog: [Llovizna] http://kkpradeeban.blogspot.com/
> > > LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03
> > >
> >
>
>
>
> --
> Pradeeban Kathiravelu.
> PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
> INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa,
> Portugal.
> Biomedical Informatics Software Engineer, Emory University School of
> Medicine.
>
> Blog: [Llovizna] http://kkpradeeban.blogspot.com/
> LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03
>


[DISCUSS] Remove required type

2016-03-21 Thread Steven Phillips
I have been thinking about this for a while now, and I feel it would be a
good idea to remove the Required vector types from Drill, and only use the
Nullable version of vectors. I think this will greatly simplify the code.
It will also simplify the creation of UDFs. As it stands, if a function has
custom null handling (i.e. INTERNAL), the function has to be implemented
separately for each permutation of input nullability (sketch below). But if
Drill data types are always nullable, this wouldn't be a problem.
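
To illustrate the duplication, here is a minimal sketch ("twice" is a
made-up function name, each class would live in its own source file, and
only one input is shown; the blow-up is worse with multiple inputs):

import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.IntHolder;
import org.apache.drill.exec.expr.holders.NullableIntHolder;

@FunctionTemplate(name = "twice",
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.INTERNAL)
public class TwiceRequiredInt implements DrillSimpleFunc {
  @Param IntHolder in;            // required input: no null check needed
  @Output NullableIntHolder out;

  public void setup() { }
  public void eval() {
    out.isSet = 1;
    out.value = in.value * 2;
  }
}

@FunctionTemplate(name = "twice",
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.INTERNAL)
public class TwiceNullableInt implements DrillSimpleFunc {
  @Param NullableIntHolder in;    // nullable input: must check isSet
  @Output NullableIntHolder out;

  public void setup() { }
  public void eval() {
    if (in.isSet == 0) {
      out.isSet = 0;              // the custom null handling
    } else {
      out.isSet = 1;
      out.value = in.value * 2;
    }
  }
}

If the required type goes away, only the second version is needed.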

I don't think there would be much impact on performance. In practice, I
think the required type is used very rarely. And there are other ways we
can optimize for when a column is known to have no nulls.

Thoughts?


[jira] [Created] (DRILL-4489) Add ValueVector tests from Drill

2016-03-08 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-4489:
--

 Summary: Add ValueVector tests from Drill
 Key: DRILL-4489
 URL: https://issues.apache.org/jira/browse/DRILL-4489
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips


There are some simple ValueVector tests that should be included in the Arrow 
project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Time for the 1.6 Release

2016-03-07 Thread Steven Phillips
DRILL-4486 is a pretty simple fix. Without it, currently some regex queries
will fail.

I think we should include it in the release.


https://github.com/apache/drill/pull/412

On Mon, Mar 7, 2016 at 2:15 PM, Jason Altekruse 
wrote:

> There is a small test issue with some of the refactoring that accompanied
> the operator unit tests. These don't change any user-facing behavior, so I
> don't think it really needs to get into the release. I will be working to
> merge them into master after we cut the release branch.
>
> The change to update the avatica JDBC driver version also does not make any
> behavior changes, so I think it also makes sense to keep it off the release
> branch.
>
> I will be merging the fix for 4375 the maven release profile, 4474 wrong
> results with incorrect creation of DirectScan and 4332 fixing a unit test
> to work in JDK 8, after another test run.
>
> On Mon, Mar 7, 2016 at 1:53 PM, Venki Korukanti  >
> wrote:
>
> > WebUI profile issue: this is a regression caused by the refactoring of the
> > Calcite integration code (DRILL-4465), which sets the text plan only if
> > debug is enabled. Will submit a patch soon.
> >
> > On Mon, Mar 7, 2016 at 1:29 PM, Sudheesh Katkam 
> > wrote:
> >
> > > Thanks for clarifying Jacques.
> > >
> > > I haven’t looked into the fix for DRILL-4384; I reopened it because the
> > > description mentioned “visualized plan” section is (also) empty.
> > >
> > > Thank you,
> > > Sudheesh
> > >
> > > > On Mar 7, 2016, at 1:08 PM, Jacques Nadeau 
> wrote:
> > > >
> > > > The new bug (currently filed under DRILL-4384) is a completely different
> > > > bug than the original (the original has to do with profile metrics, this
> > > > has to do with plan text). I'll try to look at it tonight if no one can
> > > > get to it sooner.
> > > >
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Mon, Mar 7, 2016 at 12:37 PM, Parth Chandra <
> pchan...@maprtech.com>
> > > > wrote:
> > > >
> > > >> DRILL-4384 is a blocker for the release though
> > > >>
> > > >> On Mon, Mar 7, 2016 at 12:01 PM, Sudheesh Katkam <
> > skat...@maprtech.com>
> > > >> wrote:
> > > >>
> > > >>> I reopened DRILL-4384 <
> > > https://issues.apache.org/jira/browse/DRILL-4384>
> > > >>> (blocker); it is assigned to Jacques.
> > > >>>
> > > >>> On the latest master, the visualized and physical plan tabs on web
> UI
> > > are
> > > >>> empty.
> > > >>>
> > > >>> Thank you,
> > > >>> Sudheesh
> > > >>>
> > >  On Mar 7, 2016, at 11:39 AM, Jason Altekruse <
> > > altekruseja...@gmail.com
> > > >>>
> > > >>> wrote:
> > > 
> > >  I don't know if there are any specific time constraints for getting
> > >  out the release, but I'm inclined to go with Vicky on DRILL-4477; at
> > >  least some investigation into the scope of a fix would be good. I think
> > >  it's a reasonably big problem whether it's a regression or not.
> > > 
> > >  On Mon, Mar 7, 2016 at 11:35 AM, Zelaine Fong  >
> > > >>> wrote:
> > > 
> > > > Hakim,
> > > >
> > > > Yes, we'll include this in the release.
> > > >
> > > > -- Zelaine
> > > >
> > > > On Mon, Mar 7, 2016 at 9:31 AM, Abdel Hakim Deneche <
> > > >>> adene...@maprtech.com
> > > >>
> > > > wrote:
> > > >
> > > >> If we still have time, I would like to include DRILL-4457 [1],
> > it's
> > > a
> > > > wrong
> > > >> results issue, I already have a fix and it's passing all tests,
> I
> > am
> > > >>> just
> > > >> waiting for a review [2]
> > > >>
> > > >>
> > > >> [1] https://issues.apache.org/jira/browse/DRILL-4457
> > > >> [2] https://github.com/apache/drill/pull/410
> > > >>
> > > >> On Mon, Mar 7, 2016 at 4:50 PM, Parth Chandra <
> par...@apache.org>
> > > >>> wrote:
> > > >>
> > > >>> Hi guys,
> > > >>>
> > > >>> I'm still waiting for the following to be reviewed/merged by
> > today.
> > > >>>
> > > >>> DRILL-4437 (and others)/pr 394 (Operator unit test framework).
> > > >> Waiting
> > > > to
> > > >>> be merged (Jason)
> > > >>>
> > > >>> DRILL-4372/pr 377(?) (Drill Operators and Functions should
> > > correctly
> > > >> expose
> > > >>> their types within Calcite.) - (Jinfeng to review)
> > > >>>
> > > >>> DRILL-4313/pr 396  (Improved client randomization. Update JIRA
> > with
> > > >>> warnings about using the feature ) (Hanifi/Sudheesh/Paul -
> patch
> > > >> reviewed.
> > > >>> No +1)
> > > >>>
> > > >>> DRILL-4375/pr 402 (Fix the maven release profile) - (Jason -
> > patch
> > > >>> reviewed. Ready to merge?)
> > > >>>
> > > >>> Thanks
> > > >>>
> > > >>> Parth
> > > >>>
> > > >>> On Sun, Mar 6, 2016 at 12:01 PM, Aditya <
> adityakish...@gmail.com
> > >
> > > > wrote:
> > 

[jira] [Created] (DRILL-4486) Expression serializer incorrectly serializes escaped characters

2016-03-07 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-4486:
--

 Summary: Expression serializer incorrectly serializes escaped 
characters
 Key: DRILL-4486
 URL: https://issues.apache.org/jira/browse/DRILL-4486
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips
Assignee: Steven Phillips


The Drill expression parser requires backslashes to be escaped, but 
ExpressionStringBuilder is not properly escaping them. This causes problems, 
especially for regex expressions run with parallel execution, where the 
serialized expression is re-parsed by other fragments.
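
An illustrative round trip (the expression text is hypothetical):

{code}
-- as written by the user:
regexp_replace(col, '\\d+', 'x')
-- ExpressionStringBuilder emits the backslash without re-escaping it:
regexp_replace(col, '\d+', 'x')
-- so when the serialized expression is re-parsed on another fragment,
-- the parser rejects (or misreads) the single backslash.
{code}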



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-4455) Depend on Apache Arrow for Vector and Memory

2016-02-29 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-4455:
--

 Summary: Depend on Apache Arrow for Vector and Memory
 Key: DRILL-4455
 URL: https://issues.apache.org/jira/browse/DRILL-4455
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips
Assignee: Steven Phillips
 Fix For: 1.7.0


The code for value vectors and memory has been split out and contributed to 
the Apache Arrow repository. In order to help that project advance, Drill 
should depend on the Arrow project instead of its internal value vector code.

This change will require recompiling any external code, such as UDFs and 
StoragePlugins. The changes will mainly just involve renaming the classes to 
the org.apache.arrow namespace.
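
An illustrative example of the rename (class names are assumed to carry over
unchanged):

{code}
// before:
import org.apache.drill.exec.memory.BufferAllocator;
import org.apache.drill.exec.vector.IntVector;

// after:
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.IntVector;
{code}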



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: expected behavior when using wild cards in table name

2016-02-11 Thread Steven Phillips
I don't understand why they wouldn't be allowed. They seem perfectly valid.

On Thu, Feb 11, 2016 at 9:42 AM, Abdel Hakim Deneche 
wrote:

> I have the following table tpch100/lineitem that contains 97 parquet files:
>
> tpch100/lineitem/part-m-00000.parquet
> tpch100/lineitem/part-m-00001.parquet
> tpch100/lineitem/part-m-00002.parquet
>
> ...
> tpch100/lineitem/part-m-00096.parquet
>
> I can run the following queries:
>
> SELECT COUNT(*) FROM `tpch100/lineit*`;
> SELECT COUNT(*) FROM `tpch100/lineitem/part-m-0001*`;
> SELECT COUNT(*) FROM `tpch100/lineitem/*`;
>
> The third query will fail if the table has metadata (it has to do with the
> .drill.parquet_metadata showing up at the top of the file system results)
>
> My question is: should the 2nd and 3rd queries be allowed, if we are
> querying a table folder that doesn't contain any subfolders?
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   
>
>
> Now Available - Free Hadoop On-Demand Training
> <
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> >
>


[jira] [Created] (DRILL-4382) Remove dependency on drill-logical from vector submodule

2016-02-10 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-4382:
--

 Summary: Remove dependency on drill-logical from vector submodule
 Key: DRILL-4382
 URL: https://issues.apache.org/jira/browse/DRILL-4382
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips
Assignee: Steven Phillips


This is in preparation for transitioning the code to the Apache Arrow project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Time for a 1.5 release?

2016-01-28 Thread Steven Phillips
I just wanted to bring up an issue that I just now discovered, that has
caused me a fair amount of grief.

https://github.com/apache/drill/pull/300/commits

DRILL-4198 changes a user-facing API, and causes StoragePlugins that were
compiled against currently released versions of Drill to no longer
function properly. I would prefer that this breaking change be modified
to be backward compatible if possible.

On Thu, Jan 28, 2016 at 11:23 AM, Jason Altekruse 
wrote:

> Hi Aman,
>
> This is the failure that he was seeing. He figured out that the new
> exclusions in jdbc-all were not being respected when the build was run with
> an older Maven version, causing the jar size to increase significantly. He
> added an enforcer to make sure the JAR didn't grow unexpectedly. Can you
> try to update your maven version and re-run the build?
>
> - Jason
>
> On Thu, Jan 28, 2016 at 11:18 AM, Aman Sinha  wrote:
>
> > Jacques, I am getting the following build failure on the latest master
> > branch...is this what you saw for the Apache build ?  My mvn version
> output
> > is shown below.  Should we all be upgrading to a newer mvn ?
> >
> >
> > [INFO] --- maven-enforcer-plugin:1.3.1:enforce
> > (enforce-jdbc-jar-compactness) @ drill-jdbc-all ---
> > [WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireFilesSize
> failed
> > with message:
> > The file drill-jdbc-all-1.5.0-SNAPSHOT.jar is outside the expected size
> > range.
> >
> >   This is likely due to you adding new dependencies to a
> > java-exec and not updating the excludes in this module. This is important
> > as it minimizes the size of the dependency of Drill application users.
> >
> >
> /Users/asinha/incubator-drill/exec/jdbc-all/target/drill-jdbc-all-1.5.0-SNAPSHOT.jar
> > size (44664121) too large. Max. is
> >
> >
> 2000/Users/asinha/incubator-drill/exec/jdbc-all/target/drill-jdbc-all-1.5.0-SNAPSHOT.jar
> >
> >
> > Administrators-MacBook-Pro-144:incubator-drill asinha$ mvn --version
> > Apache Maven 3.0.5 (r01de14724cdef164cd33c7c8c2fe155faf9602da; 2013-02-19
> > 05:51:28-0800)
> > Maven home: /opt/local/share/java/maven3
> > Java version: 1.7.0_45, vendor: Oracle Corporation
> > Java home:
> > /Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre
> > Default locale: en_US, platform encoding: UTF-8
> > OS name: "mac os x", version: "10.9.5", arch: "x86_64", family: "mac"
> >
> > On Thu, Jan 28, 2016 at 8:20 AM, Jacques Nadeau 
> > wrote:
> >
> > > Build back to normal. It looks like the Apache server was using an old
> > > version of Maven. Once I switched to something more recent, the build
> > > passed.
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Thu, Jan 28, 2016 at 7:02 AM, Jacques Nadeau 
> > > wrote:
> > >
> > > > Hmm... this merge caused the Apache build to fail. Investigating...
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Thu, Jan 28, 2016 at 6:31 AM, Jacques Nadeau 
> > > > wrote:
> > > >
> > > >> I got clean regression runs as well. I've merged the patch.
> > > >>
> > > >> Jason, you want to start the release process?
> > > >>
> > > >> --
> > > >> Jacques Nadeau
> > > >> CTO and Co-Founder, Dremio
> > > >>
> > > >> On Wed, Jan 27, 2016 at 10:42 PM, Abhishek Girish  >
> > > >> wrote:
> > > >>
> > > >>> Had two clean Functional runs. TPC-H SF100 was also successful.
> > > >>>
> > > >>> On Wed, Jan 27, 2016 at 10:07 PM, rahul challapalli <
> > > >>> challapallira...@gmail.com> wrote:
> > > >>>
> > > >>> > Kicked off a functional run with your branch. Will let you know
> > once
> > > it
> > > >>> > finishes
> > > >>> >
> > > >>> > - Rahul
> > > >>> >
> > > >>> > On Wed, Jan 27, 2016 at 9:56 PM, Jacques Nadeau <
> > jacq...@dremio.com>
> > > >>> > wrote:
> > > >>> >
> > > >>> > > 4196 was merged today. I have an updated patch for 4291 that is
> > > >>> ready.
> > > >>> > > Unfortunately, it seems that something isn't working with our
> > > >>> extended
> > > >>> > > tests so I haven't been able to run an extended regression.
> Unit
> > > >>> tests
> > > >>> > > pass. Is someone else possibly able to run a regression suite
> > > against
> > > >>> > this
> > > >>> > > branch [1] so we can confirm things look good and start the
> > release
> > > >>> > > process?
> > > >>> > >
> > > >>> > > thanks,
> > > >>> > > Jacques
> > > >>> > >
> > > >>> > > [1] https://github.com/jacques-n/drill/tree/DRILL-4291v2
> > > >>> > >
> > > >>> > > --
> > > >>> > > Jacques Nadeau
> > > >>> > > CTO and Co-Founder, Dremio
> > > >>> > >
> > > >>> > > On Mon, Jan 25, 2016 at 11:20 AM, Jacques Nadeau <
> > > jacq...@dremio.com
> > > >>> >
> > > >>> > > wrote:
> > > >>> > >
> > > >>> > > > I think the main things are 4196 and 4291 should be
> completed.
> > I
> > > >>> know
> > > >>> > > Amit
> > > >>> > > > was able to reproduce 

Re: Time for a 1.5 release?

2016-01-21 Thread Steven Phillips
I merged a patch yesterday that I believe addresses that issue. Can you see
if you still hit it?

On Thu, Jan 21, 2016 at 8:39 AM, Jacques Nadeau  wrote:

> Jinfeng, can you open a jira for the failing test if one isn't open?
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Wed, Jan 20, 2016 at 7:57 AM, Jinfeng Ni  wrote:
>
> > I still saw mvn full build failed on linux due to the following unit
> > test case sometime:
> >
> > Tests in error:
> >   TestTopNSchemaChanges.testMissingColumn:206 »  at position 0 column
> > '`kl1`' mi...
> >
> > Either we comment out that unit test, or we should fix the test case.
> > Otherwise, people may see maven build failure from the 1.5.0 release.
> >
> >
> >
> > On Tue, Jan 19, 2016 at 4:18 PM, Jacques Nadeau 
> > wrote:
> > > Bumping this thread...
> > >
> > > Here are the issues that were mentioned in this thread along with a
> > > proposed categorization:
> > >
> > > Release Blockers
> > > In-progress Amit https://issues.apache.org/jira/browse/DRILL-4190
> > > In-progress Amit https://issues.apache.org/jira/browse/DRILL-4196
> > > Ready to merge Jacques
> https://issues.apache.org/jira/browse/DRILL-4246
> > > In-review Jinfeng https://issues.apache.org/jira/browse/DRILL-4256
> > > In-progress Jacques https://issues.apache.org/jira/browse/DRILL-4278
> > > Ready to merge Laurent
> https://issues.apache.org/jira/browse/DRILL-4285
> > > Nice to Have
> > > Open Jason/Hakim https://issues.apache.org/jira/browse/DRILL-4247
> > > In-progress Jason https://issues.apache.org/jira/browse/DRILL-4203
> > > Open Jacques https://issues.apache.org/jira/browse/DRILL-4266
> > > Ready to merge Jacques
> https://issues.apache.org/jira/browse/DRILL-4131
> > >
> > > What do others think? Let's try to get the blockers wrapped up in the
> > next
> > > day or two and start a release vote...
> > >
> > >
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Mon, Jan 4, 2016 at 1:48 PM, Jason Altekruse <
> > altekruseja...@gmail.com>
> > > wrote:
> > >
> > >> Hello All,
> > >>
> > >> With the allocator changes merged and about a month since the last
> > release
> > >> I think it would be good to start a vote soon. I would like to
> > volunteer to
> > >> be release manager.
> > >>
> > >> I know that there were some issues that were identified after the
> > transfer
> > >> patch was merged. I think that these issues should be fixed before we
> > cut a
> > >> release candidate.
> > >>
> > >> From looking at the associated JIRAs it looked like there was a
> possible
> > >> short term fix just adjusting the max_query_memory_per_node option,
> and
> > >> some more involved work to change how we determine the correct time to
> > >> spill during external sort. I believe it makes sense to make external
> > sort
> > >> work well with the newly improved memory accounting before cutting a
> > >> release, but I'm not sure how much work is left to be done there. [1]
> > >>
> > >> Please respond with your thoughts on a release soon and any JIRAs you
> > would
> > >> like to include in the release.
> > >>
> > >> [1] - https://issues.apache.org/jira/browse/DRILL-4243
> > >>
> > >> Thanks,
> > >> Jason
> > >>
> >
>


Re: Out Of Memory Error (Possible Regression)

2015-12-30 Thread Steven Phillips
I didn't see any tests running out of memory. Which tests are you seeing
this with?

On Wed, Dec 30, 2015 at 1:37 PM, Abdel Hakim Deneche 
wrote:

> Steven,
>
> were you able to successfully run the regression tests on the transfer
> patch ? I just tried and saw several queries running out of memory !
>
> On Wed, Dec 30, 2015 at 11:46 AM, Abdel Hakim Deneche <
> adene...@maprtech.com
> > wrote:
>
> > Created DRILL-4236  to
> > keep track of this improvement.
> >
> > On Wed, Dec 30, 2015 at 11:01 AM, Jacques Nadeau 
> > wrote:
> >
> >> Since the accounting changed (more accurate), the termination condition
> >> for
> >> the sort operator will be different than before. In fact, this likely
> will
> >> be sooner since our accounting is much larger than previously (since we
> >> correctly consider the entire allocation rather than simply the used
> >> allocation).
> >>
> >> Hakim,
> >> Steven and I were discussing the need to update the ExternalSort
> operator
> >> to use the new allocator functionality to better manage its memory
> >> envelope. Would you be interested in working on this since you seem to
> be
> >> working with that code the most? Basically, it used to be that there was
> >> no
> >> way the sort operator would be able to correctly detect a memory
> condition
> >> and so it jumped through a bunch of hoops to try to figure out the
> >> termination condition.With the transfer accounting in place, this code
> can
> >> be greatly simplified to just use the current operator memory
> allocation.
> >>
> >> --
> >> Jacques Nadeau
> >> CTO and Co-Founder, Dremio
> >>
> >> On Wed, Dec 30, 2015 at 10:48 AM, rahul challapalli <
> >> challapallira...@gmail.com> wrote:
> >>
> >> > I installed the latest master and ran this query. So
> >> > planner.memory.max_query_memory_per_node should have been the default
> >> > value. I switched back to 1.4.0 branch and this query completed
> >> > successfully.
> >> >
> >> > On Wed, Dec 30, 2015 at 10:37 AM, Abdel Hakim Deneche <
> >> > adene...@maprtech.com
> >> > > wrote:
> >> >
> >> > > Rahul,
> >> > >
> >> > > How much memory was assigned to the sort operator (
> >> > > planner.memory.max_query_memory_per_node) ?
> >> > >
> >> > > On Wed, Dec 30, 2015 at 9:54 AM, rahul challapalli <
> >> > > challapallira...@gmail.com> wrote:
> >> > >
> >> > > > I am seeing an OOM error while executing a simple CTAS query. I
> >> raised
> >> > > > DRILL-4324 for this. The query mentioned in the JIRA used to
> >> complete
> >> > > > successfully without any issue prior to 1.5. Any idea what could
> >> have
> >> > > > caused the regression?
> >> > > >
> >> > > > - Rahul
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > >
> >> > > Abdelhakim Deneche
> >> > >
> >> > > Software Engineer
> >> > >
> >> > >   
> >> > >
> >> > >
> >> > > Now Available - Free Hadoop On-Demand Training
> >> > > <
> >> > >
> >> >
> >>
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> >> > > >
> >> > >
> >> >
> >>
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   
> >
> >
> > Now Available - Free Hadoop On-Demand Training
> > <
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> >
> >
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   
>
>
> Now Available - Free Hadoop On-Demand Training
> <
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> >
>


[jira] [Created] (DRILL-4215) Transfer ownership of buffers when doing transfers

2015-12-21 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-4215:
--

 Summary: Transfer ownership of buffers when doing transfers
 Key: DRILL-4215
 URL: https://issues.apache.org/jira/browse/DRILL-4215
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips
Assignee: Steven Phillips


The new allocator has the feature of allowing the transfer of ownership of 
buffers from one allocator to another. We should make use of this feature by 
transferring ownership whenever we transfer buffers between vectors. This will 
allow better tracking of how much memory operators are holding on to.
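
A conceptual sketch of the idea (the method names here are illustrative,
not the exact allocator API):

{code}
// Operator A hands a buffer to operator B. Instead of only swapping
// pointers, the memory accounting should move too:
DrillBuf buf = sourceVector.getBuffer();  // currently owned by allocatorA
buf.transferOwnership(allocatorB);        // hypothetical call: allocatorA's
                                          // accounted usage drops, and
                                          // allocatorB's usage grows
{code}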



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Unittest failure on master

2015-12-15 Thread Steven Phillips
To clarify, I was suggesting that the order in which the files are read
could be causing the variation in results, not that this is expected. There
definitely seems to be a bug. But the fact that it passes sometimes and not
others suggests the problem is exposed by file ordering.

On Tue, Dec 15, 2015 at 3:31 PM, Amit Hadke <amit.ha...@gmail.com> wrote:

> Jason,
>
> I misunderstood earlier why the unit test is failing. It has nothing to do
> with the ordering of files.
>
> What's happening is that I'm doing a topN operation, in descending order, on
> a field which is a union of strings and nulls.
> The test checks that string values are on top, but somehow for some people
> nulls are on top and the test fails.
>
> I suspect it has to do with how the comparator treats nulls - high/low.
>
> ~ Amit.
>
>
>
> On Tue, Dec 15, 2015 at 3:23 PM, Jason Altekruse <altekruseja...@gmail.com
> >
> wrote:
>
> > Amit,
> >
> > The message out of the test framework tries to provide enough information
> > to debug even if the issue isn't reproducible in your environment. Can
> you
> > think of any reason why it might be giving the different results shown in
> > the message if the order of the batches changed?
> >
> > If you need to change the order yourself there are two hacky approaches
> you
> > could do. Try changing the names or saving the files in a different order
> >  to make the FS give them back to you in a different order. You also
> could
> > just combine together the files and adjust the batch cutoff number used
> in
> > the json reader, with various ordering of the records in different
> versions
> > of the dataset.
> >
> > As I write this I realize that combining the files will change the
> behavior
> > of the read. with the first batch giving a single type and later ones
> > giving a union type. As opposed to the multiple files approach which
> would
> > produce a bunch of different individual types and make the sort operation
> > generate the union type. To test this properly we may just need a test
> > harness to produce batches explicitly and feed them into an operator,
> > rather than relying on the JSON reader.
> >
> > - Jason
> >
> > On Tue, Dec 15, 2015 at 2:31 PM, Amit Hadke <amit.ha...@gmail.com>
> wrote:
> >
> > > Hey Guys,
> > >
> > > I'm not able to reproduce the same issue, and the test doesn't seem to
> > > be doing anything.
> > >
> > > Can someone run "mvn -Dtest=TestTopNSchemaChanges#testMissingColumn
> test"
> > > and see if it fails?
> > >
> > > On Mon, Dec 14, 2015 at 11:51 PM, Amit Hadke <amit.ha...@gmail.com>
> > wrote:
> > >
> > > > This seems like a bug in the topn code rather than the test.
> > > > We are expecting results sorted by kl2 (descending) so that non-null
> > > > values come up on top.
> > > > Results seem to have nulls on top.
> > > >
> > > > ~ Amit.
> > > >
> > > > On Mon, Dec 14, 2015 at 11:27 PM, Jason Altekruse <
> > > > altekruseja...@gmail.com> wrote:
> > > >
> > > >> Seems weird that the results would be different based on reading
> > order,
> > > as
> > > >> the queries themselves contain an order by. Do we return different
> > types
> > > >> out of the sort depending on which schema we get first? Is this
> > > >> intentional?
> > > >>
> > > >> - Jason
> > > >>
> > > >> On Mon, Dec 14, 2015 at 6:06 PM, Steven Phillips <ste...@dremio.com
> >
> > > >> wrote:
> > > >>
> > > >> > I just did a build on a linux box, and didn't see this failure. My
> > guess
> > > is
> > > >> > that it fails depending on which order the files are read.
> > > >> >
> > > >> > On Mon, Dec 14, 2015 at 5:38 PM, Venki Korukanti <
> > > >> > venki.koruka...@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > > Is anyone else seeing below failure on latest master? I am
> running
> > > it
> > > >> on
> > > >> > > Linux.
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> testMissingColumn(org.apache.drill.exec.physical.impl.TopN.TestTopNSchemaChanges)
> > > >> > >  Time elapsed: 2.537 sec  <<< ERROR!
> >

Re: Unittest failure on master

2015-12-14 Thread Steven Phillips
I just did a build on a linux box, and didn't see this failure. My guess is
that it fails depending on which order the files are read.

On Mon, Dec 14, 2015 at 5:38 PM, Venki Korukanti 
wrote:

> Is anyone else seeing below failure on latest master? I am running it on
> Linux.
>
>
> testMissingColumn(org.apache.drill.exec.physical.impl.TopN.TestTopNSchemaChanges)
>  Time elapsed: 2.537 sec  <<< ERROR!
> java.lang.Exception: unexpected null at position 0 column '`vl2`' should
> have been:  299
>
> Expected Records near verification failure:
> Record Number: 0 { `kl1` : null,`kl2` : 299,`vl2` : 299,`vl1` : null,`vl` :
> null,`kl` : null, }
> Record Number: 1 { `kl1` : null,`kl2` : 298,`vl2` : 298,`vl1` : null,`vl` :
> null,`kl` : null, }
> Record Number: 2 { `kl1` : null,`kl2` : 297,`vl2` : 297,`vl1` : null,`vl` :
> null,`kl` : null, }
>
>
> Actual Records near verification failure:
> Record Number: 0 { `kl1` : null,`vl2` : null,`kl2` : null,`vl1` : null,`vl`
> : 100.0,`kl` : 100.0, }
> Record Number: 1 { `kl1` : null,`vl2` : null,`kl2` : null,`vl1` : null,`vl`
> : 101.0,`kl` : 101.0, }
> Record Number: 2 { `kl1` : null,`vl2` : null,`kl2` : null,`vl1` : null,`vl`
> : 102.0,`kl` : 102.0, }
>
> For query: select kl, vl, kl1, vl1, kl2, vl2 from
>
> dfs_test.`/root/drill/exec/java-exec/target/1450142361702-0/topn-schemachanges`
> order by kl2 desc limit 3
> at
>
> org.apache.drill.DrillTestWrapper.compareValuesErrorOnMismatch(DrillTestWrapper.java:512)
> at
>
> org.apache.drill.DrillTestWrapper.compareMergedVectors(DrillTestWrapper.java:170)
> at
>
> org.apache.drill.DrillTestWrapper.compareMergedOnHeapVectors(DrillTestWrapper.java:397)
> at
>
> org.apache.drill.DrillTestWrapper.compareOrderedResults(DrillTestWrapper.java:352)
> at org.apache.drill.DrillTestWrapper.run(DrillTestWrapper.java:124)
> at org.apache.drill.TestBuilder.go(TestBuilder.java:129)
> at
>
> org.apache.drill.exec.physical.impl.TopN.TestTopNSchemaChanges.testMissingColumn(TestTopNSchemaChanges.java:206)
>
>
> Results :
>
> Tests in error:
>   TestTopNSchemaChanges.testMissingColumn:206 »  unexpected null at
> position 0 c...
>
> Tests run: 4, Failures: 0, Errors: 1, Skipped: 0
>


Re: [VOTE] Release Apache Drill 1.4.0 RC1

2015-12-10 Thread Steven Phillips
+1 (binding)

Downloaded tarballs.

Verified checksums, verified build

On Thu, Dec 10, 2015 at 2:29 PM, Jinfeng Ni  wrote:

> +1 (binding)
>
> Downloaded src tarball and build from source.
> Start drillbit in standalone mode.
> Run all the queries in yelp tutorial from drill doc.
> Run couple of tpcds queries against scale factor 1 sample dataset.
> Test query cancel.
> Run couple of tpch queries through WebUI.
>
> LGTM.
>
>
>
> On Wed, Dec 9, 2015 at 10:35 PM, Parth Chandra  wrote:
> > LGTM. +1 (binding)
> >
> > Downloaded src, validated checksums.
> > Built from src tarball.
> > Built C++ client.
> > Tested multiple simultaneous simple queries using C++ query submitter
> using
> > both sync/async APIs. Tested cancel.
> >
> >
> >
> > On Tue, Dec 8, 2015 at 7:54 AM, Venki Korukanti <
> venki.koruka...@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> I'd like to propose the second release candidate of Apache Drill,
> version
> >> 1.4.0. It covers a total of 32 resolved JIRAs [1]. Fix for MergeJoin
> issue
> >> (DRILL-4165) found in RC0 is also included in RC1. Thanks to everyone
> who
> >> contributed to this release.
> >>
> >> The tarball artifacts are hosted at [2] and the maven artifacts are
> hosted
> >> at
> >> [3]. This release candidate is based on commit
> >> 32b871b24c7b69f59a1d2e70f444eed6e599e825 located at [4].
> >>
> >> The vote will be open for the next 72 hours ending at 8AM Pacific,
> December
> >> 11, 2015.
> >>
> >> [ ] +1
> >> [ ] +0
> >> [ ] -1
> >>
> >> Thanks
> >> Venki
> >>
> >> [1] *
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12332947&projectId=12313820
> >> <
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12332947&projectId=12313820
> >> >*
> >> [2] http://people.apache.org/~venki/apache-drill-1.4.0.rc1
> >> [3] *
> >> https://repository.apache.org/content/repositories/orgapachedrill-1019/
> >> <
> https://repository.apache.org/content/repositories/orgapachedrill-1019/>*
> >> [4] https://github.com/vkorukanti/drill/tree/1.4.0
> >>
>


Re: [VOTE] Release Apache Drill 1.4.0 RC0

2015-12-07 Thread Steven Phillips
-1

The bug that Aman found is serious and should be fixed. Producing batches of
more than 64K records could lead to wrong results, since parts of Drill (for
example, two-byte selection vectors) assume a batch never exceeds 65,536
records.

On Mon, Dec 7, 2015 at 2:26 PM, Abdel Hakim Deneche 
wrote:

> Although I got a clean run on my linux VM, I am seeing the following error
> on my Macbook consistently:
>
> [ERROR] Failed to execute goal
> org.codehaus.mojo:sql-maven-plugin:1.5:execute (create-tables) on project
> drill-jdbc-storage: Communications link failure
> Here are more details about the error:
>
> https://gist.github.com/adeneche/93bd0451538071703e2a
>
> Anyone else seeing this ?
>
>
> On Mon, Dec 7, 2015 at 8:29 AM, Aman Sinha  wrote:
>
> > +1  (binding)
> >
> > -Downloaded source and built on Mac
> > -Ran unit tests successfully
> > -Ran several manual tests:
> >   - inner-join test with merge join. Found a bug..filed DRILL-4165.  It
> > does not seem to be a regression but clearly needs
> > to be fixed soon.
> > -Ran manual tests against parquet partitioned data with and without
> > metadata cache
> > -Examined Explain plans for a few queries with partition filters on
> BigInt
> > columns, verified partition pruning was working
> > -Examined query profiles in Web UI
> >
> > On Fri, Dec 4, 2015 at 9:41 PM, Venki Korukanti <
> venki.koruka...@gmail.com
> > >
> > wrote:
> >
> > > Hi,
> > >
> > > I'd like to propose the first release candidate of Apache Drill,
> version
> > > 1.4.0. It covers a total of 31 resolved JIRAs [1]. Thanks to everyone
> who
> > > contributed to this release.
> > >
> > > The tarball artifacts are hosted at [2] and the maven artifacts are
> > hosted
> > > at
> > > [3]. This release candidate is based on commit
> > > 5aace39b282c7ac34366d650cb91d555ef23c64b located at [4].
> > >
> > > The vote will be open for the next 72 hours ending at 10PM Pacific,
> > > December 7, 2015.
> > >
> > > [ ] +1
> > > [ ] +0
> > > [ ] -1
> > >
> > > Thanks
> > > Venki
> > >
> > > [1]
> > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12332947&projectId=12313820
> > > [2] http://people.apache.org/~venki/apache-drill-1.4.0.rc0/
> > > [3]
> > >
> https://repository.apache.org/content/repositories/orgapachedrill-1018/
> > > [4] https://github.com/vkorukanti/drill/tree/1.4.0
> > >
> >
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   
>
>
> Now Available - Free Hadoop On-Demand Training
> <
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> >
>


Re: Create 1.4 Release branch soon?

2015-12-03 Thread Steven Phillips
I just ran the tests on a linux machine, and did not see this failure. Do
you see it consistently?

On Wed, Dec 2, 2015 at 10:20 PM, Jinfeng Ni <jinfengn...@gmail.com> wrote:

> > I ran a full mvn build on a linux box against the latest master branch.
> > There was one unit test failure. However, on mac, it's successful. Has
> > anyone experienced the same?
>
> Failed tests:
>   TestCsvHeader.testCsvHeaderMismatch:151->validateResults:196 Result
> mismatch.
> Expected:
> Year|Model|Category
> 1999|Venture "Extended Edition"|
> 1999|Venture "Extended Edition, Very Large"|
> Year|Model|Category
> 1999||Venture "Extended Edition"
> 1999||Venture "Extended Edition, Very Large"
>
> Received:
> Year|Model|Category
> 1999||Venture "Extended Edition"
> 1999||Venture "Extended Edition, Very Large"
> Year|Model|Category
> 1999|Venture "Extended Edition"|
> 1999|Venture "Extended Edition, Very Large"|
>  expected:<...Model|Category
> 1999|[Venture "Extended Edition"|
> 1999|Venture "Extended Edition, Very Large"|
> Year|Model|Category
> 1999||Venture "Extended Edition"
> 1999||Venture "Extended Edition, Very Large"]
> > but was:<...Model|Category
> 1999|[|Venture "Extended Edition"
> 1999||Venture "Extended Edition, Very Large"
> Year|Model|Category
> 1999|Venture "Extended Edition"|
> 1999|Venture "Extended Edition, Very Large"|]
> >
>
> Tests run: 1505, Failures: 1, Errors: 0, Skipped: 121
>
> git log
> commit 3ae3bf5e127b4384c7b91d797d36ea4d51a058ae
>
>
> On Wed, Dec 2, 2015 at 7:21 PM, Jacques Nadeau <jacq...@dremio.com> wrote:
> > I think we should roll forward to 1.5-S..
> > On Dec 2, 2015 6:44 PM, "Venki Korukanti" <venki.koruka...@gmail.com>
> wrote:
> >
> >> 1.4.0 branch is cut and available here:
> >> https://github.com/vkorukanti/drill/tree/1.4.0.
> >>
> >> Should I move the master to 1.5.0-SNAPSHOT now or wait until an RC is
> >> passed?
> >>
> >> Thanks
> >> Venki
> >>
> >> On Wed, Dec 2, 2015 at 4:25 PM, Venki Korukanti <
> venki.koruka...@gmail.com
> >> >
> >> wrote:
> >>
> >> > For DRILL-4109 and DRILL-4125, Vicky is not available today to
> verify. If
> >> > the changes are reviewed lets merge them today. Once the branch is cut
> >> > today, MapR will do the release sanity for next couple of days before
> RC0
> >> > voting goes out. If any issues are found, we still have time to fix
> >> before
> >> > the RC0 voting.
> >> >
> >> > Thanks
> >> > Venki
> >> >
> >> > On Wed, Dec 2, 2015 at 4:02 PM, Jacques Nadeau <jacq...@dremio.com>
> >> wrote:
> >> >
> >> >> Sounds good.
> >> >>
> >> >> --
> >> >> Jacques Nadeau
> >> >> CTO and Co-Founder, Dremio
> >> >>
> >> >> On Wed, Dec 2, 2015 at 2:47 PM, Steven Phillips <ste...@dremio.com>
> >> >> wrote:
> >> >>
> >> >> > Okay, I'm going to go ahead and merge DRILL-4145
> >> >> >
> >> >> > On Wed, Dec 2, 2015 at 2:45 PM, Venki Korukanti <
> >> >> venki.koruka...@gmail.com
> >> >> > >
> >> >> > wrote:
> >> >> >
> >> >> > > For DRILL-4145: ran the regression suite which includes customer
> and
> >> >> > > extended tests. No regressions found.
> >> >> > >
> >> >> > > On Wed, Dec 2, 2015 at 1:41 PM, Venki Korukanti <
> >> >> > venki.koruka...@gmail.com
> >> >> > > >
> >> >> > > wrote:
> >> >> > >
> >> >> > > > Sure, I will trigger a regression run with DRILL-4145.
> >> >> > > >
> >> >> > > > Thanks
> >> >> > > > Venki
> >> >> > > >
> >> >> > > > On Wed, Dec 2, 2015 at 1:36 PM, Jacques Nadeau <
> >> jacq...@dremio.com>
> >> >> > > wrote:
> >> >> > > >
> >> >> > > >> I believe 4109 is ready but if I understand correctly, it
> can't
> >> go
> >> >> in
> >> >> > > >> without 4125. Amit said he needs another hour for that.
> >

Re: Create 1.4 Release branch soon?

2015-12-03 Thread Steven Phillips
I just pushed a fix to the test that I am pretty confident resolves the
problem that Jinfeng was hitting. Jinfeng, can you confirm this fixes
your problem?

On Thu, Dec 3, 2015 at 11:52 AM, Venki Korukanti <venki.koruka...@gmail.com>
wrote:

> As the changes are only in test, it should be ok if we get the fix after
> lunch. Currently the branch is going through regression testing.
>
> On Thu, Dec 3, 2015 at 9:47 AM, Steven Phillips <ste...@dremio.com> wrote:
>
> > Okay, after looking at it more closely, it looks like it's an ordering
> > problem. We should rewrite the tests using the test framework, and be
> sure
> > to set the validation as unordered.
> >
> > I can take care of this, but won't get to it until after lunch. If
> someone
> > else wants to do it now in order to get an RC out sooner, feel free.
> >
> > On Thu, Dec 3, 2015 at 8:39 AM, Jinfeng Ni <jinfengn...@gmail.com>
> wrote:
> >
> > > I switched to a different linux box and hit the same error. So it seems
> > > consistent from what I tried.
> > >
> > >
> > > On Thu, Dec 3, 2015 at 7:58 AM, Jinfeng Ni <jinfengn...@gmail.com>
> > wrote:
> > > > I ran it twice and hit the same error.
> > > >
> > > >
> > > > On Thu, Dec 3, 2015 at 12:10 AM, Steven Phillips <ste...@dremio.com>
> > > wrote:
> > > >> I just ran the tests on a linux machine, and did not see this
> failure.
> > > Do
> > > >> you see it consistently?
> > > >>
> > > >> On Wed, Dec 2, 2015 at 10:20 PM, Jinfeng Ni <jinfengn...@gmail.com>
> > > wrote:
> > > >>
> > > >>> I ran a full mvn build on a linux box against the latest master branch.
> > > >>> There was one unit test failure. However, on mac, it's successful. Has
> > > >>> anyone experienced the same?
> > > >>>
> > > >>> Failed tests:
> > > >>>   TestCsvHeader.testCsvHeaderMismatch:151->validateResults:196
> Result
> > > >>> mismatch.
> > > >>> Expected:
> > > >>> Year|Model|Category
> > > >>> 1999|Venture "Extended Edition"|
> > > >>> 1999|Venture "Extended Edition, Very Large"|
> > > >>> Year|Model|Category
> > > >>> 1999||Venture "Extended Edition"
> > > >>> 1999||Venture "Extended Edition, Very Large"
> > > >>>
> > > >>> Received:
> > > >>> Year|Model|Category
> > > >>> 1999||Venture "Extended Edition"
> > > >>> 1999||Venture "Extended Edition, Very Large"
> > > >>> Year|Model|Category
> > > >>> 1999|Venture "Extended Edition"|
> > > >>> 1999|Venture "Extended Edition, Very Large"|
> > > >>>  expected:<...Model|Category
> > > >>> 1999|[Venture "Extended Edition"|
> > > >>> 1999|Venture "Extended Edition, Very Large"|
> > > >>> Year|Model|Category
> > > >>> 1999||Venture "Extended Edition"
> > > >>> 1999||Venture "Extended Edition, Very Large"]
> > > >>> > but was:<...Model|Category
> > > >>> 1999|[|Venture "Extended Edition"
> > > >>> 1999||Venture "Extended Edition, Very Large"
> > > >>> Year|Model|Category
> > > >>> 1999|Venture "Extended Edition"|
> > > >>> 1999|Venture "Extended Edition, Very Large"|]
> > > >>> >
> > > >>>
> > > >>> Tests run: 1505, Failures: 1, Errors: 0, Skipped: 121
> > > >>>
> > > >>> git log
> > > >>> commit 3ae3bf5e127b4384c7b91d797d36ea4d51a058ae
> > > >>>
> > > >>>
> > > >>> On Wed, Dec 2, 2015 at 7:21 PM, Jacques Nadeau <jacq...@dremio.com
> >
> > > wrote:
> > > >>> > I think we should roll forward to 1.5-S..
> > > >>> > On Dec 2, 2015 6:44 PM, "Venki Korukanti" <
> > venki.koruka...@gmail.com
> > > >
> > > >>> wrote:
> > > >>> >
> > > >>> >> 1.4.0 branch is cut and available here:
> > > >>> >> https://github.com/vkorukanti/drill/tree/1.4.0.

[jira] [Created] (DRILL-4159) TestCsvHeader sometimes fails due to ordering issue

2015-12-03 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-4159:
--

 Summary: TestCsvHeader sometimes fails due to ordering issue
 Key: DRILL-4159
 URL: https://issues.apache.org/jira/browse/DRILL-4159
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips
Assignee: Steven Phillips


This test should be rewritten to use the query test framework, rather than 
doing a string comparison of the entire result set. And it should be specified 
as unordered, so that results aren't affected by the random order in which 
files are read.
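
A sketch of the suggested rewrite using the test framework (the query, file
path, and baseline values below are illustrative):

{code}
testBuilder()
    .sqlQuery("select Year, Model, Category from dfs.`csv/cars_with_header.csvh`")
    .unOrdered()  // don't depend on the order in which files are read
    .baselineColumns("Year", "Model", "Category")
    .baselineValues("1999", "Venture \"Extended Edition\"", "")
    .baselineValues("1999", "Venture \"Extended Edition, Very Large\"", "")
    .go();
{code}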



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Create 1.4 Release branch soon?

2015-12-03 Thread Steven Phillips
Okay, after looking at it more closely, it looks like it's an ordering
problem. We should rewrite the tests using the test framework, and be sure
to set the validation as unordered.

I can take care of this, but won't get to it until after lunch. If someone
else wants to do it now in order to get an RC out sooner, feel free.

On Thu, Dec 3, 2015 at 8:39 AM, Jinfeng Ni <jinfengn...@gmail.com> wrote:

> I switched to a different linux box and hit the same error. So it seems
> consistent from what I tried.
>
>
> On Thu, Dec 3, 2015 at 7:58 AM, Jinfeng Ni <jinfengn...@gmail.com> wrote:
> > I ran it twice and hit the same error.
> >
> >
> > On Thu, Dec 3, 2015 at 12:10 AM, Steven Phillips <ste...@dremio.com>
> wrote:
> >> I just ran the tests on a linux machine, and did not see this failure.
> Do
> >> you see it consistently?
> >>
> >> On Wed, Dec 2, 2015 at 10:20 PM, Jinfeng Ni <jinfengn...@gmail.com>
> wrote:
> >>
> >>> I ran a full mvn build on a linux box against the latest master branch.
> >>> There was one unit test failure. However, on mac, it's successful. Has
> >>> anyone experienced the same?
> >>>
> >>> Failed tests:
> >>>   TestCsvHeader.testCsvHeaderMismatch:151->validateResults:196 Result
> >>> mismatch.
> >>> Expected:
> >>> Year|Model|Category
> >>> 1999|Venture "Extended Edition"|
> >>> 1999|Venture "Extended Edition, Very Large"|
> >>> Year|Model|Category
> >>> 1999||Venture "Extended Edition"
> >>> 1999||Venture "Extended Edition, Very Large"
> >>>
> >>> Received:
> >>> Year|Model|Category
> >>> 1999||Venture "Extended Edition"
> >>> 1999||Venture "Extended Edition, Very Large"
> >>> Year|Model|Category
> >>> 1999|Venture "Extended Edition"|
> >>> 1999|Venture "Extended Edition, Very Large"|
> >>>  expected:<...Model|Category
> >>> 1999|[Venture "Extended Edition"|
> >>> 1999|Venture "Extended Edition, Very Large"|
> >>> Year|Model|Category
> >>> 1999||Venture "Extended Edition"
> >>> 1999||Venture "Extended Edition, Very Large"]
> >>> > but was:<...Model|Category
> >>> 1999|[|Venture "Extended Edition"
> >>> 1999||Venture "Extended Edition, Very Large"
> >>> Year|Model|Category
> >>> 1999|Venture "Extended Edition"|
> >>> 1999|Venture "Extended Edition, Very Large"|]
> >>> >
> >>>
> >>> Tests run: 1505, Failures: 1, Errors: 0, Skipped: 121
> >>>
> >>> git log
> >>> commit 3ae3bf5e127b4384c7b91d797d36ea4d51a058ae
> >>>
> >>>
> >>> On Wed, Dec 2, 2015 at 7:21 PM, Jacques Nadeau <jacq...@dremio.com>
> wrote:
> >>> > I think we should roll forward to 1.5-S..
> >>> > On Dec 2, 2015 6:44 PM, "Venki Korukanti" <venki.koruka...@gmail.com
> >
> >>> wrote:
> >>> >
> >>> >> 1.4.0 branch is cut and available here:
> >>> >> https://github.com/vkorukanti/drill/tree/1.4.0.
> >>> >>
> >>> >> Should I move the master to 1.5.0-SNAPSHOT now or wait until an RC
> is
> >>> >> passed?
> >>> >>
> >>> >> Thanks
> >>> >> Venki
> >>> >>
> >>> >> On Wed, Dec 2, 2015 at 4:25 PM, Venki Korukanti <
> >>> venki.koruka...@gmail.com
> >>> >> >
> >>> >> wrote:
> >>> >>
> >>> >> > For DRILL-4109 and DRILL-4125, Vicky is not available today to
> >>> verify. If
> >>> >> > the changes are reviewed lets merge them today. Once the branch
> is cut
> >>> >> > today, MapR will do the release sanity for next couple of days
> before
> >>> RC0
> >>> >> > voting goes out. If any issues are found, we still have time to
> fix
> >>> >> before
> >>> >> > the RC0 voting.
> >>> >> >
> >>> >> > Thanks
> >>> >> > Venki
> >>> >> >
> >>> >> > On Wed, Dec 2, 2015 at 4:02 PM, Jacques Nadeau <
> jacq...@dremio.com>
> >>> >> wrote:
> >>>

Re: Create 1.4 Release branch soon?

2015-12-02 Thread Steven Phillips
Okay, I'm going to go ahead and merge DRILL-4145

On Wed, Dec 2, 2015 at 2:45 PM, Venki Korukanti 
wrote:

> For DRILL-4145: ran the regression suite which includes customer and
> extended tests. No regressions found.
>
> On Wed, Dec 2, 2015 at 1:41 PM, Venki Korukanti  >
> wrote:
>
> > Sure, I will trigger a regression run with DRILL-4145.
> >
> > Thanks
> > Venki
> >
> > On Wed, Dec 2, 2015 at 1:36 PM, Jacques Nadeau 
> wrote:
> >
> >> I believe 4109 is ready but if I understand correctly, it can't go in
> >> without 4125. Amit said he needs another hour for that.
> >>
> >> On 4145: Venki, can someone on your side running any extended/customer
> >> tests you have against this to make sure that it doesn't cause any
> >> regressions. We've run on our side without issue but I'd like to get as
> >> broad of coverage as possible to ensure no issues.
> >>
> >> --
> >> Jacques Nadeau
> >> CTO and Co-Founder, Dremio
> >>
> >> On Wed, Dec 2, 2015 at 11:52 AM, Venki Korukanti <
> >> venki.koruka...@gmail.com>
> >> wrote:
> >>
> >> > Sure, we can include DRILL-4124. I will merge it soon. Any idea when
> the
> >> > following JIRAs are going to be ready to commit?
> >> >
> >> > DRILL-4109 (in review)
> >> > DRILL-4125
> >> > DRILL-4145 (in review)
> >> >
> >> > Jinfeng and I discussed about including metastore caching and decided
> to
> >> > move it to 1.5.
> >> >
> >> > Lets make 5pm as the cutoff time for 1.4 branch.
> >> >
> >> > Thanks
> >> > Venki
> >> >
> >> > On Wed, Dec 2, 2015 at 11:42 AM, Julien Le Dem 
> >> wrote:
> >> >
> >> > > DRILL-4124 is ready to merge:
> >> https://github.com/apache/drill/pull/281
> >> > > It is a small change.
> >> > >
> >> > > On Wed, Dec 2, 2015 at 7:58 AM, Venki Korukanti <
> >> > venki.koruka...@gmail.com
> >> > > >
> >> > > wrote:
> >> > >
> >> > > > I posted a comment on the commit DRILL-4126 (for some reason no
> >> emails
> >> > > are
> >> > > > sent to dev list, may be because of the multiple commits in the
> pull
> >> > > > request).
> >> > > >
> >> > > > On Wed, Dec 2, 2015 at 7:53 AM, Jinfeng Ni  >
> >> > > wrote:
> >> > > >
> >> > > > > If the window is still open, I would like to have DRILL-4126,
> >> > > > > DRILL-4127 merged as well.
> >> > > > >
> >> > > > > Venki, could you please review the revised patch?
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Wed, Dec 2, 2015 at 7:49 AM, Venki Korukanti
> >> > > > >  wrote:
> >> > > > > > Remaining JIRAs:
> >> > > > > >
> >> > > > > > DRILL-4109 (in review)
> >> > > > > > DRILL-4125
> >> > > > > > DRILL-4145 (in review)
> >> > > > > >
> >> > > > > >
> >> > > > > > On Tue, Dec 1, 2015 at 10:45 PM, Venki Korukanti <
> >> > > > > venki.koruka...@gmail.com>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > >> Sure, will merge DRILL-4111.
> >> > > > > >>
> >> > > > > >> On Tue, Dec 1, 2015 at 10:43 PM, Sudheesh Katkam <
> >> > > > skat...@maprtech.com>
> >> > > > > >> wrote:
> >> > > > > >>
> >> > > > > >>> Venki, can you please commit DRILL-4111?
> >> > > > > >>>
> >> > > > > >>> Thank you,
> >> > > > > >>> Sudheesh
> >> > > > > >>>
> >> > > > > >>> > On Dec 1, 2015, at 6:22 PM, Jacques Nadeau <
> >> jacq...@dremio.com
> >> > >
> >> > > > > wrote:
> >> > > > > >>> >
> >> > > > > >>> > It seems like we should also try to include:
> >> > > > > >>> >
> >> > > > > >>> > DRILL-4109
> >> > > > > >>> > DRILL-4125
> >> > > > > >>> > DRILL-4111
> >> > > > > >>> >
> >> > > > > >>> > 4111 is ready to merge if someone wants to merge it. It
> >> looks
> >> > > like
> >> > > > > >>> Sudheesh
> >> > > > > >>> > just needs to commit.
> >> > > > > >>> >
> >> > > > > >>> > For 4125/4109, it seems like we should get Vicki's
> feedback
> >> if
> >> > > > those
> >> > > > > >>> > changes are good for merge or if we should punt.
> >> > > > > >>> >
> >> > > > > >>> >
> >> > > > > >>> >
> >> > > > > >>> >
> >> > > > > >>> > --
> >> > > > > >>> > Jacques Nadeau
> >> > > > > >>> > CTO and Co-Founder, Dremio
> >> > > > > >>> >
> >> > > > > >>> > On Tue, Dec 1, 2015 at 3:59 PM, Venki Korukanti <
> >> > > > > >>> venki.koruka...@gmail.com>
> >> > > > > >>> > wrote:
> >> > > > > >>> >
> >> > > > > >>> >> I can manage the release.
> >> > > > > >>> >>
> >> > > > > >>> >> Here are the pending JIRAs so far:
> >> > > > > >>> >>
> >> > > > > >>> >> 1) DRILL-4053
> >> > > > > >>> >>
> >> > > > > >>> >> Let me know if you have any pending patches and want to
> get
> >> > them
> >> > > > > into
> >> > > > > >>> 1.4.
> >> > > > > >>> >>
> >> > > > > >>> >> Thanks
> >> > > > > >>> >> Venki
> >> > > > > >>> >>
> >> > > > > >>> >> On Tue, Dec 1, 2015 at 10:43 AM, Parth Chandra <
> >> > > par...@apache.org
> >> > > > >
> >> > > > > >>> wrote:
> >> > > > > >>> >>
> >> > > > > >>> >>> +1 on creating the release branch today.
> >> > > > > >>> >>> I'd like to get DRILL-4053 in. Patch is ready and doing
> >> one
> >> > > 

Re: Create 1.4 Release branch soon?

2015-12-01 Thread Steven Phillips
Sure, I'll see if I can merge 4108, and look into 4145.

On Tue, Dec 1, 2015 at 10:10 PM, Jacques Nadeau  wrote:

> It seems like 4108 and 4145 should also be addressed. Steven, can you take
> a look at trying to get these merged/resolved?  (4145 might be related to
> 4108 or is otherwise an off-by-one issue, it seems.)
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Dec 1, 2015 at 6:22 PM, Jacques Nadeau  wrote:
>
> > It seems like we should also try to include:
> >
> > DRILL-4109
> > DRILL-4125
> > DRILL-4111
> >
> > 4111 is ready to merge if someone wants to merge it. It looks like
> > Sudheesh just needs to commit.
> >
> > For 4125/4109, it seems like we should get Vicki's feedback if those
> > changes are good for merge or if we should punt.
> >
> >
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Dec 1, 2015 at 3:59 PM, Venki Korukanti <
> venki.koruka...@gmail.com
> > > wrote:
> >
> >> I can manage the release.
> >>
> >> Here are the pending JIRAs so far:
> >>
> >> 1) DRILL-4053
> >>
> >> Let me know if you have any pending patches and want to get them into
> 1.4.
> >>
> >> Thanks
> >> Venki
> >>
> >> On Tue, Dec 1, 2015 at 10:43 AM, Parth Chandra 
> wrote:
> >>
> >> > +1 on creating the release branch today.
> >> > I'd like to get DRILL-4053 in. Patch is ready and doing one more round
> >> of
> >> > regression tests.
> >> >
> >> >
> >> >
> >> > On Tue, Dec 1, 2015 at 9:56 AM, Jacques Nadeau 
> >> wrote:
> >> >
> >> > > It is December (already!). Seems like we should create a 1.4 release
> >> > branch
> >> > > soon so we can get a vote going. Does someone want to volunteer as
> the
> >> > > release manager?
> >> > >
> >> > > thanks,
> >> > > Jacques
> >> > >
> >> > > --
> >> > > Jacques Nadeau
> >> > > CTO and Co-Founder, Dremio
> >> > >
> >> >
> >>
> >
> >
>


Re: ExternalSort doesn't properly account for sliced buffers

2015-11-20 Thread Steven Phillips
I think it is because we can't actually properly account for sliced
buffers. I don't remember for sure, but I think it might be because calling
buf.capacity() on a sliced buffer returns the capacity of the root buffer,
not the size of the slice. That may not be correct, but I think it was
something like that. Whatever it was, I am pretty sure it was giving wrong
results when there were sliced buffers.

I think we need to get the new allocator, along with proper transfer of
ownership in order to do this correctly. Then we can just query the
allocator rather than trying to track it separately.
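
Roughly what I have in mind, as a sketch only (assuming the new allocator
exposes something like getAllocatedMemory(), which is what I remember):

  // Instead of walking each DrillBuf and summing buf.capacity(), ask the
  // operator's own allocator how much memory it currently owns. With
  // proper transfer of ownership, sliced/child buffers are counted
  // exactly once, instead of being skipped or double-counted.
  private long getBufferSize() {
    return allocator.getAllocatedMemory();
  }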

On Fri, Nov 20, 2015 at 11:25 AM, Abdel Hakim Deneche  wrote:

> I'm looking at the external sort code and it uses the following method to
> compute the allocated size of a batch:
>
>   private long getBufferSize(VectorAccessible batch) {
> > long size = 0;
> > for (VectorWrapper w : batch) {
> >   DrillBuf[] bufs = w.getValueVector().getBuffers(false);
> >   for (DrillBuf buf : bufs) {
> > if (*buf.isRootBuffer()*) {
> >   size += buf.capacity();
> > }
> >   }
> > }
> > return size;
> >   }
>
>
> This method only accounts for root buffers, but when we have a receiver
> below the sort, most (if not all) of the buffers are child buffers. This may
> delay spilling, and increase the memory usage of the drillbit. If my
> computations are correct, for a single query, one drillbit can allocate up
> to 40GB without spilling once to disk.
>
> Is there a specific reason we only account for root buffers ?
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   
>
>
> Now Available - Free Hadoop On-Demand Training
> <
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> >
>


Re: Java graphical application being launched during the Drill build?

2015-11-13 Thread Steven Phillips
I actually see it when running without tests as well.

On Fri, Nov 13, 2015 at 10:55 AM, Hsuan Yi Chu  wrote:

> Not a bad feature, which gives a visualization of unit test completion.
>
> On Fri, Nov 13, 2015 at 10:27 AM, Parth Chandra  wrote:
>
> > Yes I see it too. Just a minor annoyance I thought.
> >
> > On Mon, Nov 9, 2015 at 2:59 PM, Sudheesh Katkam 
> > wrote:
> >
> > > I did, on my Mac. However, I haven’t looked into it.
> > >
> > > > On Nov 9, 2015, at 2:57 PM, Jason Altekruse <
> altekruseja...@gmail.com>
> > > wrote:
> > > >
> > > > Hello all,
> > > >
> > > > Has anyone else noticed a java graphical application starting up when
> > > > running the full drill build with test? On my mac I can clearly see a
> > new
> > > > icon appear on my task bar for a generic java application after I
> > launch
> > > > the build and tests from the command line. I only started seeing this
> > > > recently, I don't remember seeing any mail about a change in the
> build
> > > that
> > > > would have caused this.
> > > >
> > > > Has anyone else seen this? I'll be looking into it a little more,
> > trying
> > > to
> > > > identify when it comes up and find the test or build phase that is
> > > spawning
> > > > it, but I thought I'd just ask for any ideas to get me started.
> > > >
> > > > Thanks,
> > > > Jason
> > >
> > >
> >
>


[jira] [Created] (DRILL-4081) Handle schema changes in ExternalSort

2015-11-12 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-4081:
--

 Summary: Handle schema changes in ExternalSort
 Key: DRILL-4081
 URL: https://issues.apache.org/jira/browse/DRILL-4081
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Steven Phillips
Assignee: Steven Phillips


This improvement will make use of the Union vector to handle schema changes. 
When a new schema appears, the schema will be "merged" with the previous 
schema. The result will be a new schema that uses Union type to store the 
columns where there is a type conflict. All of the batches (including the 
batches that have already arrived) will be coerced into this new schema.

A new comparison function will be included to handle the comparison of Union 
type. Comparison of union type will work as follows:

1. All numeric types can be mutually compared, and will be compared using Drill 
implicit cast rules.

2. All other types will not be compared against other types, but only among 
values of the same type.

3. There will be an overall precedence of types with regards to ordering. This 
precedence is not yet defined, but will be as part of the work on this issue.
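
As an illustrative sketch only (helper names are placeholders, not the final 
API), the comparison could look like:

{code}
int compareUnion(UnionHolder left, UnionHolder right) {
  if (isNumeric(left.type) && isNumeric(right.type)) {
    // Rule 1: numeric types are mutually comparable via implicit casts.
    return compareWithImplicitCast(left, right);
  }
  if (left.type != right.type) {
    // Rule 3: otherwise, order by the overall type precedence (TBD).
    return Integer.compare(precedence(left.type), precedence(right.type));
  }
  // Rule 2: same non-numeric type, compare the values directly.
  return compareSameType(left, right);
}
{code}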




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] Release Apache Drill 1.3.0 (rc2)

2015-11-12 Thread Steven Phillips
Does DRILL-4070 cause incorrect results? Or just prevent partition pruning?

On Thu, Nov 12, 2015 at 10:32 AM, Jason Altekruse 
wrote:

> I just commented on the JIRA, we are behaving correctly for newly created
> parquet files. I did confirm the failure to prune on auto-partitioned files
> created by 1.2. I do not think this is a release blocker, because I do not
> think we can solve this in Drill code without risking wrong results over
> parquet files written by other tools. I do support the creation of a
> migration utility for existing files written by Drill 1.2, but this can be
> released independent of 1.3.
>
>
> On Thu, Nov 12, 2015 at 10:26 AM, Jinfeng Ni 
> wrote:
>
> > Agree with Aman that DRILL-4070 is a show stopper. Parquet is the
> > major data source Drill uses. If this release candidate breaks the
> > backward compatibility of partitioning pruning for the parquet files
> > created with prior release of Drill, it could cause serious problem
> > for the current Drill user.
> >
> > -1
> >
> >
> >
> > On Thu, Nov 12, 2015 at 10:10 AM, rahul challapalli
> >  wrote:
> > > -1 (non-binding)
> > > The nature of the issue (DRILL-4070) demands adequate testing even
> with a
> > > workaround in place.
> > >
> > > On Thu, Nov 12, 2015 at 9:32 AM, Aman Sinha 
> > wrote:
> > >
> > >> Given this issue, I would be a -1  unfortunately.
> > >>
> > >> On Thu, Nov 12, 2015 at 8:42 AM, Aman Sinha 
> > wrote:
> > >>
> > >> > Can someone familiar with the parquet changes take a look at
> > DRILL-4070 ?
> > >> > It seems to break backward compatibility.
> > >> >
> > >> > On Tue, Nov 10, 2015 at 9:51 PM, Jacques Nadeau  >
> > >> > wrote:
> > >> >
> > >> >> Hey Everybody,
> > >> >>
> > >> >> I'd like to propose a new release candidate of Apache Drill,
> version
> > >> >> 1.3.0.  This is the third release candidate (rc2).  This addresses
> > some
> > >> >> issues identified in the second release candidate including
> some
> > >> test
> > >> >> issues & rpc concurrency issues.
> > >> >>
> > >> >> The tarball artifacts are hosted at [2] and the maven artifacts are
> > >> hosted
> > >> >> at [3]. This release candidate is based on commit
> > >> >> 13ab6b1f9897ebcf9179407ffaf84b79b0ee95a1 located at [4].
> > >> >> The vote will be open for 72 hours ending at 10PM Pacific, November
> > 13,
> > >> >> 2015.
> > >> >>
> > >> >> [ ] +1
> > >> >> [ ] +0
> > >> >> [ ] -1
> > >> >>
> > >> >> thanks,
> > >> >> Jacques
> > >> >>
> > >> >> [1]
> > >> >>
> > >> >>
> > >>
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313820&version=12332946
> > >> >> [2]http://people.apache.org/~jacques/apache-drill-1.3.0.rc2/
> > >> >> [3]
> > >> >>
> > https://repository.apache.org/content/repositories/orgapachedrill-1013/
> > >> >> [4] https://github.com/jacques-n/drill/tree/drill-1.3.0
> > >> >>
> > >> >>
> > >> >> --
> > >> >> Jacques Nadeau
> > >> >> CTO and Co-Founder, Dremio
> > >> >>
> > >> >
> > >> >
> > >>
> >
>


Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

2015-11-09 Thread Steven Phillips
+1 on merging this soon.

Going forward, I agree it makes sense to break the RPC module into a
stand-alone module that is not specific to drill. But whether it is better
for it live in the Drill project or in the new Vector project, I am not
sure.

On Sun, Nov 8, 2015 at 6:42 PM, Jacques Nadeau  wrote:

> FYI, the patch also just successfully completed the extended regression
> suite.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Sun, Nov 8, 2015 at 5:09 PM, Jacques Nadeau  wrote:
>
> > Ok guys,
> >
> > I took the quiet time directly after the release candidate went out to do
> > the first phase of componentization. You can see my work at [1].
> >
> > This set of commits has little functional impact. I've also done my best
> > to avoid package or file renaming, rather keeping things in their same
> > packages but in different modules (so that other patches are more easily
> > applied). There are nine commits in the branch. They break down into
> three
> > categories: MOVE, REFACTOR & CLEANUP.
> >
> > I've separated the changes out so that it should be reasonably
> > straightforward to review. The MOVE patches are constrained primarily to
> > moving files from module to another.
> >
> > DRILL-3987: (MOVE) Extract key vector, field reader, complex/field wr… …
> > 21cbd84
> > DRILL-3987: (REFACTOR) Common and Vector modules building. … e390db9
> > DRILL-3987: (REFACTOR) Working TPCH unit tests … 2cc1d30
> > DRILL-3987: (MOVE) Extract RPC, memory-base and memory-impl as separa… …
> > d5f3211
> > DRILL-3987: (REFACTOR) Extract BoundsChecking check from AssertionUti… …
> > 83c53d8
> > DRILL-3987: (CLEANUP) Delete unused files 5d596d5
> > DRILL-3987: (REFACTOR) Remove any parent Drill dependencies for drill… …
> > 76f578c
> > DRILL-3987: (MOVE) Move logical expressions and operators out of comm… …
> > f908b8b
> > DRILL-3987: (CLEANUP) Final cleanups to get complete working build/di… …
> > d09aa3b
> >
> > The main goal was to extract a number of separate java-exec submodules.
> > I've also outlined the modularization in a couple slides at [2]. In those
> > slides you'll see that there are some orange dependencies that will need
> to
> > be removed in a second phase of effort. We also need to decide which
> > portions of the third slide at [2] would be appropriate as a separate
> > project versus maintained inside of Drill.
> >
> > Some of the dependencies will need a finer grained hand to separate. The
> > biggest remaining is cleaning up VectorDescriptor, MaterializedField,
> > SerializedField, SchemaPath and FieldReference so that vector can stop
> > depending on the new drill-logical module.
> >
> > My preference would be to merge this straight away as the functional
> > impact is limited and it would be exceedingly difficult to maintain this
> > patch. This patch set provides a complete set of changes for
> modularization
> > and passes all unit tests. I'm running the extended regression suite now
> to
> > confirm no impact on those issues. I don't expect any since the only bugs
> > I've had to track down thus far are drill-module or pom dependency
> issues.
> >
> > Let me know your thoughts.
> >
> > [1] https://github.com/apache/drill/pull/250
> > [2]
> >
> https://docs.google.com/presentation/d/1HD-EzAgNe4EJvoP91ILFLFJdFjT2T5yfM9MEv79BaiM/edit
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Oct 27, 2015 at 5:59 PM, Jacques Nadeau 
> > wrote:
> >
> >> Yes, I've started the umbrella @
> >> https://issues.apache.org/jira/browse/DRILL-3986
> >>
> >> And the first sub task: extraction poc @
> >> https://issues.apache.org/jira/browse/DRILL-3987
> >>
> >> I posted some existing materials. I'll start looking at how we can
> >> extract. Would love others thoughts about how we might slice things.
> I'll
> >> post some initial thoughts on the jiras in this regard.
> >>
> >> --
> >> Jacques Nadeau
> >> CTO and Co-Founder, Dremio
> >>
> >> On Tue, Oct 27, 2015 at 5:39 PM, Julian Hyde  wrote:
> >>
> >>> Jacques, Can you please log the JIRA case you mentioned, and also
> attach
> >>> any documentation (e.g. javadoc) you already have.
> >>>
> >>>
> >>
> >
>


Re: [DISCUSS] Ideas to improve metadata cache read performance

2015-10-30 Thread Steven Phillips
My view on storing it in some other format is that, yes, it will probably
reduce the size of the file, but if we gzip the json file, it should be
pretty compact. As for deserialization cost, other formats would be faster,
but not dramatically faster. Certainly not the order of magnitude faster
that we really need it to be. The reason we chose JSON is that it is
readable and easier to deal with.

As for the old code, I can point you at a branch, but it's probably not
very helpful. Unless we want to essentially disable value-based partition
pruning when using the cache, the old code will not work.

My recommendation would be to come up with a new version of the format
which stores only the name and value of columns which are single-valued for
each file or row group. This will allow partition pruning to work, but some
count queries may not be as fast any more, because the cache won't have
column value counts on a per-rowgroup basis any more.
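
As a sketch of the shape only (field names are made up, not a proposal for
the final format), a per-file entry could shrink to something like:

  {
    "path" : "/data/events/2015-07-01/0_0_0.parquet",
    "rowCount" : 100000,
    "singleValued" : [
      { "name" : "dir0", "value" : "2015-07-01" },
      { "name" : "type", "value" : "click" }
    ]
  }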

Anyway, here is the link to the original branch.

https://github.com/StevenMPhillips/drill/tree/meta

On Fri, Oct 30, 2015 at 3:01 PM, Parth Chandra  wrote:

> Hey Jacques, Steven,
>
>   Do we have a branch somewhere which has the initial prototype code? I'd
> like to prune the file a bit as it looks like reducing the size of the
> metadata cache file might yield the best results.
>
>   Also, did we have a particular reason for going with JSON as opposed to a
> more compact binary format? Are there any arguments against saving this as
> a protobuf/BSON/Parquet file?
>
> Parth
>
> On Mon, Oct 26, 2015 at 2:42 PM, Jacques Nadeau 
> wrote:
>
> > My first thought is we've gotten too generous in what we're storing in
> the
> > Parquet metadata file. Early implementations were very lean and it seems
> > far larger today. For example, early implementations didn't keep
> statistics
> > and ignored row groups (files, schema and block locations only). If we
> need
> > multiple levels of information, we may want to stagger (or normalize)
> them
> > in the file. Also, we may think about what is the minimum that must be
> done
> > in planning. We could do the file pruning at execution time rather than
> > single-tracking these things (makes stats harder though).
> >
> > I also think we should be cautious around jumping to a conclusion until
> > DRILL-3973 provides more insight.
> >
> > In terms of caching, I'd be more inclined to rely on file system caching
> > and make sure serialization/deserialization is as efficient as possible
> as
> > opposed to implementing an application-level cache. (We already have
> enough
> > problems managing memory without having to figure out when we should
> drop a
> > metadata cache :D).
> >
> > Aside, I always liked this post for entertainment and the thoughts on
> > virtual memory: https://www.varnish-cache.org/trac/wiki/ArchitectNotes
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Mon, Oct 26, 2015 at 2:25 PM, Hanifi Gunes 
> wrote:
> >
> > > One more thing, for workloads running queries over subsets of same
> > parquet
> > > files, we can consider maintaining an in-memory cache as well. Assuming
> > > metadata memory footprint per file is low and parquet files are static,
> > not
> > > needing us to invalidate the cache often.
> > >
> > > H+
> > >
> > > On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes 
> > wrote:
> > >
> > > > I am not familiar with the contents of metadata stored but if
> > > > deserialization workload seems to be fitting to any of afterburner's
> > > > claimed improvement points [1] It could well be worth trying given
> the
> > > > claimed gain on throughput is substantial.
> > > >
> > > > It could also be a good idea to partition caching over a number of
> > files
> > > > for better parallelization given number of cache files generated is
> > > > *significantly* less than number of parquet files. Maintaining global
> > > > statistics seems an improvement point too.
> > > >
> > > >
> > > > -H+
> > > >
> > > > 1:
> > > >
> > >
> >
> https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
> > > >
> > > > On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha 
> > > wrote:
> > > >
> > > >> Forgot to include the link for Jackson's AfterBurner module:
> > > >>   https://github.com/FasterXML/jackson-module-afterburner
> > > >>
> > > >> On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha 
> > > wrote:
> > > >>
> > > >> > I was going to file an enhancement JIRA but thought I will discuss
> > > here
> > > >> > first:
> > > >> >
> > > >> > The parquet metadata cache file is a JSON file that contains a
> > subset
> > > of
> > > >> > the metadata extracted from the parquet files.  The cache file can
> > get
> > > >> > really large .. a few GBs for a few hundred thousand files.
> > > >> > I have filed a separate JIRA: DRILL-3973 for profiling the various
> > > >> aspects
> > > >> > of 

Re: [DISCUSS] Ideas to improve metadata cache read performance

2015-10-29 Thread Steven Phillips
I agree that this would present a small challenge for testing, but I don't
think ease of testing should be the primary motivator in designing the
software. Once we've decided what we want the software to do, then we can
work together to figure out how to test it.

On Thu, Oct 29, 2015 at 11:09 AM, rahul challapalli <
challapallira...@gmail.com> wrote:

> @steven If we end up pushing the partition pruning to the execution phase,
> how would we know that partition pruning even took place. I am thinking
> from the standpoint of adding functional tests around partition pruning.
>
> - Rahul
>
> On Wed, Oct 28, 2015 at 10:53 AM, Parth Chandra <par...@apache.org> wrote:
>
> > And ideally, I suppose, the merged schema would correspond to the
> > information that we want to keep in a .drill file.
> >
> >
> > On Tue, Oct 27, 2015 at 4:55 PM, Aman Sinha <asi...@maprtech.com> wrote:
> >
> > > @Steven, w.r.t to your suggestion about doing the metadata operation
> > during
> > > execution phase, see the related discussion in DRILL-3838.
> > >
> > > A couple of more thoughts:
> > >  - Parth and I were discussing keeping track of the merged schema as
> part
> > > of the refresh metadata and storing the merged schema for all files
> that
> > > have the identical schema (currently this is repeated and is a huge
> > > contributor to the size of the file).   To Jacques' point about keeping
> > > minimum information needed for planning purposes,  we certainly could
> do
> > a
> > > better job in keeping it lean.   The row count of the table could be
> > > computed at the time of running refresh metadata command.  Similarly
> the
> > > analysis of single-value can be done at that time instead of on a
> > per-query
> > > basis.
> > >
> > >  - We should revisit DRILL-2517(
> > > https://issues.apache.org/jira/browse/DRILL-2517)
> > >   Consider the following 2 queries and their total elapsed times
> against
> > a
> > > table with 31 files:
> > > (A) SELECT  count(*) FROM table WHERE `date` = '2015-07-01';
> > >   elapsed time: 980 secs
> > >
> > > (B) SELECT count(*) FROM  `table/20150701` ;
> > >   elapsed time: 54 secs
> > >
> > > From the user perspective, both queries should perform nearly the
> > same,
> > > which was essentially the intent of DRILL-2517.
> > >
> > >
> > > On Tue, Oct 27, 2015 at 12:04 PM, Steven Phillips <ste...@dremio.com>
> > > wrote:
> > >
> > > > I think we need to come up with a way to push partition pruning to
> > > > execution time.  The other solutions may relieve the problem in some
> > > cases,
> > > > but won't solve the fundamental problem.
> > > >
> > > > For example, even if we do figure out how to use multiple threads for
> > > > reading the metadata, that may be fine for a couple hundred thousand
> > > files,
> > > > but what about when we have millions or tens of millions of files? It
> > > will
> > > > still be a huge bottleneck.
> > > >
> > > > I actually think we should use the Drill execution engine to probe
> the
> > > > metadata and generate the work assignments. We could have an
> additional
> > > > fragment or fragments of the query that would recursively probe the
> > > > filesystem, read the metadata, and make assignments, and then pipe
> the
> > > > results into the Scanners, which will create readers on the fly. This
> > way
> > > > the query could actually begin doing work before the metadata has
> even
> > > been
> > > > fully read.
> > > >
> > > > On Mon, Oct 26, 2015 at 2:42 PM, Jacques Nadeau <jacq...@dremio.com>
> > > > wrote:
> > > >
> > > > > My first thought is we've gotten too generous in what we're storing
> > in
> > > > the
> > > > > Parquet metadata file. Early implementations were very lean and it
> > > seems
> > > > > far larger today. For example, early implementations didn't keep
> > > > statistics
> > > > > and ignored row groups (files, schema and block locations only). If
> > we
> > > > need
> > > > > multiple levels of information, we may want to stagger (or
> normalize)
> > > > them
> > > > > in the file. Also, we may think about what is the minimum that must
>

Re: Why limit operator keeps calling next() even after the limit is reached ?

2015-10-29 Thread Steven Phillips
I believe kill() will only stop the upstream fragments from sending
batches, but it does nothing about the batches that have already been sent.
When kill() is called on the RawBatchBuffer, this will release all of the
batches in the queue. But I believe it is still necessary to wait for all
remaining batches to arrive, so they can be cleared. It's possible that
it's not necessary to do this in LimitRecordBatch, and that we are handling
this in the RawBatchBuffer. You would have to examine the code to confirm.
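
Roughly, the drain I'm describing looks like this (method names are
approximations of the RecordBatch API, not exact signatures):

  // After the limit is satisfied: tell senders to stop, then keep
  // consuming so batches already in flight get released.
  incoming.kill(true);
  while (incoming.next() != IterOutcome.NONE) {
    for (VectorWrapper<?> w : incoming) {
      w.clear();  // release the buffers of the unwanted batch
    }
  }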

On Thu, Oct 29, 2015 at 11:38 AM, Abdel Hakim Deneche  wrote:

> Hey all,
>
> As part of DRILL-991 when LimitRecordBatch receives enough records it calls
> kill() on it's upstream to inform the remaining operators and fragments
> that they can stop sending batches.
>
> But, limit operator will keep calling next() until it gets a NONE. Is there
> a specific reason for this behavior ?
>
> Thanks
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   
>
>
> Now Available - Free Hadoop On-Demand Training
> <
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> >
>


Re: [DISCUSS] Ideas to improve metadata cache read performance

2015-10-27 Thread Steven Phillips
I think we need to come up with a way to push partition pruning to
execution time.  The other solutions may relieve the problem in some cases,
but won't solve the fundamental problem.

For example, even if we do figure out how to use multiple threads for
reading the metadata, that may be fine for a couple hundred thousand files,
but what about when we have millions or tens of millions of files? It will
still be a huge bottleneck.

I actually think we should use the Drill execution engine to probe the
metadata and generate the work assignments. We could have an additional
fragment or fragments of the query that would recursively probe the
filesystem, read the metadata, and make assignments, and then pipe the
results into the Scanners, which will create readers on the fly. This way
the query could actually begin doing work before the metadata has even been
fully read.

On Mon, Oct 26, 2015 at 2:42 PM, Jacques Nadeau  wrote:

> My first thought is we've gotten too generous in what we're storing in the
> Parquet metadata file. Early implementations were very lean and it seems
> far larger today. For example, early implementations didn't keep statistics
> and ignored row groups (files, schema and block locations only). If we need
> multiple levels of information, we may want to stagger (or normalize) them
> in the file. Also, we may think about what is the minimum that must be done
> in planning. We could do the file pruning at execution time rather than
> single-tracking these things (makes stats harder though).
>
> I also think we should be cautious around jumping to a conclusion until
> DRILL-3973 provides more insight.
>
> In terms of caching, I'd be more inclined to rely on file system caching
> and make sure serialization/deserialization is as efficient as possible as
> opposed to implementing an application-level cache. (We already have enough
> problems managing memory without having to figure out when we should drop a
> metadata cache :D).
>
> Aside, I always liked this post for entertainment and the thoughts on
> virtual memory: https://www.varnish-cache.org/trac/wiki/ArchitectNotes
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Mon, Oct 26, 2015 at 2:25 PM, Hanifi Gunes  wrote:
>
> > One more thing, for workloads running queries over subsets of same
> parquet
> > files, we can consider maintaining an in-memory cache as well. Assuming
> > metadata memory footprint per file is low and parquet files are static,
> not
> > needing us to invalidate the cache often.
> >
> > H+
> >
> > On Mon, Oct 26, 2015 at 2:10 PM, Hanifi Gunes 
> wrote:
> >
> > > I am not familiar with the contents of metadata stored but if
> > > deserialization workload seems to be fitting to any of afterburner's
> > > claimed improvement points [1] It could well be worth trying given the
> > > claimed gain on throughput is substantial.
> > >
> > > It could also be a good idea to partition caching over a number of
> files
> > > for better parallelization given number of cache files generated is
> > > *significantly* less than number of parquet files. Maintaining global
> > > statistics seems an improvement point too.
> > >
> > >
> > > -H+
> > >
> > > 1:
> > >
> >
> https://github.com/FasterXML/jackson-module-afterburner#what-is-optimized
> > >
> > > On Sun, Oct 25, 2015 at 9:33 AM, Aman Sinha 
> > wrote:
> > >
> > >> Forgot to include the link for Jackson's AfterBurner module:
> > >>   https://github.com/FasterXML/jackson-module-afterburner
> > >>
> > >> On Sun, Oct 25, 2015 at 9:28 AM, Aman Sinha 
> > wrote:
> > >>
> > >> > I was going to file an enhancement JIRA but thought I will discuss
> > here
> > >> > first:
> > >> >
> > >> > The parquet metadata cache file is a JSON file that contains a
> subset
> > of
> > >> > the metadata extracted from the parquet files.  The cache file can
> get
> > >> > really large .. a few GBs for a few hundred thousand files.
> > >> > I have filed a separate JIRA: DRILL-3973 for profiling the various
> > >> aspects
> > >> > of planning including metadata operations.  In the meantime, the
> > >> timestamps
> > >> > in the drillbit.log output indicate a large chunk of time spent in
> > >> creating
> > >> > the drill table to begin with, which indicates bottleneck in reading
> > the
> > >> > metadata.  (I can provide performance numbers later once we confirm
> > >> through
> > >> > profiling).
> > >> >
> > >> > A few thoughts around improvements:
> > >> >  - The jackson deserialization of the JSON file is very slow.. can
> > this
> > >> be
> > >> > speeded up ? .. for instance the AfterBurner module of jackson
> claims
> > to
> > >> > improve performance by 30-40% by avoiding the use of reflection.
> > >> >  - The cache file read is a single threaded process.  If we were
> > >> directly
> > >> > reading from parquet files, we use a default of 16 threads.  What
> can
> > be
> > >> > done 

Re: HBase get.setTimeRange() support

2015-10-21 Thread Steven Phillips
I would think if we want to expose the timestamp field, we should add
another layer of nesting. In other words, every qualifier, which is
currently a single value, would actually be a map, which includes a value
field and timestamp field. Of course, we could also take it a step further
and expose versions as well, in which case each qualifier would be a
repeated map.
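
A query against that shape might then look something like this (column
family f and qualifier q are made up for the example):

  SELECT CONVERT_FROM(t.f.q.`value`, 'UTF8') AS v,
         t.f.q.`timestamp` AS ts
  FROM hbase.mytable t;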

There's also the idea of using table functions to modify the table scan
using the API of the underlying storage if there is something that is not
easily expressed via SQL.

On Wed, Oct 21, 2015 at 12:05 PM, Jacques Nadeau  wrote:

> We've talked about how to expose this but haven't yet exposed it. What do
> you think a good way to expose this would be? A separate set of columns
> (e.g. qualifier and qualifier_timestamp)? Do you also need access to
> multiple versions or only the timestamp?
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Oct 20, 2015 at 6:33 PM, Grant Priestley <
> grant.priest...@colesfinancialservices.com.au> wrote:
>
> > Hi Drill Developers,
> >
> >
> >
> > I am hoping to introduce Drill into a Big Data environment in the hopes
> of
> > supporting analysts with temporal queries off HBase.  After going through
> > the documentation online, I haven’t been able to find if Drill is able
> to
> > leverage HBase’s get.setTimeRange() when querying data within HBase.
> Does
> > this functionality exist in Drill 1.2.0?
> >
> >
> >
> > Cheers,
> >
> > Grant
> >
> >
> >
> > Grant Priestley
> > Domain Architect (Analytics) | Coles Financial Services
> >
> > L1 M12 800 Toorak Road Hawthorn East Victoria 3123 Australia
> > T  |  +61 416 320 936   e  |
> > grant.priest...@colesfinancialservices.com.au
> >
> >
> >
> >
>


Re: List type

2015-10-19 Thread Steven Phillips
In the work I did for the Union types (see PR
https://github.com/apache/drill/pull/207), I actually went down that exact
path. In that branch, if Union type is enable, any vectors created through
the ComplexWriter interface will not create any Repeated type vectors.

On Mon, Oct 19, 2015 at 2:29 PM, Hanifi Gunes  wrote:

> If I am not wrong, currently we use
> i) RepeatedInt for single
> ii) RepeatedList of RepeatedInt for double
> iii) RepeatedList of RepeatedList of RepeatedInt for triple arrays.
>
> I think we should refactor vector design in such way that we will only have
> a ListVector eliminating the need for all Repeated* vectors as well as code
> generation for those so that we would represent all these above types via
> i) ListVector of IntVector
> ii) ListVector of ListVector of IntVector
> iii) ListVector of ListVector of ListVector of IntVector
>
> The idea here is to favor aggregation over inheritance, which is less
> redundant and more powerful. Thinking about it, we do not even need to
> maintain RepeatedMapVector as it will simply be ListVector of MapVector in
> the new dialect.
>
> -Hanifi
>
> ps: As an fyi, even though it does not include a JIRA for abstracting out a
> ListVector which I discussed over the past months with many devs, [1] has a
> list of items in place for refactoring vectors (and possibly the type
> system).
>
> 1: https://issues.apache.org/jira/browse/DRILL-2147
>
>
> On Mon, Oct 19, 2015 at 1:28 PM, Julien Le Dem  wrote:
>
> > I'm looking at the type system in Drill and I have the following
> question:
> > Why is there a LIST type and a REPEATED field?
> > It sounds like there should only one of those 2 concepts.
> > Could someone describe how the following are represented?
> > - one dimensional list of int
> > - 2 dimensional list of ints
> > - 3 dimensional list of ints
> > Thank you
> >
> > --
> > Julien
> >
>


Re: flatten() function, scalar functions, nested ?

2015-10-12 Thread Steven Phillips
I personally think the current usage of flatten is very unintuitive and
confusing, and I think the BigQuery usage is much better. If I were
designing this function from scratch, I would not allow using flatten in
the select clause and only allow it as a table function.

For example, take this table:

0: jdbc:drill:drillbit=localhost> select * from t6;
+--------+----+
|   a    | b  |
+--------+----+
| [1,2]  | 1  |
| [3]    | 2  |
| [3]    | 3  |
+--------+----+


It's not clear how to flatten the column a, and return all of the columns.
The closest option is:

0: jdbc:drill:drillbit=localhost> select *, flatten(a) as a_flattened from
t6;
+--------+----+--------------+
|   a    | b  | a_flattened  |
+--------+----+--------------+
| [1,2]  | 1  | 1            |
| [1,2]  | 1  | 2            |
| [3]    | 2  | 3            |
| [3]    | 3  | 3            |
+--------+----+--------------+

But that doesn't really make any sense. Why would I want the original,
unflattened column, duplicated multiple times?

Anyway, I think this usage is very confusing. Is there a use case where
this is necessary?
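
For comparison, a table-function form (hypothetical syntax, not something
Drill accepts today) would make the intent explicit:

  SELECT b, a FROM TABLE(FLATTEN(t6, a));
  -- one output row per element of a, as (b, a): (1,1), (1,2), (2,3), (3,3)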



On Mon, Oct 12, 2015 at 5:28 PM, Julian Hyde  wrote:

>
> > On Oct 12, 2015, at 3:42 PM, Jacques Nadeau  wrote:
> >
> > - we have shortcut for a lateral join combined with a table function used
> > in the select clause
>
> It’s funny, Postgres has a short-cut that allows you to use UNNEST in the
> SELECT clause[1]. James and I discussed it for Phoenix Unnest support[2],
> and I’ll recap what I said there.
>
> The semantics of a table expression in the SELECT clause are weird,
> because you get multiple rows out for each row in. It gets even weirder if
> you have more than one table expression in the SELECT clause and some
> non-table expressions too. Presumably it should return the cartesian
> product.
>
> LINQ (and Spark) has “selectMany”, which is like “select” except that the
> expression is a collection and one row is output for each member of the
> collection. Bart de Smet claims that selectMany is powerful enough to
> subsume the other relational operators (see [3] around 48 minutes). So, I’m
> tempted to add “SELECT MANY” to Calcite SQL. But I think the way postgres
> did it — changing the behavior of the SELECT clause if it happens to
> contain an UNNEST function — is wrong. The workaround — learning how to use
> UNNEST or indeed a table function such as FLATTEN in the FROM clause — is
> not too hard.
>
> Julian
>
> [1]
> http://stackoverflow.com/questions/23003601/sql-multiple-unnest-in-single-select-list
> <
> http://stackoverflow.com/questions/23003601/sql-multiple-unnest-in-single-select-list
> >
>
> [2] https://issues.apache.org/jira/browse/PHOENIX-953 <
> https://issues.apache.org/jira/browse/PHOENIX-953>
>
> [3]
> https://channel9.msdn.com/Shows/Going+Deep/Bart-De-Smet-MinLINQ-The-Essence-of-LINQ
> <
> https://channel9.msdn.com/Shows/Going+Deep/Bart-De-Smet-MinLINQ-The-Essence-of-LINQ
> >


[jira] [Created] (DRILL-3912) Common subexpression elimination

2015-10-07 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-3912:
--

 Summary: Common subexpression elimination
 Key: DRILL-3912
 URL: https://issues.apache.org/jira/browse/DRILL-3912
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips


Drill currently will evaluate the full expression tree, even if there are 
redundant subtrees. Many of these redundant evaluations can be eliminated by 
reusing the results from previously evaluated expression trees.

For example,

{code}
select a + 1, (a + 1)* (a - 1) from t
{code}

Will compute the entire (a + 1) expression twice. With CSE, it will only be 
evaluated once.
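
Conceptually, the generated evaluation code after CSE would look like this 
(illustrative only):

{code}
tmp  = a + 1;          // evaluated once
out0 = tmp;
out1 = tmp * (a - 1);  // reuses the cached subexpression
{code}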

The benefit will be reducing the work done when evaluating expressions, as well 
as reducing the amount of code that is generated, which could also lead to 
better JIT optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-3909) Decimal round functions corrupts input data

2015-10-07 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-3909:
--

 Summary: Decimal round functions corrupts input data
 Key: DRILL-3909
 URL: https://issues.apache.org/jira/browse/DRILL-3909
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips
 Fix For: 1.3.0


The Decimal 28 and 38 round functions, instead of creating a new buffer and 
copying data from the incoming buffer, set the output buffer equal to the input 
buffer, and then subsequently mutate the data in that buffer. This causes the 
data in the input buffer to be corrupted.

A simple example to reproduce:
{code}
$ cat a.json
{ a : "9.95678" }


0: jdbc:drill:drillbit=localhost> create table a as select cast(a as 
decimal(38,18)) a from `a.json`;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
1 row selected (0.206 seconds)
0: jdbc:drill:drillbit=localhost> select round(a, 9) from a;
+-------+
|EXPR$0 |
+-------+
| 10.0  |
+-------+
1 row selected (0.121 seconds)
0: jdbc:drill:drillbit=localhost> select round(a, 11) from a;
+--------+
| EXPR$0 |
+--------+
| 9.957  |
+--------+
1 row selected (0.115 seconds)
0: jdbc:drill:drillbit=localhost> select round(a, 9), round(a, 11) from a;
+-------+--------+
|EXPR$0 | EXPR$1 |
+-------+--------+
| 10.0  | 1.000  |
+-------+--------+
{code}

In the third example, there are two round expressions operating on the same 
incoming decimal vector, and you can see that the result for the second 
expression is incorrect.

Not critical because Decimal type is considered alpha right now.
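
The shape of the fix is to copy before mutating, along these lines 
(illustrative pseudo-code, not the actual generated function):

{code}
// buggy: the output aliases the input, so rounding in place corrupts
// the source vector
out.buffer = in.buffer;
roundInPlace(out.buffer);

// fixed: round a freshly allocated copy, leaving the input intact
int len = in.end - in.start;
out.buffer = allocator.buffer(len);
out.buffer.setBytes(0, in.buffer, in.start, len);
roundInPlace(out.buffer);
{code}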



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] Release Apache Drill 1.2.0 (rc0)

2015-10-07 Thread Steven Phillips
I think we should do a new candidate. We have two fixes that seem somewhat
important.

On Wed, Oct 7, 2015 at 10:37 AM, Abdel Hakim Deneche 
wrote:

> the only way to include any new fixes into 1.2.0 is to sink the current
> release candidate and start another one.
>
> is DRILL-3892 a must have for 1.2.0 ?
>
> Thanks
>
> On Wed, Oct 7, 2015 at 10:29 AM, Aman Sinha  wrote:
>
> > DRILL-3901 patch is merged and I took the liberty to mark it for 1.2.0
> but
> > will leave it up to the release manager to decide how to incorporate
> it.  I
> > also strongly feel that DRILL-3892 (incorrectly reported status of
> metadata
> > cache *not* being used when in fact it was used) should be included in
> 1.2
> > .
> >
> > On Wed, Oct 7, 2015 at 8:58 AM, Jacques Nadeau 
> wrote:
> >
> > > We're seeing an issue around the JDBC plugin. We're still debugging but
> > it
> > > might also warrant a fix for the release. Updates soon.
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Tue, Oct 6, 2015 at 11:45 AM, Aman Sinha 
> wrote:
> > >
> > > > I have filed DRILL-3901 for a performance issue that we are trying to
> > > > address.  We can discuss whether to continue with the existing
> release
> > > > candidate or wait for a fix.
> > > >
> > > > On Tue, Oct 6, 2015 at 9:38 AM, Edmon Begoli 
> > wrote:
> > > >
> > > > > Humbly, +1.
> > > > >
> > > > > On Tue, Oct 6, 2015 at 12:32 PM, Abdel Hakim Deneche <
> > > > > adene...@maprtech.com>
> > > > > wrote:
> > > > >
> > > > > > verified the artifacts checksums and that they are signed by my
> gpg
> > > > key.
> > > > > > Built Drill from source in MacOS and CentOS and both builds were
> > > > > successful
> > > > > > and all unit tests passed. Run some window functions queries and
> > > > > everything
> > > > > > seems fine.
> > > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > On Mon, Oct 5, 2015 at 1:59 PM, Abdel Hakim Deneche <
> > > > > adene...@maprtech.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Aman, I used JIRA release notes generator. It includes all
> JIRAs
> > > > marked
> > > > > > > "Fix for" 1.2.0. I guess we just need to move all JIRAs still
> > open
> > > > and
> > > > > > > marked as 1.2.0 to 1.3.0 or Future.
> > > > > > >
> > > > > > > On Mon, Oct 5, 2015 at 1:54 PM, Aman Sinha <
> asi...@maprtech.com>
> > > > > wrote:
> > > > > > >
> > > > > > >> I see the following in the release notes:  this is not
> supported
> > > > yet.
> > > > > > Are
> > > > > > >> you using the correct 'status' condition in your query ?
> > > > > > >>
> > > > > > >>- [DRILL-3534 <
> > > https://issues.apache.org/jira/browse/DRILL-3534
> > > > >]
> > > > > -
> > > > > > >>Insert into table support
> > > > > > >>
> > > > > > >>
> > > > > > >> On Mon, Oct 5, 2015 at 1:16 PM, Abdel Hakim Deneche <
> > > > > > >> adene...@maprtech.com>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > One precision, the commit that should show up in the release
> > is
> > > > the
> > > > > > >> > following:
> > > > > > >> >
> > > > > > >> > b418397790e7e00505846d48bc6458d710c00095
> > > > > > >> > upgrading maven-release plugin to fix release issues
> > > > > > >> >
> > > > > > >> > master has already moved past that commit
> > > > > > >> >
> > > > > > >> > thanks
> > > > > > >> >
> > > > > > >> > On Mon, Oct 5, 2015 at 11:00 AM, Abdel Hakim Deneche <
> > > > > > >> > adene...@maprtech.com>
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> > > Hey all,
> > > > > > >> > >
> > > > > > >> > > I'm happy to propose a new release of Apache Drill,
> version
> > > > 1.2.0.
> > > > > > >> This
> > > > > > >> > is
> > > > > > >> > > the first release candidate (rc0).
> > > > > > >> > >
> > > > > > >> > > Thanks to everyone who contributed to this release, we
> have
> > > more
> > > > > > than
> > > > > > >> 200
> > > > > > >> > > closed and resolved JIRAs
> > > > > > >> > > [1].
> > > > > > >> > >
> > > > > > >> > > The tarball artifacts are hosted at [2] and the maven
> > > artifacts
> > > > > (new
> > > > > > >> for
> > > > > > >> > > this release) are hosted at [3].
> > > > > > >> > >
> > > > > > >> > > The vote will be open for the next 72 hours ending at 11AM
> > > > > Pacific,
> > > > > > >> > > October 8, 2015.
> > > > > > >> > >
> > > > > > >> > > [ ] +1
> > > > > > >> > > [ ] +0
> > > > > > >> > > [ ] -1
> > > > > > >> > >
> > > > > > >> > > thanks,
> > > > > > >> > > Hakim
> > > > > > >> > >
> > > > > > >> > > [1]
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12332042&projectId=12313820
> > > > > > >> > > [2]
> > > http://people.apache.org/~adeneche/apache-drill-1.2.0-rc0/
> > > > > > >> > > [3]
> > > > > > >> >
> > > > > >
> > > https://repository.apache.org/content/repositories/orgapachedrill-1004
> > > > > > >> > > --
> > > > > 

Re: [VOTE] Release Apache Drill 1.2.0 (rc0)

2015-10-07 Thread Steven Phillips
There is also the jdbc storage issue, which Andrew says he has a fix for.
It's just a packaging problem, but given that it's one of the main features
of this release, I think it's important to get in.

On Wed, Oct 7, 2015 at 10:39 AM, Abdel Hakim Deneche 
wrote:

> sorry, I meant is DRILL-3901 a must have for 1.2.0 ?
>
> On Wed, Oct 7, 2015 at 10:37 AM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> wrote:
>
> > the only way to include any new fixes into 1.2.0 is to sink the current
> > release candidate and start another one.
> >
> > is DRILL-3892 a must have for 1.2.0 ?
> >
> > Thanks
> >
> > On Wed, Oct 7, 2015 at 10:29 AM, Aman Sinha  wrote:
> >
> >> DRILL-3901 patch is merged and I took the liberty to mark it for 1.2.0
> but
> >> will leave it up to the release manager to decide how to incorporate it.
> >> I
> >> also strongly feel that DRILL-3892 (incorrectly reported status of
> >> metadata
> >> cache *not* being used when in fact it was used) should be included in
> 1.2
> >> .
> >>
> >> On Wed, Oct 7, 2015 at 8:58 AM, Jacques Nadeau 
> >> wrote:
> >>
> >> > We're seeing an issue around the JDBC plugin. We're still debugging
> but
> >> it
> >> > might also warrant a fix for the release. Updates soon.
> >> >
> >> > --
> >> > Jacques Nadeau
> >> > CTO and Co-Founder, Dremio
> >> >
> >> > On Tue, Oct 6, 2015 at 11:45 AM, Aman Sinha 
> >> wrote:
> >> >
> >> > > I have filed DRILL-3901 for a performance issue that we are trying
> to
> >> > > address.  We can discuss whether to continue with the existing
> release
> >> > > candidate or wait for a fix.
> >> > >
> >> > > On Tue, Oct 6, 2015 at 9:38 AM, Edmon Begoli 
> >> wrote:
> >> > >
> >> > > > Humbly, +1.
> >> > > >
> >> > > > On Tue, Oct 6, 2015 at 12:32 PM, Abdel Hakim Deneche <
> >> > > > adene...@maprtech.com>
> >> > > > wrote:
> >> > > >
> >> > > > > verified the artifacts checksums and that they are signed by my
> >> gpg
> >> > > key.
> >> > > > > Built Drill from source in MacOS and CentOS and both builds were
> >> > > > successful
> >> > > > > and all unit tests passed. Run some window functions queries and
> >> > > > everything
> >> > > > > seems fine.
> >> > > > >
> >> > > > > +1 (binding)
> >> > > > >
> >> > > > > On Mon, Oct 5, 2015 at 1:59 PM, Abdel Hakim Deneche <
> >> > > > adene...@maprtech.com
> >> > > > > >
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Aman, I used JIRA release notes generator. It includes all
> JIRAs
> >> > > marked
> >> > > > > > "Fix for" 1.2.0. I guess we just need to move all JIRAs still
> >> open
> >> > > and
> >> > > > > > marked as 1.2.0 to 1.3.0 or Future.
> >> > > > > >
> >> > > > > > On Mon, Oct 5, 2015 at 1:54 PM, Aman Sinha <
> asi...@maprtech.com
> >> >
> >> > > > wrote:
> >> > > > > >
> >> > > > > >> I see the following in the release notes:  this is not
> >> supported
> >> > > yet.
> >> > > > > Are
> >> > > > > >> you using the correct 'status' condition in your query ?
> >> > > > > >>
> >> > > > > >>- [DRILL-3534 <
> >> > https://issues.apache.org/jira/browse/DRILL-3534
> >> > > >]
> >> > > > -
> >> > > > > >>Insert into table support
> >> > > > > >>
> >> > > > > >>
> >> > > > > >> On Mon, Oct 5, 2015 at 1:16 PM, Abdel Hakim Deneche <
> >> > > > > >> adene...@maprtech.com>
> >> > > > > >> wrote:
> >> > > > > >>
> >> > > > > >> > One precision, the commit that should show up in the
> release
> >> is
> >> > > the
> >> > > > > >> > following:
> >> > > > > >> >
> >> > > > > >> > b418397790e7e00505846d48bc6458d710c00095
> >> > > > > >> > upgrading maven-release plugin to fix release issues
> >> > > > > >> >
> >> > > > > >> > master has already moved past that commit
> >> > > > > >> >
> >> > > > > >> > thanks
> >> > > > > >> >
> >> > > > > >> > On Mon, Oct 5, 2015 at 11:00 AM, Abdel Hakim Deneche <
> >> > > > > >> > adene...@maprtech.com>
> >> > > > > >> > wrote:
> >> > > > > >> >
> >> > > > > >> > > Hey all,
> >> > > > > >> > >
> >> > > > > >> > > I'm happy to propose a new release of Apache Drill,
> version
> >> > > 1.2.0.
> >> > > > > >> This
> >> > > > > >> > is
> >> > > > > >> > > the first release candidate (rc0).
> >> > > > > >> > >
> >> > > > > >> > > Thanks to everyone who contributed to this release, we
> have
> >> > more
> >> > > > > than
> >> > > > > >> 200
> >> > > > > >> > > closed and resolved JIRAs
> >> > > > > >> > > [1].
> >> > > > > >> > >
> >> > > > > >> > > The tarball artifacts are hosted at [2] and the maven
> >> > artifacts
> >> > > > (new
> >> > > > > >> for
> >> > > > > >> > > this release) are hosted at [3].
> >> > > > > >> > >
> >> > > > > >> > > The vote will be open for the next 72 hours ending at
> 11AM
> >> > > > Pacific,
> >> > > > > >> > > October 8, 2015.
> >> > > > > >> > >
> >> > > > > >> > > [ ] +1
> >> > > > > >> > > [ ] +0
> >> > > > > >> > > [ ] -1
> >> > > > > >> > >
> >> > > > > >> > > thanks,
> >> > > > > >> > > Hakim
> >> > > > > >> > >
> >> > > > > 

Re: [VOTE] Release Apache Drill 1.2.0 (rc0)

2015-10-07 Thread Steven Phillips
I don't think there is one yet. But I think there will be a jira along with
a fix coming shortly from Andrew.

On Wed, Oct 7, 2015 at 10:46 AM, Abdel Hakim Deneche <adene...@maprtech.com>
wrote:

> Steven, what is the JIRA number for the jdbc storage issue ?
>
> Thanks
>
> On Wed, Oct 7, 2015 at 10:41 AM, Steven Phillips <ste...@dremio.com>
> wrote:
>
> > There is also the jdbc storage issue, which Andrew says he has a fix for.
> > It's just a packaging problem, but given that it's one of the main
> features
> > of this release, I think it's important to get in.
> >
> > On Wed, Oct 7, 2015 at 10:39 AM, Abdel Hakim Deneche <
> > adene...@maprtech.com>
> > wrote:
> >
> > > sorry, I meant is DRILL-3901 a must have for 1.2.0 ?
> > >
> > > On Wed, Oct 7, 2015 at 10:37 AM, Abdel Hakim Deneche <
> > > adene...@maprtech.com>
> > > wrote:
> > >
> > > > the only way to include any new fixes into 1.2.0 is to sink the
> current
> > > > release candidate and start another one.
> > > >
> > > > is DRILL-3892 a must have for 1.2.0 ?
> > > >
> > > > Thanks
> > > >
> > > > On Wed, Oct 7, 2015 at 10:29 AM, Aman Sinha <asi...@maprtech.com>
> > wrote:
> > > >
> > > >> DRILL-3901 patch is merged and I took the liberty to mark it for
> 1.2.0
> > > but
> > > >> will leave it up to the release manager to decide how to incorporate
> > it.
> > > >> I
> > > >> also strongly feel that DRILL-3892 (incorrectly reported status of
> > > >> metadata
> > > >> cache *not* being used when in fact it was used) should be included
> in
> > > 1.2
> > > >> .
> > > >>
> > > >> On Wed, Oct 7, 2015 at 8:58 AM, Jacques Nadeau <jacq...@dremio.com>
> > > >> wrote:
> > > >>
> > > >> > We're seeing an issue around the JDBC plugin. We're still
> debugging
> > > but
> > > >> it
> > > >> > might also warrant a fix for the release. Updates soon.
> > > >> >
> > > >> > --
> > > >> > Jacques Nadeau
> > > >> > CTO and Co-Founder, Dremio
> > > >> >
> > > >> > On Tue, Oct 6, 2015 at 11:45 AM, Aman Sinha <asi...@maprtech.com>
> > > >> wrote:
> > > >> >
> > > >> > > I have filed DRILL-3901 for a performance issue that we are
> trying
> > > to
> > > >> > > address.  We can discuss whether to continue with the existing
> > > release
> > > >> > > candidate or wait for a fix.
> > > >> > >
> > > >> > > On Tue, Oct 6, 2015 at 9:38 AM, Edmon Begoli <ebeg...@gmail.com
> >
> > > >> wrote:
> > > >> > >
> > > >> > > > Humbly, +1.
> > > >> > > >
> > > >> > > > On Tue, Oct 6, 2015 at 12:32 PM, Abdel Hakim Deneche <
> > > >> > > > adene...@maprtech.com>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > verified the artifacts checksums and that they are signed by
> > my
> > > >> gpg
> > > >> > > key.
> > > >> > > > > Built Drill from source in MacOS and CentOS and both builds
> > were
> > > >> > > > successful
> > > >> > > > > and all unit tests passed. Run some window functions queries
> > and
> > > >> > > > everything
> > > >> > > > > seems fine.
> > > >> > > > >
> > > >> > > > > +1 (binding)
> > > >> > > > >
> > > >> > > > > On Mon, Oct 5, 2015 at 1:59 PM, Abdel Hakim Deneche <
> > > >> > > > adene...@maprtech.com
> > > >> > > > > >
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > Aman, I used JIRA release notes generator. It includes all
> > > JIRAs
> > > >> > > marked
> > > >> > > > > > "Fix for" 1.2.0. I guess we just need to move all JIRAs
> > still
> > > >> open
> > > >> > > and
> > > >> > 

Re: DRILL-3376 Reading individual files created by CTAS with partition causes an exception

2015-10-07 Thread Steven Phillips
That bug only occurs when the selection is a path to a single file, and
that file is single-valued on the column in the where clause.

The more common use case of querying a directory which contains parquet
files that are each single-valued on a date column does not have this
problem.
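
In other words (paths made up for illustration):

  -- hits the bug: the selection is a single file, single-valued on `date`
  SELECT * FROM dfs.`/tbl/20150701/0_0_0.parquet` WHERE `date` = '2015-07-01';

  -- fine: the selection is the directory
  SELECT * FROM dfs.`/tbl` WHERE `date` = '2015-07-01';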

Are you seeing this or a similar issue in your queries?

On Wed, Oct 7, 2015 at 8:53 AM, Carboni, Andrea 
wrote:

> Hi all,
>
> could be possible to include in Drill 1.2 the fix for this bug (3376)? The
> usage of Parquet files without the possibility of using WHERE conditions on
> dates is very limiting.
>
> Regards,
> Andrea
>
>
>


Re: [UDF] How do I return NULL

2015-10-06 Thread Steven Phillips
In addition, your UDF needs to have the attribute "nulls =
NullHandling.INTERNAL"
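
Putting that together with Hakim's answer below, a minimal sketch (imports
omitted; the function name and the empty-string check are just an example):

  @FunctionTemplate(name = "empty_to_null",
      scope = FunctionTemplate.FunctionScope.SIMPLE,
      nulls = FunctionTemplate.NullHandling.INTERNAL)
  public static class EmptyToNull implements DrillSimpleFunc {
    @Param VarCharHolder in;
    @Output NullableVarCharHolder out;

    public void setup() {}

    public void eval() {
      if (in.end == in.start) {  // empty varchar -> return NULL
        out.isSet = 0;
      } else {
        out.isSet = 1;
        out.buffer = in.buffer;
        out.start = in.start;
        out.end = in.end;
      }
    }
  }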

On Tue, Oct 6, 2015 at 8:32 AM, Abdel Hakim Deneche 
wrote:

> Hi Tug,
>
> Let's say your UDF returns an int, your @output field will be defined like
> this:
>
> @Output NullableIntHolder out;
>
>
> To return a NULL you just have to set:
>
> out.isSet = 0;
>
>
> Thanks
>
> On Tue, Oct 6, 2015 at 1:56 AM, Tugdual Grall  wrote:
>
> > Hello Drillers,
> >
> > I am developing a custom function and I would like to return NULL (based
> on
> > the value, for example if the varchar is '' I want my function to return
> > NULL)
> >
> > I have not found the way to do it.
> >
> >
> > Regards
> > Tug
> > @tgrall
> >
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   
>
>
> Now Available - Free Hadoop On-Demand Training
> <
> http://www.mapr.com/training?utm_source=Email_medium=Signature_campaign=Free%20available
> >
>


[jira] [Resolved] (DRILL-3887) Parquet metadata cache not being used

2015-10-02 Thread Steven Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Phillips resolved DRILL-3887.

Resolution: Fixed

> Parquet metadata cache not being used
> -
>
> Key: DRILL-3887
> URL: https://issues.apache.org/jira/browse/DRILL-3887
> Project: Apache Drill
>  Issue Type: Bug
>    Reporter: Steven Phillips
>Assignee: Mehant Baid
>Priority: Critical
>
> The fix for DRILL-3788 causes a directory to be expanded to its list of files 
> early in the query, and this change causes the ParquetGroupScan to no longer 
> use the parquet metadata file, even when it is there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Timing for 1.2 release, ideas around things to include

2015-10-02 Thread Steven Phillips
I just pushed my fix for DRILL-3887.

On Fri, Oct 2, 2015 at 5:42 PM, Jason Altekruse 
wrote:

> Hey Hakim,
>
> I have been having trouble with the unit tests on my machine today. The
> unit tests passed earlier, but I'm just trying to get a clean run with the
> patch rebased.
>
> - Jason
>
> On Fri, Oct 2, 2015 at 4:10 PM, Abdel Hakim Deneche  >
> wrote:
>
> > Less than an hour left before the 1.2 cutoff (5pm). Any progress on the
> > following issues ?
> >
> > DRILL-2879 Add Remaining portion of "Drill extended json's support $oid"
> > DRILL-3887 Parquet metadata cache not being used
> >
> > Thanks
> >
> >
> > On Fri, Oct 2, 2015 at 1:59 PM, Jim Scott  wrote:
> >
> > > ​It would still be nice to see:
> > > https://issues.apache.org/jira/browse/DRILL-3423
> > > make it in this release.​
> > >
> > > On Fri, Oct 2, 2015 at 3:39 PM, Chris Westin 
> > > wrote:
> > >
> > > > Also https://issues.apache.org/jira/browse/DRILL-3874 , which is
> ready
> > > to
> > > > merge (https://github.com/apache/drill/pull/181).
> > > >
> > > > On Fri, Oct 2, 2015 at 12:43 PM, Abdel Hakim Deneche <
> > > > adene...@maprtech.com>
> > > > wrote:
> > > >
> > > > > DRILL-1065 and 3884 have been merged to master. Remaining issues
> for
> > > 1.2:
> > > > >
> > > > > DRILL-2879 Add Remaining portion of "Drill extended json's support
> > > $oid"
> > > > > DRILL-3887 Parquet metadata cache not being used (regression caused
> > by
> > > > > DRILL-3788)
> > > > > DRILL-2361 Column aliases cannot include dots
> > > > >
> > > > > Any idea when those patches will be merged into master ?
> > > > >
> > > > > Thanks
> > > > >
> > > > > On Fri, Oct 2, 2015 at 11:55 AM, Aman Sinha 
> > > wrote:
> > > > >
> > > > > > If no one has objections, I would like to merge Adam's patch for
> > > > > DRILL-2361
> > > > > > (Column aliases cannot include dots).  This has been reviewed and
> > > > tested.
> > > > > >
> > > > > > On Thu, Oct 1, 2015 at 6:58 PM, Abdel Hakim Deneche <
> > > > > adene...@maprtech.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > DRILL-1065 is ready to be merged, I'm running the final unit
> > tests
> > > > > right
> > > > > > > now.
> > > > > > >
> > > > > > > How about we set the cutoff to tomorrow 5pm pacific time ? any
> > > patch
> > > > > that
> > > > > > > isn't in master by then won't be part of 1.2 ?
> > > > > > >
> > > > > > > On Thu, Oct 1, 2015 at 6:35 PM, Jacques Nadeau <
> > jacq...@dremio.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > We went through the list. I propose that we include only the
> > > > > following
> > > > > > > > issues in 1.2 and move the rest to 1.3
> > > > > > > >
> > > > > > > > DRILL-1065 Provide a reset command to reset an option to its
> > > > default
> > > > > > > value
> > > > > > > > DRILL-3884 Hive native scan has lower parallelization leading
> > to
> > > > > > > > performance degradation
> > > > > > > > DRILL-2879 Add Remaining portion of "Drill extended json's
> > > support
> > > > > > $oid"
> > > > > > > > DRILL-3887 Parquet metadata cache not being used (regression
> > > caused
> > > > > by
> > > > > > > > DRILL-3788)
> > > > > > > >
> > > > > > > > What do other people think?
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Jacques Nadeau
> > > > > > > > CTO and Co-Founder, Dremio
> > > > > > > >
> > > > > > > > On Thu, Oct 1, 2015 at 5:46 PM, Jacques Nadeau <
> > > jacq...@dremio.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > I'm on:
> > > > > > > > >
> > > > > > > > >
> > > > >
> https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Jacques Nadeau
> > > > > > > > > CTO and Co-Founder, Dremio
> > > > > > > > >
> > > > > > > > > On Thu, Oct 1, 2015 at 5:45 PM, Jacques Nadeau <
> > > > jacq...@dremio.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> How about now?
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> --
> > > > > > > > >> Jacques Nadeau
> > > > > > > > >> CTO and Co-Founder, Dremio
> > > > > > > > >>
> > > > > > > > >> On Thu, Oct 1, 2015 at 5:38 PM, Parth Chandra <
> > > > par...@apache.org>
> > > > > > > > wrote:
> > > > > > > > >>
> > > > > > > > >>> What time do you want the hangout? I'm in favour of
> bumping
> > > if
> > > > we
> > > > > > > > cannot
> > > > > > > > >>> get decided by tomorrow.
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>> On Thu, Oct 1, 2015 at 2:05 PM, Jacques Nadeau <
> > > > > jacq...@dremio.com
> > > > > > >
> > > > > > > > >>> wrote:
> > > > > > > > >>>
> > > > > > > > >>> > On Thu, Oct 1, 2015 at 11:17 AM, Abdel Hakim Deneche <
> > > > > > > > >>> > adene...@maprtech.com>
> > > > > > > > >>> >  wrote:
> > > > > > > > >>> >
> > > > > > > > >>> > > I can manage the release.
> > > > > > > > >>> > >
> > > > 

[jira] [Created] (DRILL-3887) Parquet metadata cache not being used

2015-10-01 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-3887:
--

 Summary: Parquet metadata cache not being used
 Key: DRILL-3887
 URL: https://issues.apache.org/jira/browse/DRILL-3887
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips
Priority: Critical


The fix for DRILL-3788 causes a directory to be expanded to its list of files 
early in the query, and this change causes the ParquetGroupScan to no longer 
use the parquet metadata file, even when it is there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: STRING_BINARY's intended function?

2015-09-14 Thread Steven Phillips
It looks like the comments for that function are not correct. If you look
at the javadoc for the toBinaryString() method which gets called, you will
get the complete story.

In short, it prints the bytes that are printable, and prints a hex
representation for bytes that are not printable. This is modeled after the
hbase utilities.
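
For reference, a rough sketch of that behavior (modeled on HBase's
Bytes.toStringBinary(); not the exact Drill code):

static String toStringBinary(byte[] bytes) {
  StringBuilder sb = new StringBuilder();
  for (byte b : bytes) {
    int ch = b & 0xFF;
    if (ch >= 0x20 && ch < 0x7F && ch != '\\') {
      sb.append((char) ch);                     // printable ASCII passes through
    } else {
      sb.append(String.format("\\x%02X", ch));  // everything else becomes \xNN
    }
  }
  return sb.toString();
}

So a 0x00 byte in the input renders as \x00, while plain ASCII text comes
through unchanged.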

On Mon, Sep 14, 2015 at 2:56 PM, Daniel Barclay 
wrote:

> Is STRING_BINARY supposed to map all bytes or only certain bytes to
> hexadecimal?
>
> Daniel
> --
> Daniel Barclay
> MapR Technologies
>


Re: [What is the purpose of ExternalSort's MAX_SORT_BYTES

2015-08-27 Thread Steven Phillips
I think it probably isn't needed anymore. I believe it is a holdover from
before spilling was implemented. It doesn't seem to serve any purpose now.

On Thu, Aug 27, 2015 at 9:17 AM, Abdel Hakim Deneche adene...@maprtech.com
wrote:

 anyone ?

 On Tue, Aug 25, 2015 at 2:56 PM, Abdel Hakim Deneche 
 adene...@maprtech.com
 wrote:

  When running a window function query on large datasets,
  increasing planner.memory.max_query_memory_per_node can actually help the
  query not run out of memory. But in some cases this can cause some issues
  (see DRILL-3555 https://issues.apache.org/jira/browse/DRILL-3555)
 
  This seems to be caused by a hardcoded limit in ExternalSort called
  MAX_SORT_BYTES. What is the purpose of this limit ?
 
  --
 
  Abdelhakim Deneche
 
  Software Engineer
 
http://www.mapr.com/
 
 
  Now Available - Free Hadoop On-Demand Training
  
 http://www.mapr.com/training?utm_source=Emailutm_medium=Signatureutm_campaign=Free%20available
 
 



 --

 Abdelhakim Deneche

 Software Engineer

   http://www.mapr.com/


 Now Available - Free Hadoop On-Demand Training
 
 http://www.mapr.com/training?utm_source=Emailutm_medium=Signatureutm_campaign=Free%20available
 



Re: No of files created by CTAS auto partition feature

2015-08-26 Thread Steven Phillips
It would be helpful if you could figure out what the file count is. But
here are some thoughts:

What is the value of the option:
store.partition.hash_distribute

If it is false, which it is by default, then every fragment will
potentially have data in every partition. In this case, that could increase
the number of files by a factor of 8.
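
(For reference, it is a session option like the ones in your script:
alter session set `store.partition.hash_distribute` = true;)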

On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli 
challapallira...@gmail.com wrote:

 Drillers,

 I executed the below query on TPCH SF100 with drill and it took ~2hrs to
 complete on a 2 node cluster.

 alter session set `planner.width.max_per_node` = 4;
 alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
 create table lineitem partition by (l_shipdate, l_receiptdate) as select *
 from dfs.`/drill/testdata/tpch100/lineitem`;

 The below query returned 75780, so I expected drill to create the same no
 of files or may be a little more. But drill created so many files that a
 hadoop fs -count command failed with a GC overhead limit exceeded. (I
 did not change the default parquet block size)

 select count(*) from (select l_shipdate, l_receiptdate from
 dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, l_receiptdate)
 sub;
 +-+
 | EXPR$0  |
 +-+
 | 75780   |
 +-+


 Any thoughts on why drill is creating so many files?

 - Rahul



Re: zeroVectors() interface for value vectors

2015-08-26 Thread Steven Phillips
One possible exception to the access pattern occurs when vectors wrap other
vectors. Specifically, the offset vectors in Variable Length and Repeated
vectors. These vectors are accessed and mutated multiple times. If we are
going to implement strict enforcement, we need to consider that case.
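
For anyone following along, here is a minimal sketch of the
allocate -> mutate -> setValueCount -> access -> clear cycle discussed in
this thread (assumes `allocator` (BufferAllocator) and `field`
(MaterializedField) are already in scope; exact signatures vary a little
between Drill versions):

IntVector v = new IntVector(field, allocator);
v.allocateNew(1024);                  // allocate
for (int i = 0; i < 3; i++) {
  v.getMutator().setSafe(i, i * 10);  // mutate, writing values in order
}
v.getMutator().setValueCount(3);      // freeze the count before any reads
int second = v.getAccessor().get(1);  // access
v.clear();                            // release buffers / restart the cycle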

On Tue, Aug 25, 2015 at 7:15 PM, Jacques Nadeau jacq...@dremio.com wrote:

 Yes, by recommendation is to correct the usage in StreamingAggBatch

 --
 Jacques Nadeau
 CTO and Co-Founder, Dremio

 On Tue, Aug 25, 2015 at 4:52 PM, Abdel Hakim Deneche 
 adene...@maprtech.com
 wrote:

  I think zeroVector() is mainly used to fill the vector with zeros, which
 is
  fine if you call it while the vector is in mutate state, but
  StreamingAggBatch does actually call it after setting the value count of
  the value vector which is against the paradigm.
 
 
  On Tue, Aug 25, 2015 at 3:51 PM, Jacques Nadeau jacq...@dremio.com
  wrote:
 
   In all but one situations, this is an internal concern (making sure to
  zero
   out the memory).  For fixed width vectors, there is an assumption that
 an
   initial allocation is clean memory (e.g. all zeros in the faces of an
 int
   vector).  So this should be pulled off a public vector interface.  The
  one
   place where it is being used today is StreamingAggBatch and I think we
   should fix that to follow the state paradigm described above.
  
  
  
   --
   Jacques Nadeau
   CTO and Co-Founder, Dremio
  
   On Tue, Aug 25, 2015 at 3:41 PM, Abdel Hakim Deneche 
   adene...@maprtech.com
   wrote:
  
Another question: FixedWidthVector interface defines a zeroVector()
   method
that
Zero out the underlying buffer backing this vector according to
 it's
javadoc.
   
Where does this method fit in the value vector states described
  earlier ?
it doesn't clear the vector yet it doesn't reset everything to the
  after
allocate state.
   
On Tue, Aug 25, 2015 at 10:46 AM, Abdel Hakim Deneche 
adene...@maprtech.com
 wrote:
   
 One more question about the transition from allocate -> mutate. For
   Fixed
 width vectors and BitVector you can actually call setSafe() without
calling
 allocateNew() first and it will work. Should it throw an exception
instead
 ?
 not calling allocateNew() has side effects that could cause
 setSafe()
   to
 throw an OversizedAllocationException if you call setSafe() then
   clear()
 multiple times.

 On Tue, Aug 25, 2015 at 10:01 AM, Chris Westin 
   chriswesti...@gmail.com
 wrote:

 Maybe we should start by putting these rules in a comment in the
  value
 vector base interfaces? The lack of such information is why there
  are
 deviations and other expectations.

 On Tue, Aug 25, 2015 at 8:22 AM, Jacques Nadeau 
 jacq...@dremio.com
  
 wrote:

  There are a few unspoken rules around vectors:
 
  - values need to be written in order (e.g. index 0, 1, 2, 5)
  - null vectors start with all values as null before writing
  anything
  - for variable width types, the offset vector should be all
 zeros
before
  writing
  - you must call setValueCount before a vector can be read
  - you should never write to a vector once it has been read.
 
  The ultimate goal we should get to the point where you the
   interfaces
  guarantee this order of operation:
 
  allocate > mutate > setvaluecount > access > clear (or allocate
 to
start
  the process over, xxx).  Any deviation from this pattern should
   result
 in
  exception.  We should do this only in debug mode as this code is
 extremely
  performance sensitive.  Operations like transfer should be built
  on
top
 of
  this state model.  (In that case, it would mean src moves to
 clear
state
  and target moves to access state.  It also means that transfer
   should
 only
  work in access state.)
 
  If we need special purpose data structures that don't operate in
   these
  ways, we should make sure to keep them separate rather than
 trying
   to
  accommodate a deviation from this pattern in the core vector
 code.
 
  I wrote xxx above because I see the purpose of zeroVectors as
  being
   a
 reset
  on the vector state back to the original state.  Maybe we should
 actually
  call it 'reset' rather than 'zeroVectors'.  This would basically
   pick
 up at
  mutate mode again.
 
  Since these rules were never formalized, I'm sure there are a
 few
places
  where we currently deviate.  We should enforce these rules and
  then
get
  those issues fixed.
 
 
 
  --
  Jacques Nadeau
  CTO and Co-Founder, Dremio
 
  On Tue, Aug 25, 2015 at 8:02 AM, Abdel Hakim Deneche 
  adene...@maprtech.com
  wrote:
 
   Another important point to keep in mind here:
 ValueVectorWriteExpression
   operates under 

Re: No of files created by CTAS auto partition feature

2015-08-26 Thread Steven Phillips
That's not really how it works. The only spilling to disk occurs during
External Sort, and the spill files are not created based on partition.

What makes you think it is spilling prematurely?

On Wed, Aug 26, 2015 at 5:15 PM, rahul challapalli 
challapallira...@gmail.com wrote:

 Steven, Jason :

 Below is my understanding of when we should spill to disk while performing
 a sort. Let me know if I am missing anything

 alter session set `planner.width.max_per_node` = 4;
 alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
 (~8GB)
 create table lineitem partition by (l_shipdate, l_receiptdate) as select *
 from dfs.`/drill/testdata/tpch100/lineitem`;

 1. The above query creates 4 minor fragments and each minor fragment gets
 ~2GB for the sort phase.
 2. Once a minor fragment consumes ~2GB of memory, it starts spilling each
 partition into a separate file to disk
 3. The spilled files would be of different sizes.
 4. Now if it is a regular CTAS (with no partition by clause), each spilled
 file should be approximately ~2GB in size

 I just have a hunch that we are spilling a little early :)

 - Rahul


 On Wed, Aug 26, 2015 at 4:49 PM, rahul challapalli 
 challapallira...@gmail.com wrote:

  Jason,
 
  What you described is exactly my understanding.
 
  I did kickoff a run after setting `store.partition.hash_distribute`. It
 is
  still running. I am expecting the no of files to be slightly more than or
  equal to 75780. (As the default parquet block size should be sufficient
 for
  most of the partitions)
 
  - Rahul
 
 
 
  On Wed, Aug 26, 2015 at 4:36 PM, Jason Altekruse 
 altekruseja...@gmail.com
   wrote:
 
  I feel like there is a little misunderstanding here.
 
  Rahul, did you try setting the option that Steven suggested?
  `store.partition.hash_distribute`
 
  This will cause a re-distribution of the data so that the rows that
 belong
  in a particular partition will all be written by a single writer. They
  will
  not necessarily be all in one file, as we have a limit on file sizes
 and I
  don't think we cap partition size.
 
  The default behavior is not to re-distribute, because it is expensive.
  This
  however means that every fragment will write out a file for whichever
 keys
  appear in the data that ends up at that fragment.
 
  If there is a large number of fragments and the data is spread out
 pretty
  randomly, then there is a reasonable case for turning on this option to
  co-locate data in a single partition to a single writer to reduce the
  number of smaller files. There is no magic formula for when it is best
 to
  turn on this option, but in most cases it will reduce the number of
 files
  produced.
 
 
 
  On Wed, Aug 26, 2015 at 3:48 PM, rahul challapalli 
  challapallira...@gmail.com wrote:
 
   Well this for generating some testdata
  
   - Rahul
  
   On Wed, Aug 26, 2015 at 3:47 PM, Andries Engelbrecht 
   aengelbre...@maprtech.com wrote:
  
Looks like Drill is doing the partitioning as requested then. May
 not
  be
optimal though.
   
Is there a reason why you want to subpartition this much? You may be
better off to just partition by l_shipdate (not shipmate, autocorrect
  got
   me
there). Or use columns with much lower cardinality to test
   subpartitioning.
   
—Andries
   
   
 On Aug 26, 2015, at 3:05 PM, rahul challapalli 
challapallira...@gmail.com wrote:

 Steven,

 You were right. The count is 606240 which is 8*75780.


 Stefan  Andries,

 Below is the distinct count or cardinality

 select count(*) from (select l_shipdate, l_receiptdate from
 dfs.`/drill/testdata/tpch100/
 lineitem` group by l_shipdate, l_receiptdate) sub;
 +-+
 | EXPR$0  |
 +-+
 | 75780   |
 +-+

 - Rahul





 On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht 
 aengelbre...@maprtech.com wrote:

 What is the distinct count for this columns? IIRC TPC-H has at
  least 5
 years of data irrespective of SF, so you are requesting a lot of
 partitions. 76K sounds about right for 5 years of TPCH shipmate
 and
 correlating receipt date data, your query doesn’t count the
 actual
files.

 Try to partition just on the shipmate column first.

 —Andries


 On Aug 26, 2015, at 12:34 PM, Stefán Baxter 
   ste...@activitystream.com

 wrote:

 Hi,

 Is it possible that the combination values of  (l_shipdate,
 l_receiptdate) have a very high cardinality?
 I would think you are creating partition files for a small
 subset
  of
the
 data.

 Please keep in mind that I know nothing about TPCH SF100 and
 only
  a
 little
 about Drill :).

 Regards,
 -Stefan

 On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips 
 s...@apache.org
wrote:

 It would be helpful if you could figure out what the file count

Re: New JIRA Python tool

2015-08-22 Thread Steven Phillips
The general pattern we have adopted in the Drill community is to format
the commit message like this:

DRILL-jira number: Description of what was fixed
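(for example, "DRILL-3393: Fix bug with quotes in TSV files")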

As long as you follow that pattern, I don't think there are really any
other expectations for making the pull request.

On Sat, Aug 22, 2015 at 11:24 AM, Edmon Begoli ebeg...@gmail.com wrote:

 Sounds good. Two related questions:


 1. Are there any special procedures regarding the pull request, referencing
 issue in a commit messages, etc.?

 2. Once I figure out the new JIRA Python tool use, how do I submit the
 updates for the Drill contribution and patching documentation?
 Is web documentation also maintained under the repo?

 Thank you,
 Edmon

 On Saturday, August 22, 2015, Hsuan-Yi Chu hsua...@usc.edu wrote:

  Hi Edmon,
  Thanks for bringing this up. I just tried, and easy_install does not work
  on my laptop either.
 
  From my experience, for the purpose of requesting reviews/submitting
  patches, you could send pull request on github. That might be the most
  common way people are using now.
 
  For the documentation, I think update the correct information is good
 idea
  too.
 
  On Sat, Aug 22, 2015 at 10:50 AM, Edmon Begoli ebeg...@gmail.com
  javascript:; wrote:
 
   What is the suitable replacement for the JIRA Python tool
   (jira-python) still specified on the contribution web site?
   https://drill.apache.org/docs/drill-patch-review-tool/
  
   For me, easy_install is not finding jira-python library.
  
   It looks like this is the right tool:
   http://pythonhosted.org/jira/
  
   Which is installed as just jira with pip, but it looks like the setup
 for
   patch submission might be slightly different.
  
   If I am seeing this right and the new tool is needed, we should
 probably
   update the documentation (I will be happy to do so).
  
   Thanks,
   Edmon
  
 



Re: Hash Agg vs Streaming Agg for a smaller data set

2015-07-10 Thread Steven Phillips
My guess is that in the second query, the size of the dataset is smaller,
and this causes the cost of sorting to be small enough that it is cheaper
than the HashAgg.

On Fri, Jul 10, 2015 at 4:27 PM, rahul challapalli 
challapallira...@gmail.com wrote:

 Hi,

 Info about Data : The data is auto partitioned tpch 0.01 data. The second
 filter is on a non-partitioned column, so in the first case the 'OR' predicate
 results in a full-table scan, while in the second case, partition pruning
 takes effect.

 The first case results in a hash agg and the second case in a streaming
 agg. Any idea why?

 1. explain plan for select distinct l_modline, l_moddate from
 `tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
 '1992-01-01' or l_shipdate=date'1992-01-01';
 +--+--+
 | text | json |
 +--+--+
 | 00-00Screen
 00-01  Project(l_modline=[$0], l_moddate=[$1])
 00-02Project(l_modline=[$0], l_moddate=[$1])
 00-03  HashAgg(group=[{0, 1}])
 00-04Project(l_modline=[$2], l_moddate=[$0])
 00-05  SelectionVectorRemover
 00-06Filter(condition=[OR(=($0, 1992-01-01), =($1,
 1992-01-01))])
 00-07  Project(l_moddate=[$2], l_shipdate=[$1],
 l_modline=[$0])
 00-08Scan..

 2. explain plan for select distinct l_modline, l_moddate from
 `tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
 '1992-01-01' and l_shipdate=date'1992-01-01';
 +--+--+
 | text | json |
 +--+--+
 | 00-00Screen
 00-01  Project(l_modline=[$0], l_moddate=[$1])
 00-02Project(l_modline=[$0], l_moddate=[$1])
 00-03  StreamAgg(group=[{0, 1}])
 00-04Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
 00-05  Project(l_modline=[$2], l_moddate=[$0])
 00-06SelectionVectorRemover
 00-07  Filter(condition=[AND(=($0, 1992-01-01), =($1,
 1992-01-01))])
 00-08Project(l_moddate=[$2], l_shipdate=[$1],
 l_modline=[$0])
 00-09  Scan.

 - Rahul




-- 
 Steven Phillips
 Software Engineer

 mapr.com


[jira] [Created] (DRILL-3487) MaterializedField equality doesn't check if nested fields are equal

2015-07-09 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-3487:
--

 Summary: MaterializedField equality doesn't check if nested fields 
are equal
 Key: DRILL-3487
 URL: https://issues.apache.org/jira/browse/DRILL-3487
 Project: Apache Drill
  Issue Type: Bug
  Components: Metadata
Reporter: Steven Phillips
Assignee: Hanifi Gunes


In several places, we use BatchSchema.equals() to determine if two schemas are 
the same. A BatchSchema is a set of MaterializedField objects. But ever since 
DRILL-1872, the child fields are no longer checked.

What this means, essentially, is that BatchSchema.equals() is not valid for 
determining schema changes if the batch contains any nested fields.
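
A sketch of the kind of recursive check that is needed (hypothetical helper, not 
the actual fix; it also assumes both children collections iterate in the same 
order):

{code}
import java.util.Iterator;
import org.apache.drill.exec.record.MaterializedField;

static boolean deepEquals(MaterializedField a, MaterializedField b) {
  if (!a.equals(b)) {
    return false;                         // shallow: name and major type only
  }
  if (a.getChildren().size() != b.getChildren().size()) {
    return false;
  }
  Iterator<MaterializedField> bChildren = b.getChildren().iterator();
  for (MaterializedField aChild : a.getChildren()) {
    if (!deepEquals(aChild, bChildren.next())) {
      return false;                       // recurse into nested fields
    }
  }
  return true;
}
{code}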



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 34374: DRILL-3133: MergingRecordBatch can leak memory if query is canceled before batches in rawBatches were loaded

2015-07-08 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34374/#review90869
---

Ship it!



exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/mergereceiver/MergingRecordBatch.java
 (line 307)
https://reviews.apache.org/r/34374/#comment144026

We should remove this if block altogether. It's clearly not doing anything.



exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/mergereceiver/MergingRecordBatch.java
 (line 339)
https://reviews.apache.org/r/34374/#comment144025

It probably makes sense to release the batch there, but it's not necessary 
because the RecordBatchLoader releases the buffers when it loads the new ones, 
or when close() is called. So there is no memory leak here.


- Steven Phillips


On May 28, 2015, 11:54 a.m., abdelhakim deneche wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34374/
 ---
 
 (Updated May 28, 2015, 11:54 a.m.)
 
 
 Review request for drill and Steven Phillips.
 
 
 Bugs: DRILL-3133
 https://issues.apache.org/jira/browse/DRILL-3133
 
 
 Repository: drill-git
 
 
 Description
 ---
 
 MergingRecordBatch stores batches in an array list before loading them with 
 RecordBatchLoader. If the query is canceled before all received batches are 
 loaded, some of the batches won't be cleaned up.
 
 lines 307 and 339 contain questions to the reviewers. I will update the patch 
 accordingly
 
 
 Diffs
 -
 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/mergereceiver/MergingRecordBatch.java
  baf9bda 
 
 Diff: https://reviews.apache.org/r/34374/diff/
 
 
 Testing
 ---
 
 all unit tests are passing along with functional and tpch100
 
 
 Thanks,
 
 abdelhakim deneche
 




[jira] [Created] (DRILL-3477) Using IntVector for null expressions causes problems with implicit cast

2015-07-08 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-3477:
--

 Summary: Using IntVector for null expressions causes problems with 
implicit cast
 Key: DRILL-3477
 URL: https://issues.apache.org/jira/browse/DRILL-3477
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips
Assignee: Steven Phillips


See DRILL-3353, for example.

A simple example is this:

{code}
select * from t where a = 's';
{code}

If the first batch scanned from table t does not contain the column a, the 
expression materializer in Project defaults to Nullable Int as the type. The 
Filter then sees an Equals expression between a VarChar and an Int type, so it 
does an implicit cast. Implicit cast rules give Int higher precedence, so the 
literal 's' is cast to Int, which ends up throwing a NumberFormatException.
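
(Illustrative: the failing step behaves like Integer.parseInt("s"), which throws 
java.lang.NumberFormatException: For input string: "s".)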

In the class ResolverTypePrecedence, we see that Null type has the lowest 
precedence, which makes sense. But since we don't actually currently have an 
implementation for NullVector, we should materialize the Null type as the 
Vector with the lowest possible precedence, which is VarBinary.

My suggestion is that we should use VarBinary as the default type in 
ExpressionMaterializer instead of Int.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Review Request 36229: Patch for DRILL-1750

2015-07-06 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36229/
---

Review request for drill.


Bugs: DRILL-1750
https://issues.apache.org/jira/browse/DRILL-1750


Repository: drill-git


Description
---

DRILL-1750: Set value count on all outgoing vectors in ScanBatch


Diffs
-

  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java 
6bf1280ae09045a4d73d566c25d624acced6a68d 
  
exec/java-exec/src/test/java/org/apache/drill/exec/store/json/TestJsonRecordReader.java
 bb1af9eb2e6ab4950c166b8057680fff175c7a3f 
  exec/java-exec/src/test/resources/jsoninput/1750/a.json PRE-CREATION 
  exec/java-exec/src/test/resources/jsoninput/1750/b.json PRE-CREATION 

Diff: https://reviews.apache.org/r/36229/diff/


Testing
---


Thanks,

Steven Phillips



Re: Review Request 36229: Patch for DRILL-1750

2015-07-06 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36229/
---

(Updated July 6, 2015, 11:39 p.m.)


Review request for drill.


Bugs: DRILL-1750
https://issues.apache.org/jira/browse/DRILL-1750


Repository: drill-git


Description
---

DRILL-1750: Set value count on all outgoing vectors in ScanBatch


Diffs (updated)
-

  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java 
6bf1280ae09045a4d73d566c25d624acced6a68d 
  
exec/java-exec/src/test/java/org/apache/drill/exec/store/json/TestJsonRecordReader.java
 bb1af9eb2e6ab4950c166b8057680fff175c7a3f 
  exec/java-exec/src/test/resources/jsoninput/1750/a.json PRE-CREATION 
  exec/java-exec/src/test/resources/jsoninput/1750/b.json PRE-CREATION 

Diff: https://reviews.apache.org/r/36229/diff/


Testing
---


Thanks,

Steven Phillips



Review Request 36222: Patch for DRILL-3393

2015-07-06 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36222/
---

Review request for drill.


Bugs: DRILL-3393
https://issues.apache.org/jira/browse/DRILL-3393


Repository: drill-git


Description
---

DRILL-3393: Fix bug with quotes in TSV files


Diffs
-

  
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextReader.java
 38995093ed3cf60a9e84a95173bfb85611145f28 
  
exec/java-exec/src/test/java/org/apache/drill/exec/store/text/TestNewTextReader.java
 76674f97f92ddc3e26e9a3789212c1b7708ec770 
  exec/java-exec/src/test/resources/textinput/input3.tsv PRE-CREATION 

Diff: https://reviews.apache.org/r/36222/diff/


Testing
---


Thanks,

Steven Phillips



Review Request 36233: Patch for DRILL-3202

2015-07-06 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36233/
---

Review request for drill.


Bugs: DRILL-3202
https://issues.apache.org/jira/browse/DRILL-3202


Repository: drill-git


Description
---

DRILL-3202: Handle outer array in CountingJsonReader


Diffs
-

  
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONRecordReader.java
 dfc4f3af9b3aa12e0951eb00167a993c7ad06148 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/reader/BaseJsonProcessor.java
 78336315259e77df01738c27fd0779170c31e5fd 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/reader/CountingJsonReader.java
 c4ab1eea0543fe536b18755392a5c087fb8eaded 
  
exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/fn/JsonReader.java
 5c03c0281981431fe9f4f1807eb91af10f26747d 
  
exec/java-exec/src/test/java/org/apache/drill/exec/store/json/TestJsonRecordReader.java
 bb1af9eb2e6ab4950c166b8057680fff175c7a3f 
  exec/java-exec/src/test/resources/jsoninput/countOuterArray.json PRE-CREATION 

Diff: https://reviews.apache.org/r/36233/diff/


Testing
---


Thanks,

Steven Phillips



[jira] [Resolved] (DRILL-1816) Scan Error with JSON on large no of records with Complex Types

2015-07-06 Thread Steven Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Phillips resolved DRILL-1816.

Resolution: Cannot Reproduce

 Scan Error with JSON on large no of records with Complex Types
 --

 Key: DRILL-1816
 URL: https://issues.apache.org/jira/browse/DRILL-1816
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - JSON
Reporter: Rahul Challapalli
Assignee: Steven Phillips
 Fix For: 1.2.0

 Attachments: complex.log


 git.commit.id.abbrev=4a4f54a
 Memory Settings
 {code}
 DRILL_MAX_DIRECT_MEMORY=32G
 DRILL_MAX_HEAP=4G
 {code}
 Dataset :
 {code}
 {
   "data" : {
     "col1" : {
       "one" : [1, 2, 3, 4],
       "two" : [ {"a" : "b"}, {"c" : "d"} ]
     }
   }
 }
 {code}
 The below query works fine for the above record. However, if we copy the same 
 record 100,000 times, it fails with an IOOB exception
 {code}
 select data from `json_kvgenflatten/kvgen-complex-large.json`;
 {code}
 Attached the logs. Let me know if you need anything more.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (DRILL-2760) Quoted strings from CSV file appear in query output in different forms

2015-07-06 Thread Steven Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Phillips resolved DRILL-2760.

Resolution: Fixed

Fixed before 48d8a59

 Quoted strings from CSV file appear in query output in different forms
 --

 Key: DRILL-2760
 URL: https://issues.apache.org/jira/browse/DRILL-2760
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Text  CSV
Affects Versions: 0.9.0
 Environment: | 9d92b8e319f2d46e8659d903d355450e15946533 | DRILL-2580: 
 Exit early from HashJoinBatch if build side is empty | 26.03.2015 @ 16:13:53 
 EDT
 4 node cluster on CentOS
Reporter: Khurram Faraaz
Assignee: Steven Phillips
 Fix For: 1.2.0


 Quoted strings appear in query output in different forms, as shown in the 
 section below.
 Quotes should NOT appear in query output. Strings must be stripped of their 
 leading and trailing quotes. (I am referring to the double-quote character, ".)
 Snippet of data from airports.cv file, first three lines, the first line has 
 header information.
 {code}
 [root@centos-01 airport_CSV_data]# head -3 airports.csv
 id,ident,type,name,latitude_deg,longitude_deg,elevation_ft,continent,iso_country,iso_region,municipality,scheduled_service,gps_code,iata_code,local_code,home_link,wikipedia_link,keywords
 6523,00A,heliport,Total Rf 
 Heliport,40.07080078125,-74.9336013793945,11,NA,US,US-PA,Bensalem,no,00A,,00A,,,
 6524,00AK,small_airport,Lowell 
 Field,59.94919968,-151.695999146,450,NA,US,US-AK,Anchor 
 Point,no,00AK,,00AK,,,
 {code}
 case 1) In this case quotes are not escaped, they appear in the output as is.
 {code}
 0: jdbc:drill: select columns[0] id,columns[1] ident,columns[2] 
 type,columns[3] name,columns[4] latitude_deg,columns[5] 
 longitude_deg,columns[6] elevation_ft,columns[7] continent,columns[8] 
 iso_country,columns[9] iso_region,columns[10] municipality,columns[11] 
 scheduled_service,columns[12] gps_code,columns[13] iata_code, columns[14] 
 local_code,columns[15] home_link,columns[16] wikipedia_link,columns[17] 
 keywords from `airports.csv` limit 3;
 +++++--+---+--++-++--+---+++++++
 | id |   ident|type|name| latitude_deg | 
 longitude_deg | elevation_ft | continent  | iso_country | iso_region | 
 municipality | scheduled_service |  gps_code  | iata_code  | local_code | 
 home_link  | wikipedia_link |  keywords  |
 +++++--+---+--++-++--+---+++++++
 | id   | ident| type | name | latitude_deg | 
 longitude_deg | elevation_ft | continent | iso_country | iso_region 
 | municipality | scheduled_service | gps_code | iata_code | 
 local_code | home_link | wikipedia_link | keywords |
 | 6523   | 00A  | heliport | Total Rf Heliport | 40.07080078125 
 | -74.9336013793945 | 11   | NA   | US| US-PA| 
 Bensalem   | no  | 00A  || 00A  | 
|| null   |
 | 6524   | 00AK | small_airport | Lowell Field | 59.94919968  | 
 -151.695999146 | 450  | NA   | US| US-AK| 
 Anchor Point | no  | 00AK || 00AK |   
  || null   |
 +++++--+---+--++-++--+---+++++++
 3 rows selected (0.155 seconds)
 {code}
 In this case quotes appear in the query output but they are escaped with 
 backslash character in the output.
 {code}
 0: jdbc:drill: select * from `airports.csv` limit 3;
 ++
 |  columns   |
 ++
 | 
 [\id\,\ident\,\type\,\name\,\latitude_deg\,\longitude_deg\,\elevation_ft\,\continent\,\iso_country\,\iso_region\,\municipality\,\scheduled_service\,\gps_code\,\iata_code\,\local_code\,\home_link\,\wikipedia_link\,\keywords\]
  |
 | [6523,\00A\,\heliport\,\Total Rf 
 Heliport\,40.07080078125,-74.9336013793945,11,\NA\,\US\,\US-PA\,\Bensalem\,\no\,\00A\,,\00A\,,]
  |
 | [6524,\00AK\,\small_airport\,\Lowell 
 Field\,59.94919968,-151.695999146,450,\NA\,\US\,\US-AK\,\Anchor
  Point\,\no\,\00AK\,,\00AK\,,] |
 ++
 3 rows selected (0.097 seconds)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Time for a 1.1 vote soon?

2015-06-30 Thread Steven Phillips
I just had a clean run on my Linux machine.

On Tue, Jun 30, 2015 at 9:32 PM, Parth Chandra pchan...@maprtech.com
wrote:

 I just completed a clean build and test run (  mvn clean install) on my
 Mac.
 Will try on Linux.




 On Tue, Jun 30, 2015 at 9:24 PM, Jacques Nadeau jacq...@apache.org
 wrote:

  I ran twice more.  So here are my three results:
 
  1. Errors from previous email.
  2. Tests hung indefinitely (mvn clean install).
  3. I had one test failure, (mvn clean; mvn install),
 
  Tests in error:
TestSpoolingBuffer.testMultipleExchangesSingleThread:50 »  test timed
 out
  afte...
 
  Is anybody having consistent completions all the way through to the
  distribution module?
 
 
 
  On Tue, Jun 30, 2015 at 8:18 PM, Aman Sinha asi...@maprtech.com wrote:
 
   I re-ran on my mac and don't see the failures your are seeing.  I got
 one
   error below related to zookeeper but I believe this is intermittent.
  
   $mvn install
   
  
   Tests in error:
 TestPStoreProviders.verifyZkStore:55 » Runtime Failure while
 accessing
   Zookeep...
  
   Tests run: 1310, Failures: 0, Errors: 1, Skipped: 114
  
   $ java -version
   java version 1.7.0_45
   Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
   Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)
  
   On Tue, Jun 30, 2015 at 6:47 PM, Jacques Nadeau jacq...@apache.org
   wrote:
  
I'm seeing failures running the build on master on a mac:
   
$ java -version
java version 1.7.0_80
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
   
$mvn install
   
...
   
Failed tests:
  TestDrillbitResilience.memoryLeaksWhenCancelled:890 We are leaking 1812
bytes expected:<0> but was:<1812>
   
Tests in error:
   
  
  TestMergingReceiver.twoBitTwoExchange:84-Object.wait:503-Object.wait:-2
»  t...
  TestMergingReceiver.testMultipleProvidersMixedSizes:98 »  test
 timed
   out
after...
  TestSpoolingBuffer.testMultipleExchangesSingleThread:50 »  test
 timed
   out
afte...
  TestJoinNullable.testMergeLOJNullableOneOrderedInputDescNullsLast »
UserRemote
   
   
   
  
 
 TestUnionAll.testFilterPushDownOverUnionAll:545-BaseTestQuery.testSqlWithResults:265-BaseTestQuery.testRunAndReturn:278
»
  TestUnionAllBaseTestQuery.closeClient:233 » IllegalState Attempted
  to
close a...
   
On Tue, Jun 30, 2015 at 6:05 PM, Jacques Nadeau jacq...@apache.org
wrote:
   
 Agreed.  I'll spin a release.

 On Tue, Jun 30, 2015 at 6:01 PM, Parth Chandra par...@apache.org
wrote:

 Hey guys,

   Looks like 1.1 is looking fairly good with about 119 issues
  fixed. I
 would recommend we start the release process for 1.1.

 Parth

 On Fri, Jun 26, 2015 at 8:49 AM, Jacques Nadeau 
 jacq...@apache.org
  
 wrote:

  Hey Guys,
 
  Looks like a number things are being wrapped up so it is
 probably
about
  time for a 1.1 release. Shall we branch in the next day or two
 and
   put
 1.1
  to a vote?
 
  Jacques
 



   
  
 




-- 
 Steven Phillips
 Software Engineer

 mapr.com


Re: Review Request 36019: Patch for DRILL-3418

2015-06-29 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36019/
---

(Updated June 30, 2015, 12:09 a.m.)


Review request for drill.


Bugs: DRILL-3418
https://issues.apache.org/jira/browse/DRILL-3418


Repository: drill-git


Description
---

DRILL-3414: Make sure to walk entire expression tree when rewriting filter 
expression for pruning


Diffs (updated)
-

  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/RewriteAsBinaryOperators.java
 44b9a3a8fbd22744f98b9b4b64c9b7aceae7587a 
  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/RewriteCombineBinaryOperators.java
 247ad8f0fa883f6c765d94edac58d1b5e2193ddb 
  exec/java-exec/src/test/java/org/apache/drill/TestCTASPartitionFilter.java 
48d7cebb26d2bf08baff39d6232e4829bd98d648 

Diff: https://reviews.apache.org/r/36019/diff/


Testing
---


Thanks,

Steven Phillips



[jira] [Resolved] (DRILL-3410) Partition Pruning : We are doing a prune when we shouldn't

2015-06-29 Thread Steven Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Phillips resolved DRILL-3410.

Resolution: Fixed

Fixed by c1998605dc2acdc5fa55792a279a473ff890a010

 Partition Pruning : We are doing a prune when we shouldn't
 --

 Key: DRILL-3410
 URL: https://issues.apache.org/jira/browse/DRILL-3410
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning  Optimization
Reporter: Rahul Challapalli
Assignee: Steven Phillips
Priority: Critical
 Fix For: 1.1.0

 Attachments: DRILL-3410.patch, DRILL-3410_part2.patch, 
 DRILL-3410_part2.patch, DRILL-3410_part2.patch


 git.commit.id.abbrev=60bc945
 The below plan does not look right. It should scan all the files based on the 
 filters in the query. Also hive returned more rows than drill
 {code}
 explain plan for select * from `existing_partition_pruning/lineitempart` 
 where (dir0=1993 and columns[0] > 29600) or (dir0=1994 or columns[0] > 29700);
 | 00-00    Screen
 00-01      Project(*=[$0])
 00-02        Project(T70¦¦*=[$0])
 00-03          SelectionVectorRemover
 00-04            Filter(condition=[OR(AND(=($1, 1993), >(ITEM($2, 0), 29600)), =($1, 1994), >(ITEM($2, 0), 29700))])
 00-05              Project(T70¦¦*=[$0], dir0=[$1], columns=[$2])
 00-06                Scan(groupscan=[ParquetGroupScan 
 [entries=[ReadEntryWithPath 
 [path=/drill/testdata/ctas_auto_partition/existing_partition_pruning/lineitempart/0_0_3.parquet],
  ReadEntryWithPath 
 [path=/drill/testdata/ctas_auto_partition/existing_partition_pruning/lineitempart/0_0_4.parquet]],
  
 selectionRoot=/drill/testdata/ctas_auto_partition/existing_partition_pruning/lineitempart,
  numFiles=2, columns=[`*`]]])
  |
 {code}
 I attached the data set used. Let me know if you need anything more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Review Request 36019: Patch for DRILL-3418

2015-06-29 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36019/
---

Review request for drill.


Bugs: DRILL-3418
https://issues.apache.org/jira/browse/DRILL-3418


Repository: drill-git


Description
---

DRILL-3414: Make sure to walk entire expression tree when rewriting filter 
expression for pruning


Diffs
-

  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/RewriteAsBinaryOperators.java
 44b9a3a8fbd22744f98b9b4b64c9b7aceae7587a 
  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/RewriteCombineBinaryOperators.java
 247ad8f0fa883f6c765d94edac58d1b5e2193ddb 
  exec/java-exec/src/test/java/org/apache/drill/TestCTASPartitionFilter.java 
48d7cebb26d2bf08baff39d6232e4829bd98d648 

Diff: https://reviews.apache.org/r/36019/diff/


Testing
---


Thanks,

Steven Phillips



Review Request 35973: Patch for DRILL-3410

2015-06-27 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35973/
---

Review request for drill.


Bugs: DRILL-3410
https://issues.apache.org/jira/browse/DRILL-3410


Repository: drill-git


Description
---

DRILL-3410: rewrite OR and AND operators to have only 2 operands so partition 
pruning will work correctly


Diffs
-

  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java
 ae183313f5bcfc394ffb61c766562f9102cf5a87 
  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/RewriteAsBinaryOperators.java
 PRE-CREATION 
  exec/java-exec/src/test/java/org/apache/drill/TestCTASPartitionFilter.java 
3943426ed7763986f131dfc2428b994935ced305 

Diff: https://reviews.apache.org/r/35973/diff/


Testing
---


Thanks,

Steven Phillips



Re: Review Request 35973: Patch for DRILL-3410

2015-06-27 Thread Steven Phillips


 On June 27, 2015, 9:35 p.m., Aman Sinha wrote:
  exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java,
   line 342
  https://reviews.apache.org/r/35973/diff/1/?file=993957#file993957line342
 
  Interestingly, when DrillOptiq.visitCall() processes the operator, it 
  appears as BINARY even though it could have more than 2 inputs.  One 
  could enhance the PruneScanRule to handle N input boolean operators 
  (without the RewriteAsBinaryOperators) or one could potentially move the 
  RewriteAsBinaryOperators to DrillOptiq such that the rewrite is done 
  up-front.  However, these can be future tasks to consider, not an issue for 
  this patch.

Your suggestions make sense, but in this case, I wanted to make the simplest, 
least-impactful change possible, at least for now.


- Steven


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35973/#review89636
---


On June 27, 2015, 6:30 p.m., Steven Phillips wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/35973/
 ---
 
 (Updated June 27, 2015, 6:30 p.m.)
 
 
 Review request for drill.
 
 
 Bugs: DRILL-3410
 https://issues.apache.org/jira/browse/DRILL-3410
 
 
 Repository: drill-git
 
 
 Description
 ---
 
 DRILL-3410: rewrite OR and AND operators to have only 2 operands so 
  partition pruning will work correctly
 
 
 Diffs
 -
 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java
  ae183313f5bcfc394ffb61c766562f9102cf5a87 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/RewriteAsBinaryOperators.java
  PRE-CREATION 
   exec/java-exec/src/test/java/org/apache/drill/TestCTASPartitionFilter.java 
 3943426ed7763986f131dfc2428b994935ced305 
 
 Diff: https://reviews.apache.org/r/35973/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Steven Phillips
 




[jira] [Resolved] (DRILL-3376) Reading individual files created by CTAS with partition causes an exception

2015-06-26 Thread Steven Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Phillips resolved DRILL-3376.

Resolution: Fixed

Fixed by 5f0e4cbd0f49600c41abf38056bcd29849c5cdf9

 Reading individual files created by CTAS with partition causes an exception
 ---

 Key: DRILL-3376
 URL: https://issues.apache.org/jira/browse/DRILL-3376
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Writer
Affects Versions: 1.1.0
Reporter: Parth Chandra
Assignee: Steven Phillips
 Fix For: 1.1.0


 Create a table using CTAS with partitioning:
 {code}
 create table `lineitem_part` partition by (l_moddate) as select l.*, 
 l_shipdate - extract(day from l_shipdate) + 1 l_moddate from 
 cp.`tpch/lineitem.parquet` l
 {code}
 Then the following query causes an exception
 {code}
 select distinct l_moddate from `lineitem_part/0_0_1.parquet` where l_moddate 
 = date '1992-01-01';
 {code}
 Trace in the log file - 
 {panel}
 Caused by: java.lang.StringIndexOutOfBoundsException: String index out of 
 range: 0
 at java.lang.String.charAt(String.java:658) ~[na:1.7.0_65]
 at 
 org.apache.drill.exec.planner.logical.partition.PruneScanRule$PathPartition.init(PruneScanRule.java:493)
  ~[drill-java-exec-1.1.0-SNAPSHOT.jar:1.1.0-SNAPSHOT]
 at 
 org.apache.drill.exec.planner.logical.partition.PruneScanRule.doOnMatch(PruneScanRule.java:385)
  ~[drill-java-exec-1.1.0-SNAPSHOT.jar:1.1.0-SNAPSHOT]
 at 
 org.apache.drill.exec.planner.logical.partition.PruneScanRule$4.onMatch(PruneScanRule.java:278)
  ~[drill-java-exec-1.1.0-SNAPSHOT.jar:1.1.0-SNAPSHOT]
 at 
 org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:228)
  ~[calcite-core-1.1.0-drill-r9.jar:1.1.0-drill-r9]
 ... 13 common frames omitted
 {panel}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-3402) Throw exception when attempting to partition for formats that don't support it

2015-06-26 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-3402:
--

 Summary: Throw exception when attempting to partition for formats 
that don't support it
 Key: DRILL-3402
 URL: https://issues.apache.org/jira/browse/DRILL-3402
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips


CTAS auto-partitioning only works with Parquet output, so we need to make sure 
we catch it if the output format is set to something other than Parquet. Since 
CTAS is only supported for the FileSystem storage, that means we only have to 
handle it for the various FormatPlugins.

I will add a method to the FormatPlugin interface, supportAutoPartitioning(), 
which will indicate whether it is supported. If it is not supported, and the 
statement contains a partition clause, it will throw an exception.
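
A rough sketch of the shape (only supportAutoPartitioning() comes from this 
issue; the handler snippet and the UserException wording are illustrative):

{code}
// Added to FormatPlugin:
boolean supportAutoPartitioning();

// In the CTAS handler, before creating the writer (partitionColumns is a
// stand-in for however the handler exposes the PARTITION BY list):
if (!partitionColumns.isEmpty() && !formatPlugin.supportAutoPartitioning()) {
  throw UserException.unsupportedError()
      .message("Auto-partitioning is not supported for the selected output format")
      .build(logger);
}
{code}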



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Review Request 35941: Patch for DRILL-3402

2015-06-26 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35941/
---

Review request for drill.


Bugs: DRILL-3402
https://issues.apache.org/jira/browse/DRILL-3402


Repository: drill-git


Description
---

DRILL-3402: Throw exception when attempting to partition in a format that doesn't 
support it


Diffs
-

  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/FileSystemCreateTableEntry.java
 672092d09a2bde5aba65b2c1d06b73316f9e5778 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FormatPlugin.java 
81f9f7610b1dfb81ebe2b7edc432cb082b869298 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/easy/EasyFormatPlugin.java
 2918ca7c8df3f0cac26573060a17a452dd3551e0 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
 eff78724c6edfd4a7bffd8e78bf9cf1022e8ce75 

Diff: https://reviews.apache.org/r/35941/diff/


Testing
---


Thanks,

Steven Phillips



Re: Review Request 35960: DRILL-3307: Query with window function runs out of memory

2015-06-26 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35960/#review89600
---

Ship it!


Ship It!

- Steven Phillips


On June 27, 2015, 1:12 a.m., abdelhakim deneche wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/35960/
 ---
 
 (Updated June 27, 2015, 1:12 a.m.)
 
 
 Review request for drill and Steven Phillips.
 
 
 Bugs: DRILL-3307
 https://issues.apache.org/jira/browse/DRILL-3307
 
 
 Repository: drill-git
 
 
 Description
 ---
 
 Fixed sort to only use copier allocator when spilling to disk
 
 
 Diffs
 -
 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/ExternalSortBatch.java
  02a1c08 
 
 Diff: https://reviews.apache.org/r/35960/diff/
 
 
 Testing
 ---
 
 ongoing...
 
 
 Thanks,
 
 abdelhakim deneche
 




Re: Why BatchSchema.getColumn(int) accepts out-of-range index values?

2015-06-26 Thread Steven Phillips
I don't think there is a good reason. I think we should throw an exception
if out of range. In the few places that method is used, the expectation seems
to be that the method will always return a non-null value.
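
Something like this would do it (a sketch; assumes the backing list is called
fields, and uses Guava's Preconditions):

import com.google.common.base.Preconditions;

public MaterializedField getColumn(int index) {
  Preconditions.checkElementIndex(index, fields.size());
  return fields.get(index);
}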

On Fri, Jun 26, 2015 at 3:33 PM, Daniel Barclay dbarc...@maprtech.com
wrote:

 Why does Why BatchSchema.getColumn(int index) accept out-of-range values of
 index?

 Thanks,
 Daniel
 --
 Daniel Barclay
 MapR Technologies




-- 
 Steven Phillips
 Software Engineer

 mapr.com


[jira] [Created] (DRILL-3366) Short circuit of OR expression causes incorrect partitioning

2015-06-24 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-3366:
--

 Summary: Short circuit of OR expression causes incorrect 
partitioning
 Key: DRILL-3366
 URL: https://issues.apache.org/jira/browse/DRILL-3366
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Codegen
Reporter: Steven Phillips
Assignee: Steven Phillips


CTAS partitioning relies on evaluating the expression newPartitionValue(column 
A) || newPartitionValue(column B) || ..

to determine whether a new partition should start. The newPartitionValue 
function returns true if the current value of the expression is different from 
the previous value. The function holds some state in the workspace (the 
previous value), and thus needs to be evaluated every time. Short circuit 
expression evaluation causes this to not be the case.
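
A simplified, self-contained illustration of the problem (names are made up; 
the real code is generated):

{code}
// Stand-in for the generated boundary check over two partition columns:
final class BoundaryDetector {
  private Object prevA, prevB;

  private boolean newValueA(Object a) {
    boolean changed = !java.util.Objects.equals(prevA, a);
    prevA = a;                // the state update must happen on EVERY row
    return changed;
  }

  private boolean newValueB(Object b) {
    boolean changed = !java.util.Objects.equals(prevB, b);
    prevB = b;
    return changed;
  }

  boolean isBoundary(Object a, Object b) {
    // '|' (not '||'): both state updates must run even when the left side is
    // already true; '||' would leave prevB stale and mis-detect later rows.
    return newValueA(a) | newValueB(b);
  }
}
{code}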



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 35739: Patch for DRILL-3333

2015-06-23 Thread Steven Phillips


 On June 22, 2015, 10:25 p.m., Jacques Nadeau wrote:
  exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java,
   line 278
  https://reviews.apache.org/r/35739/diff/1/?file=989875#file989875line278
 
  Is creating the metadata converter repeatedly expensive?

It's not expensive, but I will go ahead and reuse it anyway, as it looks 
cleaner.


- Steven


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35739/#review88842
---


On June 22, 2015, 10:22 p.m., Steven Phillips wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/35739/
 ---
 
 (Updated June 22, 2015, 10:22 p.m.)
 
 
 Review request for drill.
 
 
 Bugs: DRILL-
 https://issues.apache.org/jira/browse/DRILL-
 
 
 Repository: drill-git
 
 
 Description
 ---
 
 DRILL-: Parquet writer auto-partitioning and partition pruning
 
 Conflicts:
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/WriterPrel.java
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/CreateTableHandler.java
   exec/java-exec/src/test/java/org/apache/drill/TestExampleQueries.java
 
 
 Diffs
 -
 
   exec/java-exec/src/main/codegen/templates/AbstractRecordWriter.java 
 6b6065f6b6c8469aa548acf194e0621b9f4ffea8 
   exec/java-exec/src/main/codegen/templates/EventBasedRecordWriter.java 
 797f3cb8c83a89821ee46ce0b093f81406fa6067 
   exec/java-exec/src/main/codegen/templates/NewValueFunctions.java 
 PRE-CREATION 
   exec/java-exec/src/main/codegen/templates/RecordWriter.java 
 c6325fd0a5c7d7cb5f3628df1ecf9c01c264ed52 
   exec/java-exec/src/main/codegen/templates/StringOutputRecordWriter.java 
 f704cca0e4d62ca1435df84d9eb1b07b32ea8b39 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScan.java
  5c4ee4da9e0542244b0f71a520cea1c3a2d49a66 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/GroupScan.java
  2d16cd01b94ed8a5463c0e2fb896f019133f7f03 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/WriterRecordBatch.java
  d5d64a722ed6d9b5d97158046e6838f07c0d5381 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/ParquetPartitionDescriptor.java
  PRE-CREATION 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRuleSets.java
  d9b1354492454dcd2630c72f5dbc1c3badf958c7 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/ParquetPruneScanRule.java
  PRE-CREATION 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/CreateTableHandler.java
  920b2848d8edb62667b880e81f5aee12b459d63a 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/AutoPartitioner.java 
 PRE-CREATION 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/NewValueFunction.java
  PRE-CREATION 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JsonRecordWriter.java
  a43a4a0f21bf11f29b6385e36db4d25003ffa98f 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
  cf39518b2a8b4564504a3971d1f89c268aee4b30 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
  621f05c4d50ecf83071a5df414be88e7471f0490 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/text/DrillTextRecordWriter.java
  31b1fbe9e03282161ee125cb7a4b2f53c8a8da63 
 
 Diff: https://reviews.apache.org/r/35739/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Steven Phillips
 




Re: Review Request 35739: Patch for DRILL-3333

2015-06-22 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35739/
---

(Updated June 22, 2015, 10:22 p.m.)


Review request for drill.


Bugs: DRILL-
https://issues.apache.org/jira/browse/DRILL-


Repository: drill-git


Description
---

DRILL-: Parquet writer auto-partitioning and partition pruning

Conflicts:

exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/WriterPrel.java

exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/CreateTableHandler.java
exec/java-exec/src/test/java/org/apache/drill/TestExampleQueries.java


Diffs (updated)
-

  exec/java-exec/src/main/codegen/templates/AbstractRecordWriter.java 
6b6065f6b6c8469aa548acf194e0621b9f4ffea8 
  exec/java-exec/src/main/codegen/templates/EventBasedRecordWriter.java 
797f3cb8c83a89821ee46ce0b093f81406fa6067 
  exec/java-exec/src/main/codegen/templates/NewValueFunctions.java PRE-CREATION 
  exec/java-exec/src/main/codegen/templates/RecordWriter.java 
c6325fd0a5c7d7cb5f3628df1ecf9c01c264ed52 
  exec/java-exec/src/main/codegen/templates/StringOutputRecordWriter.java 
f704cca0e4d62ca1435df84d9eb1b07b32ea8b39 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScan.java
 5c4ee4da9e0542244b0f71a520cea1c3a2d49a66 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/GroupScan.java 
2d16cd01b94ed8a5463c0e2fb896f019133f7f03 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/WriterRecordBatch.java
 d5d64a722ed6d9b5d97158046e6838f07c0d5381 
  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/ParquetPartitionDescriptor.java
 PRE-CREATION 
  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRuleSets.java
 d9b1354492454dcd2630c72f5dbc1c3badf958c7 
  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/ParquetPruneScanRule.java
 PRE-CREATION 
  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/CreateTableHandler.java
 920b2848d8edb62667b880e81f5aee12b459d63a 
  exec/java-exec/src/main/java/org/apache/drill/exec/store/AutoPartitioner.java 
PRE-CREATION 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/NewValueFunction.java 
PRE-CREATION 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JsonRecordWriter.java
 a43a4a0f21bf11f29b6385e36db4d25003ffa98f 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 cf39518b2a8b4564504a3971d1f89c268aee4b30 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
 621f05c4d50ecf83071a5df414be88e7471f0490 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/text/DrillTextRecordWriter.java
 31b1fbe9e03282161ee125cb7a4b2f53c8a8da63 

Diff: https://reviews.apache.org/r/35739/diff/


Testing
---


Thanks,

Steven Phillips



Review Request 35739: Patch for DRILL-3333

2015-06-22 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35739/
---

Review request for drill.


Bugs: DRILL-3333
https://issues.apache.org/jira/browse/DRILL-3333


Repository: drill-git


Description
---

DRILL-3333: Parquet writer auto-partitioning and partition pruning

Conflicts:

exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/WriterPrel.java

exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/CreateTableHandler.java
exec/java-exec/src/test/java/org/apache/drill/TestExampleQueries.java


Diffs
-

  exec/java-exec/src/main/codegen/templates/AbstractRecordWriter.java 
6b6065f6b6c8469aa548acf194e0621b9f4ffea8 
  exec/java-exec/src/main/codegen/templates/EventBasedRecordWriter.java 
797f3cb8c83a89821ee46ce0b093f81406fa6067 
  exec/java-exec/src/main/codegen/templates/NewValueFunctions.java PRE-CREATION 
  exec/java-exec/src/main/codegen/templates/RecordWriter.java 
c6325fd0a5c7d7cb5f3628df1ecf9c01c264ed52 
  exec/java-exec/src/main/codegen/templates/StringOutputRecordWriter.java 
f704cca0e4d62ca1435df84d9eb1b07b32ea8b39 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScan.java
 5c4ee4da9e0542244b0f71a520cea1c3a2d49a66 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/GroupScan.java 
2d16cd01b94ed8a5463c0e2fb896f019133f7f03 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/WriterRecordBatch.java
 d5d64a722ed6d9b5d97158046e6838f07c0d5381 
  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/ParquetPartitionDescriptor.java
 PRE-CREATION 
  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillRuleSets.java
 d9b1354492454dcd2630c72f5dbc1c3badf958c7 
  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/ParquetPruneScanRule.java
 PRE-CREATION 
  
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/CreateTableHandler.java
 920b2848d8edb62667b880e81f5aee12b459d63a 
  exec/java-exec/src/main/java/org/apache/drill/exec/store/AutoPartitioner.java 
PRE-CREATION 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/NewValueFunction.java 
PRE-CREATION 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JsonRecordWriter.java
 a43a4a0f21bf11f29b6385e36db4d25003ffa98f 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 cf39518b2a8b4564504a3971d1f89c268aee4b30 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
 621f05c4d50ecf83071a5df414be88e7471f0490 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/text/DrillTextRecordWriter.java
 31b1fbe9e03282161ee125cb7a4b2f53c8a8da63 

Diff: https://reviews.apache.org/r/35739/diff/


Testing
---


Thanks,

Steven Phillips



Re: [DISCUSS] Allowing the option to use github pull requests in place of reviewboard

2015-06-22 Thread Steven Phillips
+1

I am in favor of giving this a try.

If I remember correctly, the reason we abandoned pull requests originally
was because we couldn't close the pull requests through Github. A solution
could be for whoever pushes the commit to the apache git repo to add the
line "Closes #<pull request number>" to the commit message. Github would
then automatically close the pull request.
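
For example, a commit message along these lines (the JIRA and pull request
numbers here are made up for illustration) would close the pull request
when the commit lands in the apache repo:

{code}
DRILL-1234: Fix widget allocation in FooBatch

This closes #456
{code}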

On Mon, Jun 22, 2015 at 1:02 PM, Jason Altekruse altekruseja...@gmail.com
wrote:

 Hello Drill developers,

 I am writing this message today to propose allowing the use of github pull
 requests to perform reviews in place of the apache reviewboard instance.

 Reviewboard has caused a number of headaches in the past few months, and I
 think it's time to evaluate the benefits of the apache infrastructure
 relative to the actual cost of using it in practice.

 For clarity of the discussion, we cannot use the complete github workflow.
 Committers will still need to use patch files, or check out the branch used
 in the review request and push to apache master manually. I am not
 advocating for using a merging strategy with git, just for using the github
 web UI for reviews. I expect anyone generating a chain of commits as
 described below to use the rebasing workflow we do today. Additionally devs
 should only be breaking up work to make it easier to review, we will not be
 reviewing branches that contain a bunch of useless WIP commits.

 A few examples of problems I have experienced with reviewboard include:
 corruption of patches when they are downloaded, the web interface showing
 inconsistent content from the raw diff, and random rejection of patches
 that are based directly on the head of apache master.

 These are all serious blockers for getting code reviewed and integrated
 into the master branch in a timely manner.

 In addition to serious bugs in reviewboard, there are a number of
 difficulties with the combination of our typical dev workflow and how
 reviewboard works with patches. As we are still adding features to Drill,
 we often have several weeks of work to submit in response to a JIRA or
 series of related JIRAs. Sometimes this work can be broken up into
 independent reviewable units, and other times it cannot. When a series of
 changes requires a mixture of refactoring and additions, the process is
 currently quite painful. Either reviewers need to look through a giant messy
 diff, or the submitters need to do a lot of extra work. This involves not
 only organizing their work into a reviewable series of commits, but also
 generating redundant squashed versions of the intermediate work to make
 reviewboard happy.

 For a relatively simple 3 part change, this involves creating 3 reviewboard
 pages. The first will contain the first commit by itself. The second will
 have the first commit's patch as a parent patch with the next change in the
 series uploaded as the core change to review. For the third change, a
 squashed version of the first two commits must be generated to serve as a
 parent patch and then the third changeset uploaded as the reviewable
 change. Frequently a change to the first commit requires regenerating all
 of these patches and uploading them to the individual review pages.

 This gets even worse with larger chains of commits.

 It would be great if all of our changes could be small units of work, but
 very frequently we want to make sure we are ready to merge a complete
 feature before starting the review process. We need to have a better way to
 manage these large review units, as I do not see the possibility of
 breaking up the work into smaller units as a likely solution. We still have
 lots of features and system cleanup to work on.

 For anyone unfamiliar, github pull requests are based on a branch you push
 to your personal fork. They give space for a general discussion, as well as
 allow commenting inline on the diff. They give a clear reference to each
 commit in the branch, allowing reviewers to see each piece of work
 individually as well as provide a squashed view to see the overall
 differences.

 For the sake of keeping the project history connected to JIRA, we can see
 if there is enough automatic github integration or possibly upload patch
 files to JIRA each time we update a pull request. As an side note, if we
 don't need individual patches for reviewboard we could just put patch files
 on JIRA that contain several commits. These are much easier to generate and
 apply than a bunch of individual files for each change. This should prevent
 JIRAs needing long lists of patches with names like
 DRILL-3000-part1-version3.patch




-- 
 Steven Phillips
 Software Engineer

 mapr.com


Re: [DISCUSSION] How can we improve the performance of Window Functions

2015-06-12 Thread Steven Phillips
Can you give us some data on what the current performance looks like, vs
what you would expect? Are we spending most of the time in the sort, or in the
Window function operator?

On Thu, Jun 11, 2015 at 10:55 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Speed in many such loops depends a lot on how the loops are ordered so that
 cache and registers can be re-used.  I have no idea what will make your
 windowing functions fast, but I can say some things about what makes matrix
 math fast.

 The key with matrix multiplication is that there are n^3/2 operations to do
 on n^2 elements.  The minimum number of memory operations is n^2 which
 sounds good because modern CPU's can often do hundreds of operations per
 main memory access.  This also means that if we code a naive
 implementation, we will generally be memory bound because that will
 increase the number of memory operations to k n^3.

 To avoid that, the loops involved can be restructured so that larger and
 larger blocks of data are used.  At the lowest levels, small blocks of 2 x
 4 values or so are used to code the multiplication since all of these
 values can be kept in registers.  At one step up, the computation is
 structured to only operate on elements that fit in the fastest level of
 cache which is typically 10's of kB in size.
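
 A minimal Java sketch of that blocking idea (illustrative only, not Drill
 code; the block size would be tuned to the actual cache):

{code}
// Tile the classic triple loop so blocks of a, b and c stay in cache
// and each loaded value is reused many times before eviction.
final class BlockedMatMul {
  static void multiply(double[][] a, double[][] b, double[][] c, int n) {
    final int bs = 64;  // block size; tuned to cache size in practice
    for (int i0 = 0; i0 < n; i0 += bs)
      for (int k0 = 0; k0 < n; k0 += bs)
        for (int j0 = 0; j0 < n; j0 += bs)
          for (int i = i0; i < Math.min(i0 + bs, n); i++)
            for (int k = k0; k < Math.min(k0 + bs, n); k++) {
              double aik = a[i][k];  // held in a register across the j loop
              for (int j = j0; j < Math.min(j0 + bs, n); j++) {
                c[i][j] += aik * b[k][j];
              }
            }
  }
}
{code}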

 Your loop looks like this:

 for (start = 0 ... end-n) {
initialize()
for (offset = 0 ... n-1) {
   aggregate(start + offset)
}
finalize()
 }

 This arrangement is pretty cache friendly if n is small enough, but it
 seems that it could be even more friendly if you kept all of the
 aggregators at the ready and handed each sample to all of the aggregators
 before moving to the next position.
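
 A rough Java sketch of that reordering, for a sliding window of width n
 (illustrative only; the Aggregator interface is made up, not Drill's):

{code}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

final class SlidingWindows {
  // Hypothetical aggregator; not part of Drill.
  interface Aggregator {
    void add(double v);
    double done();
  }

  // Each value is read from memory once and handed to every live
  // aggregator, instead of each window re-reading all n of its values.
  // Assumes values.length >= n.
  static double[] slidingAggregate(double[] values, int n, Supplier<Aggregator> factory) {
    double[] out = new double[values.length - n + 1];
    Deque<Aggregator> live = new ArrayDeque<>();
    for (int pos = 0; pos < values.length; pos++) {
      if (pos <= values.length - n) {
        live.addLast(factory.get());                  // a new window starts here
      }
      for (Aggregator a : live) {
        a.add(values[pos]);                           // one read, many adds
      }
      if (pos >= n - 1) {
        out[pos - n + 1] = live.removeFirst().done(); // oldest window is done
      }
    }
    return out;
  }
}
{code}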

 On Thu, Jun 11, 2015 at 3:55 PM, Abdel Hakim Deneche 
 adene...@maprtech.com
 wrote:

  Hi all,
 
  The purpose of this email is to describe how window functions are
 computed
  and to try to come up with better ways to do it.
 
  DRILL-3200 https://issues.apache.org/jira/browse/DRILL-3200 added
  support
  for RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK and CUME_DIST but also
 made
  some significant  improvements to the way Drill computes window
 functions.
 
  The general idea was to update the code to only support the default frame
  which makes it run faster and use less memory.
 
  WindowFrameRecordBatch works similarly to StreamingAggregate: it requires
  the data to be sorted on the partition and order by columns and only
  computes one frame at a time. With the default frame we need to
  aggregate every row only once.
  Memory consumption depends on the data, but in general each record batch
  is kept in memory until we are ready to process all its rows (which is
  possible when we find the last peer row of the batch's last row). Drill's
  external sort can spill to disk if the data is too big, and we only need to
  keep at most one partition's worth of data in memory for the window
  functions to be computed (when the over clause doesn't contain an order by).
 
  Each time a batch is ready to be processed we do the following:
 
  1- we start with its first row (current row)
  2- we compute the length of the current row's frame (in this case we find
  the number of peer rows for the current row),
  3- we aggregate (this includes computing the window function values) all
  rows of the current frame
  4- we write the aggregated value in each row of the current frame.
  5- We then move to the 1st non peer row which becomes the current row
  6- if we didn't reach the end of the current batch go back to 2
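
  A condensed Java sketch of steps 1-6 (illustrative only, not Drill's
  generated operator code; Row is a made-up stand-in for record batch
  access):

{code}
final class DefaultFrameSketch {
  // Stand-in for Drill's record/vector access; hypothetical.
  interface Row {
    boolean isPeerOf(Row other);   // same partition and same order-by value
    long value();                  // the column being aggregated
    void setWindowValue(long v);   // where the window result is written
  }

  static void processBatch(Row[] rows) {
    int current = 0;                                    // step 1
    while (current < rows.length) {
      int end = current + 1;                            // step 2: frame length
      while (end < rows.length && rows[current].isPeerOf(rows[end])) {
        end++;
      }
      long sum = 0;                                     // step 3: aggregate once
      for (int i = current; i < end; i++) {
        sum += rows[i].value();
      }
      for (int i = current; i < end; i++) {
        rows[i].setWindowValue(sum);                    // step 4: write to peers
      }
      current = end;                                    // steps 5-6: next frame
    }
  }
}
{code}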
 
  With all this in mind, how can we improve the performance of window
  functions ?
 
  Thanks!
  --
 
  Abdelhakim Deneche
 
  Software Engineer
 
http://www.mapr.com/
 
 
  Now Available - Free Hadoop On-Demand Training
  
 
 http://www.mapr.com/training?utm_source=Emailutm_medium=Signatureutm_campaign=Free%20available
  
 




-- 
 Steven Phillips
 Software Engineer

 mapr.com


Re: Window function query takes too long to complete and return results

2015-06-09 Thread Steven Phillips
In cases like this where you are printing millions of record in SQLLINE,
you should pipe the output to /dev/null or to a file, and measure the
performance that way. I'm guessing that most of the time in this case is
spent printing the output to the console, and thus really unrelated to
Drill performance. If piping the data to a file or /dev/null causes the
query to run much faster, then it probably isn't a real issue.

Also, any time you are investigating a performance-related issue, you should
always check the profile. In this case, I suspect you might see that most
of the time is spent in the WAIT time of the SCREEN operator. That would
indicate that client side processing is slowing the query down.
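
For example (treat the exact flags as approximate; they vary a bit by
sqlline version):

{code}
# Run the query from a script and discard the rows, so console rendering
# doesn't dominate the measured time:
bin/sqlline -u "jdbc:drill:zk=local" --run=query.sql > /dev/null
{code}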

On Tue, Jun 9, 2015 at 7:09 PM, Abdel Hakim Deneche adene...@maprtech.com
wrote:

 please open a JIRA issue. please provide the test file (compressed) or a
 script to generate similar data.

 Thanks!

 On Tue, Jun 9, 2015 at 6:55 PM, Khurram Faraaz kfar...@maprtech.com
 wrote:

  Query that uses window functions takes too long to complete and return
  results. It returns close to a million records, for which it took 533.8
  seconds ~8 minutes
  Input CSV file has two columns, one integer and another varchar type
  column. Please let me know if this needs to be investigated and I can
  report a JIRA to track this if required ?
 
  Size of the input CSV file
 
  [root@centos-01 ~]# hadoop fs -ls /tmp/manyDuplicates.csv
 
  -rwxr-xr-x   3 root root   27889455 2015-06-10 01:26
  /tmp/manyDuplicates.csv
 
  {code}
 
  select count(*) over(partition by cast(columns[1] as varchar(25)) order
 by
  cast(columns[0] as bigint)) from `manyDuplicates.csv`;
 
  ...
 
  1,000,007 rows selected (533.857 seconds)
  {code}
 
  There are five distinct values in columns[1] in the CSV file. = [FIVE
  PARTITIONS]
 
  {code}
 
  0: jdbc:drill:schema=dfs.tmp> select distinct columns[1] from
  `manyDuplicates.csv`;

  +---------+
  | EXPR$0  |
  +---------+
  |         |
  |         |
  |         |
  |         |
  |         |
  +---------+
 
  5 rows selected (1.906 seconds)
  {code}
 
  Here is the count for each of those values in columns[1]
 
  {code}
 
  0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
  `manyDuplicates.csv` where columns[1] = '';

  +---------+
  | EXPR$0  |
  +---------+
  | 200484  |
  +---------+
 
  1 row selected (0.961 seconds)
 
  {code}
 
 
  {code}
 
  0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
  `manyDuplicates.csv` where columns[1] = '';

  +---------+
  | EXPR$0  |
  +---------+
  | 199353  |
  +---------+
 
  1 row selected (0.86 seconds)
 
  {code}
 
 
  {code}
 
  0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
  `manyDuplicates.csv` where columns[1] = '';

  +---------+
  | EXPR$0  |
  +---------+
  | 200702  |
  +---------+
 
  1 row selected (0.826 seconds)
 
  {code}
 
 
  {code}
 
  0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
  `manyDuplicates.csv` where columns[1] = '';

  +---------+
  | EXPR$0  |
  +---------+
  | 199916  |
  +---------+
 
  1 row selected (0.851 seconds)
 
  {code}
 
 
  {code}
 
  0: jdbc:drill:schema=dfs.tmp> select count(columns[1]) from
  `manyDuplicates.csv` where columns[1] = '';

  +---------+
  | EXPR$0  |
  +---------+
  | 199552  |
  +---------+
 
  1 row selected (0.827 seconds)
  {code}
 
  Thanks,
  Khurram
 



 --

 Abdelhakim Deneche

 Software Engineer

   http://www.mapr.com/


 Now Available - Free Hadoop On-Demand Training
 
 http://www.mapr.com/training?utm_source=Emailutm_medium=Signatureutm_campaign=Free%20available
 




-- 
 Steven Phillips
 Software Engineer

 mapr.com


Re: Review Request 35026: DRILL-3246: Query planning support for partition by clause in CTAS statement

2015-06-03 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35026/#review86500
---



exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/parser/SqlCreateTable.java
https://reviews.apache.org/r/35026/#comment138544

Should we maybe use PARTITIONED BY instead, to match Hive's syntax?
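
For context, with the syntax in this patch a partitioned CTAS would look
something like this (table and column names are made up):

{code}
CREATE TABLE dfs.tmp.events_by_type
PARTITION BY (event_type)
AS SELECT event_type, user_id, event_time FROM dfs.`/data/events`;
{code}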


- Steven Phillips


On June 3, 2015, 9:14 p.m., Jinfeng Ni wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/35026/
 ---
 
 (Updated June 3, 2015, 9:14 p.m.)
 
 
 Review request for drill and Venki Korukanti.
 
 
 Repository: drill-git
 
 
 Description
 ---
 
 Main code change :
 
 1) Modify Drill's SQL parser to allow partition by clause in CTAS statement
 2) Modify Drill's query planner to do semantics validation/checking, and 
 generate query plan to support the partition by clause.
 
 In the query plan for the CTAS statement, Drill will ensure data are sorted 
 according to the partition columns. The sort could be a partial sort. 
 Therefore, multiple rows with the same partition column values could end up 
 in different partitions.
 
 
 Diffs
 -
 
   exec/java-exec/src/main/codegen/includes/parserImpls.ftl 1605b06 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/CreateTableEntry.java
  673e8c6 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillWriterRel.java
  fc93c3e 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/FileSystemCreateTableEntry.java
  6784888 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/WriterPrule.java
  5790665 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/CreateTableHandler.java
  2866b8c 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/SqlHandlerUtil.java
  3edcdb2 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/parser/CompoundIdentifierConverter.java
  bfa89a5 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/parser/SqlCreateTable.java
  9fd9d92 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/parser/SqlCreateView.java
  57cfde9 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/AbstractSchema.java 
 6afce1a 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/SubSchemaWrapper.java
  4e50bc1 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSystemSchemaFactory.java
  fa9aa89 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FormatPlugin.java
  5668c54 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/WorkspaceSchemaFactory.java
  b1135d0 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/easy/EasyFormatPlugin.java
  233c32b 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/easy/EasyWriter.java
  e12c5b3 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java
  322a88d 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetWriter.java
  75f0e74 
   exec/java-exec/src/test/java/org/apache/drill/TestExampleQueries.java 
 f0422d3 
 
 Diff: https://reviews.apache.org/r/35026/diff/
 
 
 Testing
 ---
 
 Unit test. 
 
 Precommit regression test.
 
 
 Thanks,
 
 Jinfeng Ni
 




Re: question about correlated arrays and flatten

2015-05-29 Thread Steven Phillips
I think your use case could be solved by adding a UDF that can combine
multiple arrays into a single array. The result of this function could then
be handled by our current implementation of flatten.

I think this is preferable to enhancing flatten itself to handle it, since
flatten is not an ordinary UDF, and thus more difficult to modify and
maintain.
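
This wouldn't be the whole UDF (the repeated-type reader/writer plumbing is
omitted), but the core behavior such a function needs is only this (a
sketch; the zip name and the {t, v} pair shape are assumptions):

{code}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ZipSketch {
  // Combine two correlated arrays into one array of {t, v} pairs,
  // which the existing flatten can then explode into rows.
  static List<Map<String, Object>> zip(List<?> times, List<?> values) {
    if (times.size() != values.size()) {
      throw new IllegalArgumentException("arrays differ in length"); // or pad with nulls
    }
    List<Map<String, Object>> out = new ArrayList<>();
    for (int i = 0; i < times.size(); i++) {
      Map<String, Object> pair = new LinkedHashMap<>();
      pair.put("t", times.get(i));
      pair.put("v", values.get(i));
      out.add(pair);
    }
    return out;
  }
}
{code}

A query could then read flatten(zip(t.a, t.b)) to get one row per
correlated pair.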

On Fri, May 29, 2015 at 3:20 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 My particular use case can throw an error if the lists are different
 length.

 I think our real goal should be to have a logically complete set of simple
 primitives that allows any sort of back and forth conversion of this kind.




 On Fri, May 29, 2015 at 9:58 AM, Jason Altekruse altekruseja...@gmail.com
 
 wrote:

  I understand what you want to do, unfortunately we don't have support for
  this right now. A UDF is the best I can suggest at this point.
 
  Just to explore the idea a little further for the sake of creating a
  complete feature request, I assume you would just want nulls filled in
 for
  the cases where the lists were different lengths?
 
  On Fri, May 29, 2015 at 8:58 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   Input is here: https://gist.github.com/tdunning/07ce66e7e4d4af41afd7
  
   Output is here: https://gist.github.com/tdunning/3aa841c56bfcdc0ab90e
  
   log-synth schema for generating input data is here:
   https://gist.github.com/tdunning/638dd52c00569ffa9582
  
  
   Preferred syntax would be like
  
   select flatten(t, v1, v2) from ...
  
  
  
  
   On Fri, May 29, 2015 at 7:04 AM, Neeraja Rentachintala 
   nrentachint...@maprtech.com wrote:
  
Ted
 can you please give an example with a few data elements in a, b and the
 expected output you are looking for from the query.
   
-Neeraja
   
On Fri, May 29, 2015 at 6:43 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
   
 I have two arrays.  Their elements are correlated times and values.
  I
 would like to flatten them into rows, each with two elements.

 The query

select flatten(a), flatten(b) from ...

 doesn't work because I get the cartesian product (of course).  The
   query

select flatten(a, b) from ...

 also doesn't work because flatten doesn't have a multi-argument
 form.

 Going crazy, this query kind of sort of almost works, but not
 really:

  select r.x.`key`, flatten(r.x.`value`)  from (

  select flatten(kvgen(x)) as x from ...) r;

 What I really want to see is something like this:
select zip(flatten(a), flatten(b)) from ...

 Any pointers?  Is my next step to write a UDF?

   
  
 




-- 
 Steven Phillips
 Software Engineer

 mapr.com


[jira] [Resolved] (DRILL-3100) TestImpersonationDisabledWithMiniDFS fails on Windows

2015-05-15 Thread Steven Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Phillips resolved DRILL-3100.

Resolution: Fixed

Resolved in d8b1975

 TestImpersonationDisabledWithMiniDFS fails on Windows
 -

 Key: DRILL-3100
 URL: https://issues.apache.org/jira/browse/DRILL-3100
 Project: Apache Drill
  Issue Type: Bug
 Environment: {noformat}
 org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
 java.lang.IllegalArgumentException: Pathname 
 /Q:/git/apache-drill/exec/java-exec/target/1431653578758-0 from 
 hdfs://127.0.0.1:30538/Q:/git/apache-drill/exec/java-exec/target/1431653578758-0
  is not a valid DFS filename.
 [Error Id: 4f100f1c-4071-4ef0-8b77-ea5c605f7d76 on 127.0.0.1:31013]
   at 
 org.apache.drill.exec.rpc.user.QueryResultHandler.resultArrived(QueryResultHandler.java:118)
   at 
 org.apache.drill.exec.rpc.user.UserClient.handleReponse(UserClient.java:111)
   at 
 org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:47)
   at 
 org.apache.drill.exec.rpc.BasicClientWithConnection.handle(BasicClientWithConnection.java:1)
   at org.apache.drill.exec.rpc.RpcBus.handle(RpcBus.java:61)
   at 
 org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:218)
   at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:1)
   at 
 io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
   at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
   at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
   at 
 io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
   at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
   at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
 {noformat}
Reporter: Aditya Kishore
Assignee: Aditya Kishore





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (DRILL-3098) Set Unix style line.separator for tests

2015-05-15 Thread Steven Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Phillips resolved DRILL-3098.

Resolution: Fixed

Resolved in 984ee01

 Set Unix style line.separator for tests
 -

 Key: DRILL-3098
 URL: https://issues.apache.org/jira/browse/DRILL-3098
 Project: Apache Drill
  Issue Type: Bug
Reporter: Aditya Kishore
Assignee: Aditya Kishore

 Both Calcite and Jackson's ObjectMapper use this to format JSON and SQL. If 
 left to the platform setting, some tests break on Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (DRILL-3093) Leaking RawBatchBuffer

2015-05-15 Thread Steven Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Phillips resolved DRILL-3093.

Resolution: Fixed

fixed in 7f575df

 Leaking RawBatchBuffer
 --

 Key: DRILL-3093
 URL: https://issues.apache.org/jira/browse/DRILL-3093
 Project: Apache Drill
  Issue Type: Bug
Reporter: Mehant Baid
Assignee: Steven Phillips
 Attachments: DRILL-3093.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (DRILL-3051) Integer overflow in TimedRunnable

2015-05-14 Thread Steven Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Phillips resolved DRILL-3051.

Resolution: Pending Closed

Fixed in 83d8ebe

 Integer overflow in TimedRunnable
 -

 Key: DRILL-3051
 URL: https://issues.apache.org/jira/browse/DRILL-3051
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips
Assignee: Steven Phillips
 Fix For: 1.0.0


 This can cause the timeout to become negative. Causes query to fail.
 Only see this when querying a large number of files (e.g. ~150K)
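
 For illustration only (not TimedRunnable's actual code), the failure mode
 is the classic int multiply overflow:

{code}
public class TimeoutOverflow {
  public static void main(String[] args) {
    int tasks = 150_000;                       // e.g. ~150K files to scan
    int perTaskMillis = 15_000;
    int timeout = tasks * perTaskMillis;       // wraps to -2044967296
    long safe = (long) tasks * perTaskMillis;  // 2250000000; widen first
    System.out.println(timeout + " vs " + safe);
  }
}
{code}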



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (DRILL-3050) Increase query context max memory

2015-05-14 Thread Steven Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Phillips resolved DRILL-3050.

Resolution: Pending Closed

Fixed in b3d097b

 Increase query context max memory
 -

 Key: DRILL-3050
 URL: https://issues.apache.org/jira/browse/DRILL-3050
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips
Assignee: Steven Phillips
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (DRILL-3049) Increase sort spooling threshold

2015-05-14 Thread Steven Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Phillips resolved DRILL-3049.

Resolution: Pending Closed

Fixed in 01a36f1

 Increase sort spooling threshold
 

 Key: DRILL-3049
 URL: https://issues.apache.org/jira/browse/DRILL-3049
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips
Assignee: Steven Phillips
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (DRILL-3051) Integer overflow in TimedRunnable

2015-05-14 Thread Steven Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Phillips resolved DRILL-3051.

Resolution: Fixed

 Integer overflow in TimedRunnable
 -

 Key: DRILL-3051
 URL: https://issues.apache.org/jira/browse/DRILL-3051
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips
Assignee: Steven Phillips
 Fix For: 1.0.0


 This can cause the timeout to become negative. Causes query to fail.
 Only see this when querying a large number of files (e.g. ~150K)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (DRILL-3050) Increase query context max memory

2015-05-14 Thread Steven Phillips (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Phillips resolved DRILL-3050.

Resolution: Fixed

 Increase query context max memory
 -

 Key: DRILL-3050
 URL: https://issues.apache.org/jira/browse/DRILL-3050
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips
Assignee: Steven Phillips
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 34234: DRILL-3071: fix memory leak in RecordBatchLoader#load

2015-05-14 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34234/#review83842
---

Ship it!


Ship It!

- Steven Phillips


On May 14, 2015, 9:06 p.m., Hanifi Gunes wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34234/
 ---
 
 (Updated May 14, 2015, 9:06 p.m.)
 
 
 Review request for drill, Parth Chandra and Steven Phillips.
 
 
 Repository: drill-git
 
 
 Description
 ---
 
 DRILL-3071: fix memory leak in RecordBatchLoader#load
 
 - Clean up allocated vectors in case loading a batch fails.
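  
  The shape of that fix, sketched (names approximate, not the actual diff):
  
{code}
// Remember every vector allocated while loading, and clear them all if
// any later step fails, so their buffers are not leaked.
final List<ValueVector> newVectors = new ArrayList<>();
try {
  for (SerializedField field : def.getFieldList()) {
    ValueVector vector =
        TypeHelper.getNewVector(MaterializedField.create(field), allocator);
    newVectors.add(vector);
    // ... load the corresponding buffer slice into 'vector' ...
  }
} catch (RuntimeException e) {
  for (ValueVector v : newVectors) {
    v.clear();  // releases any buffers the vector already holds
  }
  throw e;
}
{code}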
 
 
 Diffs
 -
 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchLoader.java
  1b8b7cea870d252d0c16babc975de5d2caa56d42 
 
 Diff: https://reviews.apache.org/r/34234/diff/
 
 
 Testing
 ---
 
 unit + regression
 
 
 Thanks,
 
 Hanifi Gunes
 




Review Request 34239: Patch for DRILL-3088

2015-05-14 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34239/
---

Review request for drill.


Bugs: DRILL-3088
https://issues.apache.org/jira/browse/DRILL-3088


Repository: drill-git


Description
---

DRILL-3088: Kill and cleanup remaining batches in left upstream in NestedLoopJoin


Diffs
-

  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinBatch.java
 de0d8e5bd73f0b903113b4c0cfb5f8e17eaaa8c1 

Diff: https://reviews.apache.org/r/34239/diff/


Testing
---


Thanks,

Steven Phillips



Re: Review Request 34184: DRILL-3065: Memory Leak at ExternalSortBatch

2015-05-14 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34184/#review83877
---

Ship it!


Ship It!

- Steven Phillips


On May 14, 2015, 11:14 p.m., Sean Hsuan-Yi Chu wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34184/
 ---
 
 (Updated May 14, 2015, 11:14 p.m.)
 
 
 Review request for drill, abdelhakim deneche, Jinfeng Ni, and Steven Phillips.
 
 
 Bugs: DRILL-3065
 https://issues.apache.org/jira/browse/DRILL-3065
 
 
 Repository: drill-git
 
 
 Description
 ---
 
 Clean the SelectionVector4 in mSort after failure happens
 
 
 Diffs
 -
 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/ExternalSortBatch.java
  529a6ca 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/MSortTemplate.java
  9acae9e 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/MSorter.java
  d97ffc0 
   
 exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java
  f95fbe1 
 
 Diff: https://reviews.apache.org/r/34184/diff/
 
 
 Testing
 ---
 
 Unit test passed; Others are running
 
 
 Thanks,
 
 Sean Hsuan-Yi Chu
 




Re: Review Request 34184: DRILL-3065: Memory Leak at ExternalSortBatch

2015-05-14 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34184/#review83888
---

Ship it!


Ship It!

- Steven Phillips


On May 14, 2015, 11:14 p.m., Sean Hsuan-Yi Chu wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34184/
 ---
 
 (Updated May 14, 2015, 11:14 p.m.)
 
 
 Review request for drill, abdelhakim deneche, Jinfeng Ni, and Steven Phillips.
 
 
 Bugs: DRILL-3065
 https://issues.apache.org/jira/browse/DRILL-3065
 
 
 Repository: drill-git
 
 
 Description
 ---
 
 Clean the SelectionVector4 in mSort after failure happens
 
 
 Diffs
 -
 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/ExternalSortBatch.java
  529a6ca 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/MSortTemplate.java
  9acae9e 
   
 exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/xsort/MSorter.java
  d97ffc0 
   
 exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java
  f95fbe1 
 
 Diff: https://reviews.apache.org/r/34184/diff/
 
 
 Testing
 ---
 
 Unit test passed; Others are running
 
 
 Thanks,
 
 Sean Hsuan-Yi Chu
 




Re: Review Request 34037: Patch for DRILL-2936

2015-05-12 Thread Steven Phillips

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34037/
---

(Updated May 13, 2015, 12:34 a.m.)


Review request for drill.


Bugs: DRILL-2936
https://issues.apache.org/jira/browse/DRILL-2936


Repository: drill-git


Description
---

DRILL-2936: Use SpoolingRawBatchBuffer for HashToMergeExchange in order to 
avoid deadlocks

Refactored common code in UnlimitedRawBatchBuffer and SpoolingRawBatchBuffer
 into BaseRawBatchBuffer

Removed reflection-based construction of RawBatchBuffer. Now the
implementation is chosen based on the plan.

Updated SpoolingRawBatchBuffer to use a separate thread for spooling


Diffs (updated)
-

  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractReceiver.java
 f01d025e8d7a9e3c4fe907a1c134c951b258dc9b 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/Receiver.java 
04d6d7eea28f6ada994384a6edbace2e1bf3d863 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/BroadcastExchange.java
 a37f638bc0fe71f61eee28e05777f636d21a0822 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashToMergeExchange.java
 f45ace92357dd4ad3f787a17e32f6ad755a0794c 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashToRandomExchange.java
 52d79c22603343f8dfe66bc08817e7486f4198c5 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/MergingReceiverPOP.java
 9416814375a39419c25a3465af9dac59c179f0f3 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/OrderedPartitionExchange.java
 c8dbc225d148deee791c96ff8f602d59f8104daf 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/SingleMergeExchange.java
 c812325807bdec70f1af7c1373ca1ff40e7ac29f 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/UnionExchange.java
 b7b7835282f529e68256ea014b4da4ea9d269b87 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/UnorderedDeMuxExchange.java
 0bc6678eddc9faa4d6b09ec378adb7712cde3960 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/UnorderedMuxExchange.java
 3028ee3ed2bcb778fe41e393af7ee1f0198d1cec 
  
exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/UnorderedReceiver.java
 e741dd422543a63ba9d7dcea54f822f03870be48 
  
exec/java-exec/src/main/java/org/apache/drill/exec/record/RawFragmentBatch.java 
edd79ac818c490a344c929c7e40a7502ef93300e 
  
exec/java-exec/src/main/java/org/apache/drill/exec/store/LocalSyncableFileSystem.java
 b88cc28d2f89dcf0004809ecedcef06e5ab187d0 
  
exec/java-exec/src/main/java/org/apache/drill/exec/work/batch/AbstractDataCollector.java
 ed163146d95ac0d5a12bf246765da373668fec84 
  
exec/java-exec/src/main/java/org/apache/drill/exec/work/batch/BaseRawBatchBuffer.java
 PRE-CREATION 
  
exec/java-exec/src/main/java/org/apache/drill/exec/work/batch/RawBatchBuffer.java
 8646a72e4789a038b331509e8cdef6af20d77851 
  
exec/java-exec/src/main/java/org/apache/drill/exec/work/batch/SpoolingRawBatchBuffer.java
 07a3505f105ad01c392d834a8df2e2075c676d53 
  
exec/java-exec/src/main/java/org/apache/drill/exec/work/batch/UnlimitedRawBatchBuffer.java
 47501c7629d7b50741cf9413977f5d8cfcc6 
  exec/java-exec/src/main/resources/drill-module.conf 
d98b97a7cfd9c0667d16832b35a44368562346f6 
  
exec/java-exec/src/test/java/org/apache/drill/exec/work/batch/TestSpoolingBuffer.java
 dcea9bbecf1b7f068de61bfe897586412cfabcc0 
  
exec/java-exec/src/test/java/org/apache/drill/exec/work/batch/TestUnlimitedBatchBuffer.java
 b8336e9779d45251081128ff1450adc4a2f38576 

Diff: https://reviews.apache.org/r/34037/diff/


Testing
---


Thanks,

Steven Phillips



[jira] [Created] (DRILL-3049) Increase sort spooling threshold

2015-05-12 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-3049:
--

 Summary: Increase sort spooling threshold
 Key: DRILL-3049
 URL: https://issues.apache.org/jira/browse/DRILL-3049
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-3048) Disable assertions by default

2015-05-12 Thread Steven Phillips (JIRA)
Steven Phillips created DRILL-3048:
--

 Summary: Disable assertions by default
 Key: DRILL-3048
 URL: https://issues.apache.org/jira/browse/DRILL-3048
 Project: Apache Drill
  Issue Type: Bug
Reporter: Steven Phillips






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

