[GitHub] drill issue #610: DRILL-4674: Allow casting to boolean the same literals as ...

2016-11-03 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/610
  
@sudheeshkatkam 
Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #644: DRILL-4980: Upgrading of the approach of parquet da...

2016-11-03 Thread vdiravka
GitHub user vdiravka opened a pull request:

https://github.com/apache/drill/pull/644

DRILL-4980: Upgrading of the approach of parquet date correctness status 
detection

- Parquet writer version is added;
- Updated the detection method of parquet date correctness.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vdiravka/drill DRILL-4980

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/644.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #644


commit ca447255b1a8137a25b01e051053b8f354c75ad0
Author: Vitalii Diravka 
Date:   2016-10-26T12:22:06Z

DRILL-4980: Upgrading of the approach of parquet date correctness status 
detection
- Parquet writer version is added;
- Updated the detection method of parquet date correctness.






Re: isDateCorrect field in ParquetTableMetadata

2016-11-03 Thread Vitalii Diravka
I have opened a PR for DRILL-4980.

@Jinfeng I removed the is.date.correct flag from the writer, but kept the check
for it in the detectCorruptDates() method in case somebody has already generated
parquet files with this property.

@Jason I have added the Drill version check, as you suggested.

@Paul I have added the parquet-writer.version property to the parquet metadata
footer.
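Roughly, the detection now works like this (a simplified sketch of the new
detectCorruptDates() precedence, not the exact PR code):

    String writerVersion = footer.getFileMetaData().getKeyValueMetaData()
        .get(ParquetRecordWriter.WRITER_VERSION_PROPERTY);
    String isDateCorrect = footer.getFileMetaData().getKeyValueMetaData()
        .get("is.date.correct");   // legacy flag written by 1.9.0-SNAPSHOT builds
    if ((writerVersion != null && Integer.parseInt(writerVersion) >= 2)
        || Boolean.valueOf(isDateCorrect)
        || isDrillVersionHasCorrectDates(drillVersion)) {
      return DateCorruptionStatus.META_SHOWS_NO_CORRUPTION;
    }
    // otherwise fall back to checking the date column statistics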
If you have other thoughts about the variable names, please let me know and I
will try to respond as quickly as possible.
Kind regards
Vitalii

2016-10-31 7:07 GMT+02:00 Paul Rogers :

> Choosing a good property name should solve the confusion issue. Perhaps
> drill.writer as the name.
>
> The writer version is not needed if we feel that we’ll never again change
> the writer or introduce bugs. Since that is hard to predict, adding a
> writer version is very low cost insurance. Indeed, including a writer
> version is a very common technique. The question we should answer is why we
> don’t need to use this technique here…
>
> One can imagine that, as we evolve from 1.x to the 2.1 (or later) format,
> we may do so in fits and starts, perhaps based on community contributions.
> A version will help us know what capabilities were supported in the writer
> that wrote a particular file.
>
> Thanks,
>
> - Paul
>
> > On Oct 28, 2016, at 3:03 PM, Jason Altekruse  wrote:
> >
> > The only worry I have about declaring a writer version is possible
> > confusion with the Parquet format version itself. The format is already
> > defined through version 2.1 or something like that, but we are currently
> > only writing files based on the 1.x version of the format.
> >
> > My preferred solution to this problem would be to just make point releases
> > for problems like this (in this case we could have made a 1.8.1 release,
> > and then all of the 1.8.0-SNAPSHOT builds would be known to be bad and
> > everything after would be 1.8.1-SNAPSHOT and known to be correct).
> >
> > I'm open to hearing other opinions on this; I just generally feel like
> > these bugs should be rare, and fixing them should be done with a lot of
> > care (and in this case I missed a few things). I don't think it would be
> > crazy to say that we should only merge these kinds of patches if we are
> > willing to say the fix is ready for a release.
> >
> > Jason Altekruse
> > Software Engineer at Dremio
> > Apache Drill Committer
> >
> > On Fri, Oct 28, 2016 at 2:52 PM, Vitalii Diravka <vitalii.dira...@gmail.com> wrote:
> >
> >> Jinfeng,
> >>
> >> isDateCorrect will be false in the code when the isDateCorrect property is
> >> absent in the parquet metadata.
> >>
> >> Anyway, I am going to implement the mentioned approach with the
> >> parquet-writer.version property instead of isDateCorrect.
> >>
>
>


[GitHub] drill issue #644: DRILL-4980: Upgrading of the approach of parquet date corr...

2016-11-03 Thread vdiravka
Github user vdiravka commented on the issue:

https://github.com/apache/drill/pull/644
  
@parthchandra @paul-rogers @jaltekruse could you please review?




[jira] [Resolved] (DRILL-4944) incorrect results - case expression

2016-11-03 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva resolved DRILL-4944.
-
   Resolution: Won't Fix
Fix Version/s: (was: 1.9.0)

> incorrect results - case expression
> ---
>
> Key: DRILL-4944
> URL: https://issues.apache.org/jira/browse/DRILL-4944
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow
>Affects Versions: 1.9.0
>Reporter: Khurram Faraaz
>Assignee: Serhii Harnyk
>Priority: Critical
>
> Drill 1.9.0 (git commit id: 4edabe7a) returns null, which is wrong.
>  {noformat}
>  0: jdbc:drill:schema=dfs.tmp> SELECT res2, case res2 WHEN 0.1 THEN 0. 
> ELSE null END
> . . . . . . . . . . . . . . > FROM
> . . . . . . . . . . . . . . > (
> . . . . . . . . . . . . . . > SELECT
> . . . . . . . . . . . . . . > (CASE WHEN (false) THEN null ELSE CAST(0.1 
> as float) end) res2
> . . . . . . . . . . . . . . > FROM (values(1)) foo
> . . . . . . . . . . . . . . > ) foobar ;
> +---+-+
> | res2  | EXPR$1  |
> +---+-+
> | 0.1   | null|
> +---+-+
> 1 row selected (0.106 seconds)
>  {noformat}
>  
>  Postgres returns correct results
>   {noformat}
>  postgres=# SELECT res2, case res2 WHEN 0.1 THEN 0. ELSE null END
> postgres-# FROM
> postgres-# (
> postgres(# SELECT
> postgres(# (CASE WHEN (false) THEN null ELSE CAST(0.1 as float) end) res2
> postgres(# FROM (values(1)) foo
> postgres(# ) foobar ;
>  res2 |  case
> --+
>   0.1 | 0.
> (1 row)
>  {noformat}
>  
>  Calcite also returns correct results
>   {noformat}
>  0: jdbc:calcite:model=target/test-classes/mod> SELECT res2, case res2 WHEN 
> 0.1 THEN 0. ELSE null END
> . . . . . . . . . . . . . . . . . . . . . . .> FROM
> . . . . . . . . . . . . . . . . . . . . . . .> (
> . . . . . . . . . . . . . . . . . . . . . . .>   SELECT
> . . . . . . . . . . . . . . . . . . . . . . .>  (CASE WHEN (false) 
> THEN null ELSE CAST(0.1 as float) end) res2
> . . . . . . . . . . . . . . . . . . . . . . .>  FROM (values(1)) foo
> . . . . . . . . . . . . . . . . . . . . . . .> ) foobar ;
> +-++
> |  RES2   | EXPR$1 |
> +-++
> | 0.1 | 0. |
> +-++
> 1 row selected (1.277 seconds)
>  {noformat}
>  
>  Details of explain plan from Drill 1.9.0
>   {noformat}
>  0: jdbc:drill:schema=dfs.tmp> explain plan for SELECT res2, case res2 WHEN 
> 0.1 THEN 0. ELSE null END
> . . . . . . . . . . . . . . > FROM
> . . . . . . . . . . . . . . > (
> . . . . . . . . . . . . . . >   SELECT
> . . . . . . . . . . . . . . >  (CASE WHEN (false) THEN null ELSE 
> CAST(0.1 as float) end) res2
> . . . . . . . . . . . . . . >  FROM (values(1)) foo
> . . . . . . . . . . . . . . > ) foobar ;
> +--+--+
> | text | json |
> +--+--+
> | 00-00Screen
> 00-01  Project(res2=[$0], EXPR$1=[$1])
> 00-02Project(res2=[CASE(false, null, 0.1)], 
> EXPR$1=[CASE(=(CASE(false, null, 0.1), 0.1), 0., null)])
> 00-03  Values
>  | {
>   "head" : {
> "version" : 1,
> "generator" : {
>   "type" : "ExplainHandler",
>   "info" : ""
> },
> "type" : "APACHE_DRILL_PHYSICAL",
> "options" : [ ],
> "queue" : 0,
> "resultMode" : "EXEC"
>   },
>   "graph" : [ {
> "pop" : "Values",
> "@id" : 3,
> "content" : [ {
>   "EXPR$0" : {
> "$numberLong" : 1
>   }
> } ],
> "initialAllocation" : 100,
> "maxAllocation" : 100,
> "cost" : 1.0
>   }, {
> "pop" : "project",
> "@id" : 2,
> "exprs" : [ {
>   "ref" : "`res2`",
>   "expr" : " ( if (false ) then (NULL )  else (0.1 )  end  ) "
> }, {
>   "ref" : "`EXPR$1`",
>   "expr" : " ( if (equal( ( if (false ) then (NULL )  else (0.1 )  end  ) 
> , 0.1)  ) then (0. )  else (NULL )  end  ) "
> } ],
> "child" : 3,
> "initialAllocation" : 100,
> "maxAllocation" : 100,
> "cost" : 1.0
>   }, {
> "pop" : "project",
> "@id" : 1,
> "exprs" : [ {
>   "ref" : "`res2`",
>   "expr" : "`res2`"
> }, {
>   "ref" : "`EXPR$1`",
>   "expr" : "`EXPR$1`"
> } ],
> "child" : 2,
> "initialAllocation" : 100,
> "maxAllocation" : 100,
> "cost" : 1.0
>   }, {
> "pop" : "screen",
> "@id" : 0,
> "child" : 1,
> "initialAllocation" : 100,
> "maxAllocation" : 100,
> "cost" : 1.0
>   } ]
> } |
>  {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] drill issue #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread jinfengni
Github user jinfengni commented on the issue:

https://github.com/apache/drill/pull/611
  
+1

Looks good to me.

Very good documentation of design and configuration parameters. 




[GitHub] drill pull request #560: DRILL-4823: Fix OOM while trying to prune partition...

2016-11-03 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/560#discussion_r86382118
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java ---
@@ -445,11 +447,16 @@ public PartitionExplorer getPartitionExplorer() {
   }
 
   @Override
-  public ValueHolder getConstantValueHolder(String value, 
Function holderInitializer) {
-ValueHolder valueHolder = constantValueHolderCache.get(value);
+  public ValueHolder getConstantValueHolder(String value, MinorType type, 
Function holderInitializer) {
--- End diff --

it would be nice if we could share the common code of
getConstantValueHolder() in QueryContext/FragmentContext by wrapping it in a
helper function. But this is optional.




[GitHub] drill issue #581: DRILL-4864: Add ANSI format for date/time functions

2016-11-03 Thread adeneche
Github user adeneche commented on the issue:

https://github.com/apache/drill/pull/581
  
+1 LGTM, 

sorry for taking so long




[GitHub] drill issue #560: DRILL-4823: Fix OOM while trying to prune partitions with ...

2016-11-03 Thread jinfengni
Github user jinfengni commented on the issue:

https://github.com/apache/drill/pull/560
  
+1

The revised patch looks good to me. @arina-ielchiieva, thanks for your pull
request!





[jira] [Created] (DRILL-4995) Allow lazy init when dynamic UDF support is disabled

2016-11-03 Thread Roman (JIRA)
Roman created DRILL-4995:


 Summary: Allow lazy init when dynamic UDF support is disabled
 Key: DRILL-4995
 URL: https://issues.apache.org/jira/browse/DRILL-4995
 Project: Apache Drill
  Issue Type: Bug
  Components: Functions - Drill
Affects Versions: 1.9.0
Reporter: Roman
Assignee: Arina Ielchiieva


Steps in a 2-node cluster:

On the 1st node:
1. Register a jar
2. Run the function (success)
3. Disable dynamic UDF support
4. Run the function again (success)

On the 2nd node:
5. Try to run the function (fails).

On the 1st node the function was initialized before dynamic UDF support was
disabled, but on the 2nd node it was not initialized. So it seems we need to
allow lazy initialization when dynamic UDF support is disabled.





[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86380535
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/RangeExprEvaluator.java
 ---
@@ -0,0 +1,282 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.stat;
+
+import com.google.common.base.Preconditions;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.apache.drill.common.expression.FunctionHolderExpression;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.expression.fn.CastFunctions;
+import org.apache.drill.common.expression.fn.FuncHolder;
+import org.apache.drill.common.expression.visitors.AbstractExprVisitor;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.expr.DrillSimpleFunc;
+import org.apache.drill.exec.expr.fn.DrillSimpleFuncHolder;
+import org.apache.drill.exec.expr.fn.interpreter.InterpreterEvaluator;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.Float4Holder;
+import org.apache.drill.exec.expr.holders.Float8Holder;
+import org.apache.drill.exec.expr.holders.IntHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.store.parquet.stat.ColumnStatistics;
+import org.apache.drill.exec.vector.ValueHolderHelper;
+import org.apache.parquet.column.statistics.DoubleStatistics;
+import org.apache.parquet.column.statistics.FloatStatistics;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.LongStatistics;
+import org.apache.parquet.column.statistics.Statistics;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+
+public class RangeExprEvaluator extends AbstractExprVisitor {
+  static final Logger logger = 
LoggerFactory.getLogger(RangeExprEvaluator.class);
+
+  private final Map columnStatMap;
+  private final long rowCount;
+
+  public RangeExprEvaluator(final Map 
columnStatMap, long rowCount) {
+this.columnStatMap = columnStatMap;
+this.rowCount = rowCount;
+  }
+
+  public long getRowCount() {
+return this.rowCount;
+  }
+
+  @Override
+  public Statistics visitUnknown(LogicalExpression e, Void value) throws 
RuntimeException {
+if (e instanceof TypedFieldExpr) {
+  TypedFieldExpr fieldExpr = (TypedFieldExpr) e;
+  final ColumnStatistics columnStatistics = 
columnStatMap.get(fieldExpr.getPath());
+  if (columnStatistics != null) {
+return columnStatistics.getStatistics();
+  } else {
+// field does not exist.
+
Preconditions.checkArgument(fieldExpr.getMajorType().equals(Types.OPTIONAL_INT));
+IntStatistics intStatistics = new IntStatistics();
+intStatistics.setNumNulls(rowCount); // all values are nulls
+return intStatistics;
+  }
+}
+return null;
+  }
+
+  @Override
+  public Statistics visitIntConstant(ValueExpressions.IntExpression expr, 
Void value) throws RuntimeException {
+return getStatistics(expr.getInt());
+  }
+
+  @Override
+  public Statistics visitLongConstant(ValueExpressions.LongExpression 
expr, Void value) throws RuntimeException {
+return getStatistics(expr.getLong());
+  }
+
+  @Override
+  public Statistics visitFloatConstant(ValueExpressions.FloatExpression 
expr, Void value) throws RuntimeException {
+return getStatistics(expr.getFloat());
+  }
+
+  @Override
+  public Statistics visitDoubleConstant(ValueEx

[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86377522
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/PlannerPhase.java ---
@@ -346,6 +355,20 @@ static RuleSet getPruneScanRules(OptimizerRulesContext 
optimizerRulesContext) {
   }
 
   /**
+   *   Get an immutable list of partition pruning rules that will be used 
in logical planning.
+   */
+  static RuleSet getPhysicalPruneScanRules(OptimizerRulesContext 
optimizerRulesContext) {
+final ImmutableSet pruneRules = 
ImmutableSet.builder()
+.add(
+
ParquetPushDownFilter.getFilterOnProject(optimizerRulesContext),
--- End diff --

The ParquetFilterPushdown rule should ideally be applied during logical 
planning since it affects selectivity. I understand that you saw some plan 
regressions when it was done during logical planning. For now, we can leave it 
in the physical planning phase (add some comments here) and create a JIRA to 
revisit it.




[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86376128
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/stat/ParquetMetaStatCollector.java
 ---
@@ -0,0 +1,146 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet.stat;
+
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.store.parquet.Metadata;
+import org.apache.drill.exec.store.parquet.ParquetGroupScan;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.column.statistics.DoubleStatistics;
+import org.apache.parquet.column.statistics.FloatStatistics;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.LongStatistics;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.schema.OriginalType;
+import org.apache.parquet.schema.PrimitiveType;
+import org.joda.time.DateTimeConstants;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+public class ParquetMetaStatCollector implements  ColumnStatCollector{
+  private  final Metadata.ParquetTableMetadataBase parquetTableMetadata;
+  private  final List 
columnMetadataList;
+  final Map implicitColValues;
+
+  public ParquetMetaStatCollector(Metadata.ParquetTableMetadataBase 
parquetTableMetadata, List 
columnMetadataList, Map implicitColValues) {
+this.parquetTableMetadata = parquetTableMetadata;
+this.columnMetadataList = columnMetadataList;
+this.implicitColValues = implicitColValues;
+  }
+
+  @Override
+  public Map collectColStat(Set 
fields) {
+// map from column to ColumnMetadata
+final Map columnMetadataMap = new 
HashMap<>();
+
+// map from column name to column statistics.
+final Map statMap = new HashMap<>();
+
+for (final Metadata.ColumnMetadata columnMetadata : 
columnMetadataList) {
+  SchemaPath schemaPath = 
SchemaPath.getCompoundPath(columnMetadata.getName());
+  columnMetadataMap.put(schemaPath, columnMetadata);
+}
+
+for (final SchemaPath schemaPath : fields) {
+  final PrimitiveType.PrimitiveTypeName primitiveType;
+  final OriginalType originalType;
+
+  final Metadata.ColumnMetadata columnMetadata = 
columnMetadataMap.get(schemaPath);
+
+  if (columnMetadata != null) {
+final Object min = columnMetadata.getMinValue();
+final Object max = columnMetadata.getMaxValue();
+final Long numNull = columnMetadata.getNulls();
+
+primitiveType = 
this.parquetTableMetadata.getPrimitiveType(columnMetadata.getName());
+originalType = 
this.parquetTableMetadata.getOriginalType(columnMetadata.getName());
+final Integer repetitionLevel = 
this.parquetTableMetadata.getRepetitionLevel(columnMetadata.getName());
+
+statMap.put(schemaPath, getStat(min, max, numNull, primitiveType, 
originalType, repetitionLevel));
+  } else {
+final String columnName = schemaPath.getRootSegment().getPath();
+if (implicitColValues.containsKey(columnName)) {
--- End diff --

It is not obvious why implicit column stats are collected as part of 
ParquetMetaStatCollector, since implicit columns are not part of the parquet 
footers. Can it be done elsewhere? If it is absolutely needed here, you should 
add a comment.




[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86385152
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetPredicates.java
 ---
@@ -0,0 +1,334 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.stat;
+
+import org.apache.drill.common.expression.BooleanOperator;
+import org.apache.drill.common.expression.ExpressionPosition;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.LogicalExpressionBase;
+import org.apache.drill.common.expression.visitors.ExprVisitor;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+public abstract  class ParquetPredicates {
+  public static abstract  class ParquetCompPredicate extends 
LogicalExpressionBase implements ParquetFilterPredicate {
+protected final LogicalExpression left;
+protected final LogicalExpression right;
+
+public ParquetCompPredicate(LogicalExpression left, LogicalExpression 
right) {
+  super(left.getPosition());
+  this.left = left;
+  this.right = right;
+}
+
+@Override
+public Iterator iterator() {
+  final List args = new ArrayList<>();
+  args.add(left);
+  args.add(right);
+  return args.iterator();
+}
+
+@Override
+public  T accept(ExprVisitor 
visitor, V value) throws E {
+  return visitor.visitUnknown(this, value);
+}
+
+  }
+
+  public static abstract class ParquetBooleanPredicate extends 
BooleanOperator implements ParquetFilterPredicate {
+public ParquetBooleanPredicate(String name, List 
args, ExpressionPosition pos) {
+  super(name, args, pos);
+}
+
+@Override
+public  T accept(ExprVisitor 
visitor, V value) throws E {
+  return visitor.visitBooleanOperator(this, value);
+}
+  }
+
+  public static class AndPredicate extends ParquetBooleanPredicate {
+public AndPredicate(String name, List args, 
ExpressionPosition pos) {
+  super(name, args, pos);
+}
+
+@Override
+public boolean canDrop(RangeExprEvaluator evaluator) {
+  // "and" : as long as one branch is OK to drop, we can drop it.
+  for (LogicalExpression child : this) {
+if (((ParquetFilterPredicate) child).canDrop(evaluator)) {
+  return true;
+}
+  }
+  return false;
+}
+  }
+
+  public static class OrPredicate extends ParquetBooleanPredicate {
+public OrPredicate(String name, List args, 
ExpressionPosition pos) {
+  super(name, args, pos);
+}
+
+@Override
+public boolean canDrop(RangeExprEvaluator evaluator) {
+  for (LogicalExpression child : this) {
+// "or" : as long as one branch is NOT ok to drop, we can NOT drop it.
+if (! ((ParquetFilterPredicate) child).canDrop(evaluator)) {
+  return false;
+}
+  }
+
+  return true;
+}
+  }
+
+  // is this column chunk composed entirely of nulls?
+  // assumes the column chunk's statistics is not empty
+  protected static boolean isAllNulls(Statistics stat, long rowCount) {
+return stat.getNumNulls() == rowCount;
+  }
+
+  // are there any nulls in this column chunk?
+  // assumes the column chunk's statistics is not empty
+  protected static boolean hasNulls(Statistics stat) {
+return stat.getNumNulls() > 0;
+  }
+
+  /**
+   * EQ (=) predicate
+   */
+  public static class EqualPredicate extends ParquetCompPredicate {
+public EqualPredicate(LogicalExpression left, LogicalExpression right) 
{
+  super(left, r

[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86374570
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/stat/ParquetFooterStatCollector.java
 ---
@@ -0,0 +1,176 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet.stat;
+
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.server.options.OptionManager;
+import org.apache.drill.exec.store.ParquetOutputRecordWriter;
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+import 
org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.LongStatistics;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.joda.time.DateTimeConstants;
+import org.joda.time.DateTimeUtils;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Set;
+
+public class ParquetFooterStatCollector implements ColumnStatCollector {
--- End diff --

For the stat collector, can you add a stopwatch and log the elapsed time? It 
will help with debugging planning time.
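
For example, something along these lines (a sketch using Guava's Stopwatch,
which Drill already uses; doCollect() is a stand-in for the existing collection
logic):

    @Override
    public Map<SchemaPath, ColumnStatistics> collectColStat(Set<SchemaPath> fields) {
      final Stopwatch timer = Stopwatch.createStarted();  // com.google.common.base.Stopwatch
      final Map<SchemaPath, ColumnStatistics> statMap = doCollect(fields);  // existing logic
      logger.debug("Took {} ms to collect statistics for {} columns",
          timer.elapsed(TimeUnit.MILLISECONDS), fields.size());
      return statMap;
    }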




[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86372508
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -1000,6 +1053,81 @@ public long getColumnValueCount(SchemaPath column) {
 
   @Override
   public List getPartitionColumns() {
-return new ArrayList<>(columnTypeMap.keySet());
+return new ArrayList<>(partitionColTypeMap.keySet());
   }
+
+  public GroupScan applyFilter(LogicalExpression filterExpr, UdfUtilities 
udfUtilities,
+  FunctionImplementationRegistry functionImplementationRegistry, 
OptionManager optionManager) {
+if (fileSet.size() == 1 || ! (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v3)) {
--- End diff --

For the version check, can we have an isCompatible() method defined in the 
Metadata class instead of doing the instanceof check?
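
For instance (a sketch of the suggestion only; the method name follows the
proposal above, and the assumption that only the v3 metadata carries usable
per-row-group statistics is mine):

    // In Metadata.ParquetTableMetadataBase:
    public boolean isCompatible() {
      return false;   // older metadata formats lack the per-row-group statistics
    }

    // Overridden in Metadata.ParquetTableMetadata_v3:
    @Override
    public boolean isCompatible() {
      return true;
    }

The caller then becomes: if (fileSet.size() == 1 || !parquetTableMetadata.isCompatible()) { ... }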




[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86376877
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/PlannerPhase.java ---
@@ -346,6 +355,20 @@ static RuleSet getPruneScanRules(OptimizerRulesContext 
optimizerRulesContext) {
   }
 
   /**
+   *   Get an immutable list of partition pruning rules that will be used 
in logical planning.
--- End diff --

This comment seems wrong.




[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86373342
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRGFilterEvaluator.java
 ---
@@ -0,0 +1,252 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.collect.Sets;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.visitors.AbstractExprVisitor;
+import org.apache.drill.exec.compile.sig.ConstantExpressionIdentifier;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.fn.FunctionImplementationRegistry;
+import org.apache.drill.exec.expr.stat.ParquetFilterPredicate;
+import org.apache.drill.exec.expr.stat.RangeExprEvaluator;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.ops.UdfUtilities;
+import org.apache.drill.exec.server.options.OptionManager;
+import org.apache.drill.exec.store.parquet.stat.ColumnStatCollector;
+import org.apache.drill.exec.store.parquet.stat.ColumnStatistics;
+import org.apache.drill.exec.store.parquet.stat.ParquetFooterStatCollector;
+import org.apache.parquet.filter2.predicate.FilterPredicate;
+import org.apache.parquet.filter2.statisticslevel.StatisticsFilter;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+public class ParquetRGFilterEvaluator {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(ParquetRGFilterEvaluator.class);
+
+  public static boolean evalFilter(LogicalExpression expr, ParquetMetadata 
footer, int rowGroupIndex,
+  OptionManager options, FragmentContext fragmentContext) {
+final HashMap emptyMap = new HashMap();
+return evalFilter(expr, footer, rowGroupIndex, options, 
fragmentContext, emptyMap);
+  }
+
+  public static boolean evalFilter(LogicalExpression expr, ParquetMetadata 
footer, int rowGroupIndex,
+  OptionManager options, FragmentContext fragmentContext, Map implicitColValues) {
+// figure out the set of columns referenced in expression.
+final Set schemaPathsInExpr = expr.accept(new 
FieldReferenceFinder(), null);
+final ColumnStatCollector columnStatCollector = new 
ParquetFooterStatCollector(footer, rowGroupIndex, implicitColValues,true, 
options);
+
+Map columnStatisticsMap = 
columnStatCollector.collectColStat(schemaPathsInExpr);
+
+boolean canDrop = canDrop(expr, columnStatisticsMap, 
footer.getBlocks().get(rowGroupIndex).getRowCount(), fragmentContext, 
fragmentContext.getFunctionRegistry());
+return canDrop;
+  }
+
+
+  public static boolean canDrop(ParquetFilterPredicate parquetPredicate, 
Map columnStatisticsMap, long rowCount) {
+boolean canDrop = false;
+if (parquetPredicate != null) {
+  RangeExprEvaluator rangeExprEvaluator = new 
RangeExprEvaluator(columnStatisticsMap, rowCount);
+  canDrop = parquetPredicate.canDrop(rangeExprEvaluator);
+}
+return canDrop;
+  }
+
+
+  public static boolean canDrop(LogicalExpression expr, Map columnStatisticsMap,
+  long rowCount, UdfUtilities udfUtilities, 
FunctionImplementationRegistry functionImplementationRegistry) {
+ErrorCollector errorCollector = new ErrorCollectorImpl();
+LogicalExpression materializedFilter = 
ExpressionTreeMaterializer.materializeFilterExpr(
+expr, columnStatisticsMap, errorCollector, 
functionImplementationRegistry);
+
+if (errorCollector.hasErrors()) {

[GitHub] drill pull request #644: DRILL-4980: Upgrading of the approach of parquet da...

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86386510
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -214,14 +218,24 @@ public static DateCorruptionStatus 
detectCorruptDates(ParquetMetadata footer,
   }
   // written by a tool that wasn't Drill, the dates are not 
corrupted
   return DateCorruptionStatus.META_SHOWS_NO_CORRUPTION;
-} catch (VersionParser.VersionParseException e) {
-  // If we couldn't parse "created by" field, check column 
metadata of date columns
-  return checkForCorruptDateValuesInStatistics(footer, columns, 
autoCorrectCorruptDates);
 }
   }
+} catch (VersionParser.VersionParseException e) {
+  // If we couldn't parse "created by" or "drill version", check 
column metadata of date columns
+  return checkForCorruptDateValuesInStatistics(footer, columns, 
autoCorrectCorruptDates);
 }
   }
 
+  public static boolean isDrillVersionHasCorrectDates(String drillVersion) 
throws VersionParser.VersionParseException {
--- End diff --

Might flow better as "drillVersionHasCorrectDates"

But, see comment above about whether we need this check.




[GitHub] drill pull request #644: DRILL-4980: Upgrading of the approach of parquet da...

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86384226
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -927,15 +927,11 @@ public void setMax(Object max) {
 @JsonProperty List files;
 @JsonProperty List directories;
 @JsonProperty String drillVersion;
-@JsonProperty boolean isDateCorrect;
+@JsonProperty int writerVersion;
--- End diff --

This property is Jackson-serialized. How does Jackson handle older files 
written without this version? According to this post 
(http://stackoverflow.com/questions/8320993/jackson-what-happens-if-a-property-is-missing),
 "Setter methods are only invoked for properties with explicit values." This 
means that older files without the writerVersion set won't call the function to 
deserialize the writer version, and the version will default to the newest 
version. I suspect that either A) I'm misunderstanding what this code does, or 
B) we have a backward-compatibility issue. Please explain which.
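
To make the concern concrete, a minimal, self-contained illustration (a toy
class standing in for the metadata POJO, not Drill code):

    import com.fasterxml.jackson.databind.ObjectMapper;

    public class MissingPropertyDemo {
      static class Meta {
        public String drillVersion;
        public int writerVersion = 2;   // stands in for the current WRITER_VERSION default
      }

      public static void main(String[] args) throws Exception {
        // JSON written by an older Drill: no "writerVersion" key at all
        Meta meta = new ObjectMapper()
            .readValue("{\"drillVersion\":\"1.8.0\"}", Meta.class);
        // Prints 2: Jackson never touches an absent property, so the old file
        // silently inherits the newest writer version.
        System.out.println(meta.writerVersion);
      }
    }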




[GitHub] drill pull request #644: DRILL-4980: Upgrading of the approach of parquet da...

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86384912
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -944,14 +940,16 @@ public 
ParquetTableMetadata_v2(ParquetTableMetadataBase parquetTable,
   this.directories = directories;
   this.columnTypeInfo = ((ParquetTableMetadata_v2) 
parquetTable).columnTypeInfo;
   this.drillVersion = DrillVersionInfo.getVersion();
-  this.isDateCorrect = true;
+  this.writerVersion = ParquetWriter.WRITER_VERSION;
 }
 
 public ParquetTableMetadata_v2(List files, 
List directories,
 ConcurrentHashMap columnTypeInfo) {
   this.files = files;
   this.directories = directories;
   this.columnTypeInfo = columnTypeInfo;
+  this.drillVersion = DrillVersionInfo.getVersion();
+  this.writerVersion = ParquetWriter.WRITER_VERSION;
--- End diff --

Given a set of files and directories, we're assuming the writer version? 
This is supposed to be the version of the Drill Parquet writer that wrote the 
actual Parquet files, right? Not the version of the metadata files that hold 
information about the Parquet files? Or, are we using the same writer version 
for both purposes? If we need two versions (one for Parquet, another for 
metadata), then we should introduce a separate Parquet metadata version. (Or, 
I'm misunderstanding the code...)




[GitHub] drill pull request #644: DRILL-4980: Upgrading of the approach of parquet da...

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86383165
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -927,15 +927,11 @@ public void setMax(Object max) {
 @JsonProperty List files;
 @JsonProperty List directories;
 @JsonProperty String drillVersion;
-@JsonProperty boolean isDateCorrect;
+@JsonProperty int writerVersion;
 
 public ParquetTableMetadata_v2() {
-  super();
-}
-
-public ParquetTableMetadata_v2(boolean isDateCorrect) {
   this.drillVersion = DrillVersionInfo.getVersion();
-  this.isDateCorrect = isDateCorrect;
+  this.writerVersion = ParquetWriter.WRITER_VERSION;
--- End diff --

We set the writer version to the current version when we create the 
metadata. Is this same metadata used for both read and write? If so, we have 
the potential for a nasty bug. A (new) reader fails to set the writerVersion 
value from actual file metadata. The value will default to the latest, even if 
the file itself happens to be older.

I wonder if it makes sense to pass the version into the constructor. The 
Writer passes in the current writer version. The reader must pass in the value 
found in the file.

Or, is this metadata used only for writing, but not reading? If that is true, 
perhaps we can document that in the code somewhere. (I looked but did not find 
anything.)

Or, is this metadata cached from scanning actual files? If so, isn't 
defaulting the writer version simply asking for trouble if someone forgets to 
set this field based on actual file version?
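
For instance, roughly (a sketch of the suggestion, not the PR's code; the
reader-side value is illustrative):

    public ParquetTableMetadata_v2(int writerVersion) {
      this.drillVersion = DrillVersionInfo.getVersion();
      this.writerVersion = writerVersion;
    }

    // Writer side:
    new ParquetTableMetadata_v2(ParquetWriter.WRITER_VERSION);
    // Reader side: pass whatever version was actually found in the file or cache,
    // e.g. new ParquetTableMetadata_v2(versionReadFromFile);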




[GitHub] drill pull request #644: DRILL-4980: Upgrading of the approach of parquet da...

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86387445
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -189,19 +189,23 @@ public static DateCorruptionStatus 
detectCorruptDates(ParquetMetadata footer,
 
 String createdBy = footer.getFileMetaData().getCreatedBy();
 String drillVersion = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.DRILL_VERSION_PROPERTY);
-String isDateCorrect = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.IS_DATE_CORRECT_PROPERTY);
-if (drillVersion != null) {
-  return Boolean.valueOf(isDateCorrect) ? 
DateCorruptionStatus.META_SHOWS_NO_CORRUPTION
-  : DateCorruptionStatus.META_SHOWS_CORRUPTION;
-} else {
-  // Possibly an old, un-migrated Drill file, check the column 
statistics to see if min/max values look corrupt
-  // only applies if there is a date column selected
-  if (createdBy.equals("parquet-mr")) {
-// loop through parquet column metadata to find date columns, 
check for corrupt values
-return checkForCorruptDateValuesInStatistics(footer, columns, 
autoCorrectCorruptDates);
+String writerVersion = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.WRITER_VERSION_PROPERTY);
+// This flag can be present in parquet files which were generated with 
1.9.0-SNAPSHOT drill version
+final String isDateCorrectFlag = "is.date.correct";
+String isDateCorrect = 
footer.getFileMetaData().getKeyValueMetaData().get(isDateCorrectFlag);
+try {
+  if (drillVersion != null) {
+return (writerVersion != null && Integer.parseInt(writerVersion) 
>= 2) || Boolean.valueOf(isDateCorrect)
+|| isDrillVersionHasCorrectDates(drillVersion)
--- End diff --

Isn't the Drill version check redundant? We know that all Drill versions 
from 1.9.0 onwards will have a Drill Parquet writer version in the file. So, we 
only need check the writer version, if we have it. If we don't have it, then 
the only info we have is the Drill version.

We might want to explain this logic in a comment.

The purpose of adding the writer version is that all future format 
decisions can be made on the writer version independent of Drill version. For 
example, suppose we change something in Drill 1.10. Drill 1.10.0-SNAPSHOT will 
start with writer version 2. Later we'll make a change and 1.10.0-SNAPSHOT will 
write files with writer version 3. The Drill version is ambiguous; the writer 
version is spot on.




[GitHub] drill pull request #644: DRILL-4980: Upgrading of the approach of parquet da...

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86388085
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
 ---
@@ -78,7 +78,7 @@
   private static final int MAXIMUM_RECORD_COUNT_FOR_CHECK = 1;
 
   public static final String DRILL_VERSION_PROPERTY = "drill.version";
-  public static final String IS_DATE_CORRECT_PROPERTY = "is.date.correct";
+  public static final String WRITER_VERSION_PROPERTY = 
"parquet-writer.version";
--- End diff --

We know we are Drill, but others might be confused. Perhaps 
"drill-writer.version" or "drill.writer-version" (to be consistent with 
"drill.version".)




[GitHub] drill pull request #644: DRILL-4980: Upgrading of the approach of parquet da...

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86388727
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetWriter.java
 ---
@@ -40,6 +40,8 @@
 public class ParquetWriter extends AbstractWriter {
   static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(ParquetWriter.class);
 
+  public static final int WRITER_VERSION = 2;
--- End diff --

This deserves an explanation. Something like:

Version of Drill's Parquet writer. Increment this version (by 1) any time 
we make any format change to the file. Format changes include 1) supporting new 
data types, 2) changes to the format of data fields, 3) adding new metadata to 
the file footer, etc.

Newer readers must be able to read old files. The Writer version tells the 
Parquet reader how to interpret fields or metadata when that data changes 
format from one writer version to another.
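
In code form, roughly (a sketch of the suggested Javadoc on the existing
constant):

    /**
     * Version of Drill's Parquet writer. Increment this version (by 1) any time
     * we make a format change to the files we write: supporting new data types,
     * changing the format of data fields, adding new metadata to the file
     * footer, and so on.
     *
     * Newer readers must be able to read old files. The writer version tells
     * the Parquet reader how to interpret fields or metadata whose format
     * changed from one writer version to another.
     */
    public static final int WRITER_VERSION = 2;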




[GitHub] drill pull request #644: DRILL-4980: Upgrading of the approach of parquet da...

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86387810
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -214,14 +218,24 @@ public static DateCorruptionStatus 
detectCorruptDates(ParquetMetadata footer,
   }
   // written by a tool that wasn't Drill, the dates are not 
corrupted
   return DateCorruptionStatus.META_SHOWS_NO_CORRUPTION;
-} catch (VersionParser.VersionParseException e) {
-  // If we couldn't parse "created by" field, check column 
metadata of date columns
-  return checkForCorruptDateValuesInStatistics(footer, columns, 
autoCorrectCorruptDates);
 }
   }
+} catch (VersionParser.VersionParseException e) {
+  // If we couldn't parse "created by" or "drill version", check 
column metadata of date columns
+  return checkForCorruptDateValuesInStatistics(footer, columns, 
autoCorrectCorruptDates);
 }
   }
 
+  public static boolean isDrillVersionHasCorrectDates(String drillVersion) 
throws VersionParser.VersionParseException {
+VersionParser.ParsedVersion parsedDrillVersion = 
parseDrillVersion(drillVersion);
+SemanticVersion semVer = parsedDrillVersion.getSemanticVersion();
+// true for 1.9.0, 2.0.0-SNAPSHOT, 2.0.0 etc. (false for 1.9.0-SNAPSHOT, 1.8.0, 1.8.0-SNAPSHOT etc.)
+return semVer != null && semVer.compareTo(new SemanticVersion(1, 9, 
0)) >= 0;
+  }
+
+  public static VersionParser.ParsedVersion parseDrillVersion(String 
drillVersion) throws VersionParser.VersionParseException {
+return VersionParser.parse("drill version " + drillVersion + " (build 
1234)");
--- End diff --

The " (build 1234)" deserves some explanation. Does the parse function 
demand it so that we have to make up something here just to get the version to 
parse?




[GitHub] drill pull request #645: DRILL-4995: Allow lazy init when dynamic UDF suppor...

2016-11-03 Thread arina-ielchiieva
GitHub user arina-ielchiieva opened a pull request:

https://github.com/apache/drill/pull/645

DRILL-4995: Allow lazy init when dynamic UDF support is disabled



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/arina-ielchiieva/drill DRILL-4995

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/645.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #645


commit b0b7446697c8b64004f194dd0c1b2ed2fdd23e79
Author: Arina Ielchiieva 
Date:   2016-11-03T16:20:04Z

DRILL-4995: Allow lazy init when dynamic UDF support is disabled






question about drill push down

2016-11-03 Thread 马云
Hi Drill dev,

I want to use Drill in our project, and I am facing a difficult issue.
As you know, Drill can push down a filter operation to the scan in a storage
plugin, such as in HBaseStoragePlugin.java:


  @Override
  public Set<StoragePluginOptimizerRule> getPhysicalOptimizerRules(OptimizerRulesContext optimizerRulesContext) {
    return ImmutableSet.of(HBasePushFilterIntoScan.FILTER_ON_SCAN,
        HBasePushFilterIntoScan.FILTER_ON_PROJECT);
  }




My question: I want to develop a customized storage plugin, and I hope to push 
down "group by" into the scan. Is there any way to do it?
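
For illustration, this is roughly the shape I have in mind (a sketch only: the
rule plumbing mirrors HBasePushFilterIntoScan, but MyGroupScan and the rel
classes to match are my assumptions, not working code):

    import org.apache.calcite.plan.RelOptRuleCall;
    import org.apache.calcite.rel.core.Aggregate;
    import org.apache.calcite.rel.core.TableScan;
    import org.apache.drill.exec.planner.logical.RelOptHelper;
    import org.apache.drill.exec.store.StoragePluginOptimizerRule;

    public class MyPushGroupByIntoScan extends StoragePluginOptimizerRule {
      public static final MyPushGroupByIntoScan INSTANCE = new MyPushGroupByIntoScan();

      private MyPushGroupByIntoScan() {
        // match an Aggregate sitting directly on top of a TableScan
        super(RelOptHelper.some(Aggregate.class, RelOptHelper.any(TableScan.class)),
            "MyPushGroupByIntoScan");
      }

      @Override
      public void onMatch(RelOptRuleCall call) {
        final Aggregate agg = call.rel(0);
        final TableScan scan = call.rel(1);
        // Build a new scan whose GroupScan carries the GROUP BY keys and
        // aggregate calls, then replace the Aggregate + Scan pair with it:
        // call.transformTo(newScanWithAggregation);
      }
    }

and my plugin's getPhysicalOptimizerRules() would return
ImmutableSet.of(MyPushGroupByIntoScan.INSTANCE). Is this the right direction?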
Please give advice, thanks very much!




thanks
Yun

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86392842
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/OperatorContextImpl.java 
---
@@ -95,6 +95,11 @@ public DrillBuf getManagedBuffer(int size) {
 return manager.getManagedBuffer(size);
   }
 
+  // Allow and operator to use the thread pool
--- End diff --

"Allow **an** operator..." (typo)




[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86394036
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/ParquetRecordReader.java
 ---
@@ -207,6 +207,10 @@ public OperatorContext getOperatorContext() {
 return operatorContext;
   }
 
+  public FragmentContext getFragmentContext() {
--- End diff --

Giving the Parquet reader visibility to the FragmentContext causes tight 
coupling between the reader and the fragment (and thus to the Drillbit.) Is 
there a way to keep the reader independent of the fragment context for easier 
unit testing?
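
For example, one way to express that (a sketch only; the interface name and its
contents are hypothetical, chosen to show the shape of the decoupling rather
than what the reader actually needs):

    // Narrow interface the reader depends on instead of the whole FragmentContext.
    public interface ReaderEnvironment {
      OptionManager getOptions();
      // ... plus whatever else the reader needs, e.g. access to the scan thread pool
    }

    // Production adapter backed by the real FragmentContext:
    public class FragmentReaderEnvironment implements ReaderEnvironment {
      private final FragmentContext context;

      public FragmentReaderEnvironment(FragmentContext context) {
        this.context = context;
      }

      @Override
      public OptionManager getOptions() {
        return context.getOptions();
      }
    }

A unit test could then supply a trivial stub implementation without standing up
a fragment or a Drillbit.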




[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86405494
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---
@@ -0,0 +1,460 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util.filereader;
+
+import com.google.common.base.Preconditions;
+import io.netty.buffer.DrillBuf;
+import org.apache.drill.common.config.DrillConfig;
+import org.apache.drill.exec.memory.BufferAllocator;
+import org.apache.drill.exec.memory.RootAllocatorFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.ByteBufferReadable;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.Footer;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.util.CompatibilityUtil;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+
+/**
+ * BufferedDirectBufInputStream  reads from the
+ * underlying InputStream in blocks of data, into an
+ * internal buffer. The internal buffer is a direct memory backed
+ * buffer. The implementation is similar to the BufferedInputStream
+ * class except that the internal buffer is a DrillBuf and
+ * not a byte array. The mark and reset methods of the underlying
+ * InputStream are not supported.
+ */
+public class BufferedDirectBufInputStream extends DirectBufInputStream 
implements Closeable {
+
+  private static final org.slf4j.Logger logger =
+  
org.slf4j.LoggerFactory.getLogger(BufferedDirectBufInputStream.class);
+
+  private static int defaultBufferSize = 8192 * 1024; // 8 MiB
+  private static int defaultTempBufferSize = 8192; // 8 KiB
+
+  /**
+   * The internal buffer to keep data read from the underlying inputStream.
+   * internalBuffer[0] through internalBuffer[count-1]
+   * contains data read from the underlying input stream.
+   */
+  protected volatile DrillBuf internalBuffer; // the internal buffer
+
+  /**
+   * The number of valid bytes in internalBuffer.
+   * count is always in the range [0, internalBuffer.capacity].
+   * internalBuffer[count-1] is the last valid byte in the buffer.
+   */
+  protected int count;
+
+  /**
+   * The current read position in the buffer; the index of the next
+   * character to be read from the internalBuffer array.
+   * 
+   * This value is always in the range [0,count].
+   * If curPosInBuffer is equal to count then we have read
+   * all the buffered data and the next read (or skip) will require more data to be read
+   * from the underlying input stream.
+   */
+  protected int curPosInBuffer;
+
+  protected long curPosInStream; // current offset in the input stream
+
+  private final int bufSize;
+
+  private volatile DrillBuf tempBuffer; // a temp Buffer for use by 
read(byte[] buf, int off, int len)
+
+
+  private DrillBuf getBuf() throws IOException {
+checkInputStreamState();
+if (internalBuffer == null) {
+  throw new IOException("Input stream is closed.");
+}
+return this.internalBuffer;
+  }
+
+  /**
+   * Creates a BufferedDirectBufInputStream
+   * with the default (8 MiB) buffer size.
+   */
+  public BufferedDirectBufInputStream(InputStream in, BufferAllocator 
allocator, String id,
+  long startOffset, long totalByteSize, boolean enableHints) {
+this(in, allocator, id, startOffset, totalByteSize, defaultBufferSize, 
enableHints);
+  }
+
+  /**
+   * Creates a BufferedDi

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86406660
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/DirectBufInputStream.java
 ---
@@ -0,0 +1,166 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util.filereader;
+
+import com.google.common.base.Preconditions;
+import io.netty.buffer.DrillBuf;
+import org.apache.drill.exec.memory.BufferAllocator;
+import org.apache.hadoop.fs.ByteBufferReadable;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.parquet.hadoop.util.CompatibilityUtil;
+
+import java.io.FilterInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.lang.reflect.InvocationTargetException;
+import java.lang.reflect.Method;
+import java.nio.ByteBuffer;
+
+public class DirectBufInputStream extends FilterInputStream {
+
+  private static final org.slf4j.Logger logger =
+  org.slf4j.LoggerFactory.getLogger(DirectBufInputStream.class);
+
+  protected boolean enableHints = true;
+  protected String streamId; // a name for logging purposes only
+  protected BufferAllocator allocator;
+  /**
+   * The length of the data we expect to read. The caller may, in fact,
+   * ask for more or less bytes. However this is useful for providing 
hints where
+   * the underlying InputStream supports hints (e.g. fadvise)
+   */
+  protected final long totalByteSize;
+
+  /**
+   * The offset in the underlying stream to start reading from
+   */
+  protected final long startOffset;
+
+  public DirectBufInputStream(InputStream in, BufferAllocator allocator, 
String id, long startOffset,
+  long totalByteSize, boolean enableHints) {
+super(in);
+Preconditions.checkArgument(startOffset >= 0);
+Preconditions.checkArgument(totalByteSize >= 0);
+this.streamId = id;
+this.allocator = allocator;
+this.startOffset = startOffset;
+this.totalByteSize = totalByteSize;
+this.enableHints = enableHints;
+  }
+
+  public void init() throws IOException, UnsupportedOperationException {
+checkStreamSupportsByteBuffer();
+if (enableHints) {
+  fadviseIfAvailable(getInputStream(), this.startOffset, 
this.totalByteSize);
+}
+getInputStream().seek(this.startOffset);
+return;
+  }
+
+  public int read() throws IOException {
+return getInputStream().read();
+  }
+
+  public synchronized int read(DrillBuf buf, int off, int len) throws 
IOException {
+buf.clear();
+ByteBuffer directBuffer = buf.nioBuffer(0, len);
--- End diff --

Seems to create garbage on every read (a new ByteBuffer view with the 
proper offset and length). Any way to do the read without creating garbage?
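
A possible direction, purely as a sketch inside DirectBufInputStream: cache the nio view so
repeated reads over the same DrillBuf reuse one wrapper instead of allocating a new one per
call. The field and method names below are illustrative, and whether the saving justifies
the extra bookkeeping would need profiling.

  // Illustrative sketch, not part of the patch: reuse the nio view between reads.
  private ByteBuffer cachedView;   // last view handed out
  private DrillBuf cachedViewBuf;  // the DrillBuf the view wraps

  private ByteBuffer viewFor(DrillBuf buf, int len) {
    if (cachedView == null || cachedViewBuf != buf || cachedView.capacity() < len) {
      cachedView = buf.nioBuffer(0, len);  // wrap only when we really have to
      cachedViewBuf = buf;
    }
    cachedView.clear();      // position back to 0
    cachedView.limit(len);   // never expose more than the caller asked for
    return cachedView;
  }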


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86393092
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/PageReader.java
 ---
@@ -48,22 +48,26 @@
 import org.apache.parquet.hadoop.metadata.CompressionCodecName;
 import org.apache.parquet.schema.PrimitiveType;
 
-import com.google.common.base.Preconditions;
-import com.google.common.base.Stopwatch;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.TimeUnit;
 
-import io.netty.buffer.ByteBuf;
-import io.netty.buffer.DrillBuf;
+import static org.apache.parquet.column.Encoding.valueOf;
+import static 
org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics;
 
 // class to keep track of the read position of variable length columns
 final class PageReader {
-  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(PageReader.class);
+  static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(
+  org.apache.drill.exec.store.parquet.columnreaders.PageReader.class);
 
   public static final ParquetMetadataConverter METADATA_CONVERTER = 
ParquetFormatPlugin.parquetMetadataConverter;
 
-  private final ColumnReader parentColumnReader;
-  private final ColumnDataReader dataReader;
-
-  // buffer to store bytes of current page
+  private final 
org.apache.drill.exec.store.parquet.columnreaders.ColumnReader 
parentColumnReader;
+  //private final ColumnDataReader dataReader;
+  private final DirectBufInputStream dataReader;
+  //der; buffer to store bytes of current page
--- End diff --

der; ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86404042
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86394551
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/PageReader.java
 ---
@@ -108,17 +112,31 @@
 
   private final ParquetReaderStats stats;
 
-  PageReader(ColumnReader parentStatus, FileSystem fs, Path path, 
ColumnChunkMetaData columnChunkMetaData)
-throws ExecutionSetupException{
+  
PageReader(org.apache.drill.exec.store.parquet.columnreaders.ColumnReader 
parentStatus, FileSystem fs, Path path, ColumnChunkMetaData columnChunkMetaData)
+throws ExecutionSetupException {
 this.parentColumnReader = parentStatus;
 allocatedDictionaryBuffers = new ArrayList();
 codecFactory = parentColumnReader.parentReader.getCodecFactory();
 this.stats = parentColumnReader.parentReader.parquetReaderStats;
 long start = columnChunkMetaData.getFirstDataPageOffset();
 try {
   inputStream  = fs.open(path);
-  this.dataReader = new ColumnDataReader(inputStream, start, 
columnChunkMetaData.getTotalSize());
-  loadDictionaryIfExists(parentStatus, columnChunkMetaData, 
inputStream);
+  BufferAllocator allocator =  
parentColumnReader.parentReader.getOperatorContext().getAllocator();
+  //TODO: make read batch size configurable
+  columnChunkMetaData.getTotalUncompressedSize();
+  boolean useBufferedReader  = 
parentColumnReader.parentReader.getFragmentContext().getOptions()
+  
.getOption(ExecConstants.PARQUET_PAGEREADER_USE_BUFFERED_READ).bool_val;
+  if (useBufferedReader) {
+this.dataReader = new BufferedDirectBufInputStream(inputStream, 
allocator, path.getName(),
+columnChunkMetaData.getStartingPos(), 
columnChunkMetaData.getTotalSize(), 8 * 1024 * 1024,
--- End diff --

Should this be declared as a constant? Is it supposed to match 
BufferedDirectBufInputStream.directBufferSize below?
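
A sketch of the constant form, assuming the intent is for this call site and the stream's
own default to stay in sync; the name is illustrative and nothing here adds a new system
option.

  // Sketch only: give the literal a name so the call site and
  // BufferedDirectBufInputStream cannot silently drift apart.
  private static final int PAGE_READ_BUFFER_SIZE = 8 * 1024 * 1024;  // 8 MiB
  // ... then pass PAGE_READ_BUFFER_SIZE where the literal 8 * 1024 * 1024 appears above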


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86405791
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86405264
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86405115
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86399835
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86395060
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---
@@ -0,0 +1,467 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util.filereader;
+
+import com.google.common.base.Preconditions;
+import io.netty.buffer.DrillBuf;
+import org.apache.drill.common.config.DrillConfig;
+import org.apache.drill.exec.memory.BufferAllocator;
+import org.apache.drill.exec.memory.RootAllocatorFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.Footer;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.util.CompatibilityUtil;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+
+/**
+ * BufferedDirectBufInputStream  reads from the
+ * underlying InputStream in blocks of data, into an
+ * internal buffer. The internal buffer is a direct memory backed
+ * buffer. The implementation is similar to the BufferedInputStream
+ * class except that the internal buffer is a DrillBuf and
+ * not a byte array. The mark and reset methods of the underlying
+ * InputStream are not supported.
+ */
+public class BufferedDirectBufInputStream extends DirectBufInputStream 
implements Closeable {
+
+  private static final org.slf4j.Logger logger =
+  
org.slf4j.LoggerFactory.getLogger(BufferedDirectBufInputStream.class);
+
+  private static int defaultBufferSize = 8192 * 1024; // 8 MiB
--- End diff --

MiB is a newer term, but it does appear here and there.

Looks like defaultBufferSize may be a constant (since it is static.) Should 
it be DEFAULT_BUFFER_SIZE?

If this is not a constant, then what are the semantics of setting it? How 
do we ensure it is set (statically) before the first buffering reader is 
created?
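
If the intent is a constant, the straightforward form is below (values unchanged); as static
finals they are fixed at class-initialization time, so the "set before the first buffering
reader is created" question goes away. If the sizes are meant to be tunable instead, they
would have to become instance fields fed from configuration rather than statics.

  private static final int DEFAULT_BUFFER_SIZE = 8192 * 1024;  // 8 MiB
  private static final int DEFAULT_TEMP_BUFFER_SIZE = 8192;    // 8 KiB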


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86403197
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86399582
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86400794
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86406381
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/DirectBufInputStream.java
 ---
@@ -0,0 +1,166 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util.filereader;
+
+import com.google.common.base.Preconditions;
+import io.netty.buffer.DrillBuf;
+import org.apache.drill.exec.memory.BufferAllocator;
+import org.apache.hadoop.fs.ByteBufferReadable;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.parquet.hadoop.util.CompatibilityUtil;
+
+import java.io.FilterInputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.lang.reflect.InvocationTargetException;
+import java.lang.reflect.Method;
+import java.nio.ByteBuffer;
+
+public class DirectBufInputStream extends FilterInputStream {
+
+  private static final org.slf4j.Logger logger =
+  org.slf4j.LoggerFactory.getLogger(DirectBufInputStream.class);
+
+  protected boolean enableHints = true;
+  protected String streamId; // a name for logging purposes only
+  protected BufferAllocator allocator;
+  /**
+   * The length of the data we expect to read. The caller may, in fact,
+   * ask for more or less bytes. However this is useful for providing 
hints where
+   * the underlying InputStream supports hints (e.g. fadvise)
+   */
+  protected final long totalByteSize;
+
+  /**
+   * The offset in the underlying stream to start reading from
+   */
+  protected final long startOffset;
+
+  public DirectBufInputStream(InputStream in, BufferAllocator allocator, 
String id, long startOffset,
+  long totalByteSize, boolean enableHints) {
+super(in);
+Preconditions.checkArgument(startOffset >= 0);
+Preconditions.checkArgument(totalByteSize >= 0);
+this.streamId = id;
+this.allocator = allocator;
+this.startOffset = startOffset;
+this.totalByteSize = totalByteSize;
+this.enableHints = enableHints;
+  }
+
+  public void init() throws IOException, UnsupportedOperationException {
+checkStreamSupportsByteBuffer();
+if (enableHints) {
+  fadviseIfAvailable(getInputStream(), this.startOffset, 
this.totalByteSize);
+}
+getInputStream().seek(this.startOffset);
+return;
+  }
+
+  public int read() throws IOException {
+return getInputStream().read();
+  }
+
+  public synchronized int read(DrillBuf buf, int off, int len) throws 
IOException {
+buf.clear();
+ByteBuffer directBuffer = buf.nioBuffer(0, len);
+int lengthLeftToRead = len;
+while (lengthLeftToRead > 0) {
+  lengthLeftToRead -= CompatibilityUtil.getBuf(getInputStream(), 
directBuffer, lengthLeftToRead);
+}
+buf.writerIndex(len);
+return len;
+  }
+
+  public synchronized DrillBuf getNext(int bytes) throws IOException {
+DrillBuf b = allocator.buffer(bytes);
+int bytesRead = read(b, 0, bytes);
+if (bytesRead <= -1) {
+  b.release();
+  return null;
+}
+return b;
+  }
+
+  public long getPos() throws IOException {
+return getInputStream().getPos();
+  }
+
+  public boolean hasRemainder() throws IOException {
+return getInputStream().available() > 0;
+  }
+
+  protected FSDataInputStream getInputStream() throws IOException {
+// Make sure stream is open
+checkInputStreamState();
+return (FSDataInputStream) in;
+  }
+
+  protected void checkInputStreamState() throws IOException {
+if (in == null) {
+  throw new IOException("Input stream is closed.");
+}
+  }
+
+  protected void checkStreamSupportsByteBuffer() throws 
UnsupportedOperationException {
+// Check input stream supports ByteBuffer
  

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86398874
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---
@@ -0,0 +1,460 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util.filereader;
+
+import com.google.common.base.Preconditions;
+import io.netty.buffer.DrillBuf;
+import org.apache.drill.common.config.DrillConfig;
+import org.apache.drill.exec.memory.BufferAllocator;
+import org.apache.drill.exec.memory.RootAllocatorFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.ByteBufferReadable;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.Footer;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.util.CompatibilityUtil;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+
+/**
+ * BufferedDirectBufInputStream  reads from the
+ * underlying InputStream in blocks of data, into an
+ * internal buffer. The internal buffer is a direct memory backed
+ * buffer. The implementation is similar to the 
BufferedInputStream
+ * class except that the internal buffer is a Drillbuf and
+ * not a byte array. The mark and reset methods of the underlying
+ * InputStream are not supported.
+ */
+public class BufferedDirectBufInputStream extends DirectBufInputStream 
implements Closeable {
+
+  private static final org.slf4j.Logger logger =
+  
org.slf4j.LoggerFactory.getLogger(BufferedDirectBufInputStream.class);
+
+  private static int defaultBufferSize = 8192 * 1024; // 8 MiB
+  private static int defaultTempBufferSize = 8192; // 8 KiB
+
+  /**
+   * The internal buffer to keep data read from the underlying inputStream.
+   * internalBuffer[0] through internalBuffer[count-1]
+   * contains data read from the underlying input stream.
+   */
+  protected volatile DrillBuf internalBuffer; // the internal buffer
+
+  /**
+   * The number of valid bytes in internalBuffer.
+   *  count  is always in the range 
[0,internalBuffer.capacity]
+   * internalBuffer[count-1] is the last valid byte in the 
buffer.
+   */
+  protected int count;
+
+  /**
+   * The current read position in the buffer; the index of the next
+   * character to be read from the internalBuffer array.
+   * 
+   * This value is always in the range [0,count].
+   * If curPosInBuffer is equal to count then we have read
+   * all the buffered data and the next read (or skip) will require more 
data to be read
+   * from the underlying input stream.
+   */
+  protected int curPosInBuffer;
+
+  protected long curPosInStream; // current offset in the input stream
+
+  private final int bufSize;
+
+  private volatile DrillBuf tempBuffer; // a temp Buffer for use by 
read(byte[] buf, int off, int len)
+
+
+  private DrillBuf getBuf() throws IOException {
+checkInputStreamState();
+if (internalBuffer == null) {
+  throw new IOException("Input stream is closed.");
+}
+return this.internalBuffer;
+  }
+
+  /**
+   * Creates a BufferedDirectBufInputStream
+   * with the default (8 MiB) buffer size.
+   */
+  public BufferedDirectBufInputStream(InputStream in, BufferAllocator 
allocator, String id,
+  long startOffset, long totalByteSize, boolean enableHints) {
+this(in, allocator, id, startOffset, totalByteSize, defaultBufferSize, 
enableHints);
+  }
+
+  /**
+   * Creates a BufferedDi

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86402801
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86393658
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/PageReader.java
 ---
@@ -370,7 +400,11 @@ public void clearDictionaryBuffers() {
   }
 
   public void clear(){
-this.dataReader.clear();
+try {
+  this.dataReader.close();
+} catch (IOException e) {
+  //TODO: Throw UserException
--- End diff --

Reader IO errors on close are benign. Change comment to

// Ignore

?
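
A minimal sketch of the suggested change, for reference (the surrounding method body is abbreviated and illustrative, not the exact patch):

public void clear() {
  try {
    this.dataReader.close();
  } catch (IOException e) {
    // Ignore: a failure while closing the reader is benign at this point.
  }
  // ... release any remaining page/dictionary buffers ...
}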


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86395782
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---
@@ -0,0 +1,460 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util.filereader;
+
+import com.google.common.base.Preconditions;
+import io.netty.buffer.DrillBuf;
+import org.apache.drill.common.config.DrillConfig;
+import org.apache.drill.exec.memory.BufferAllocator;
+import org.apache.drill.exec.memory.RootAllocatorFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.ByteBufferReadable;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.Footer;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.metadata.BlockMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.util.CompatibilityUtil;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+
+/**
+ * BufferedDirectBufInputStream  reads from the
+ * underlying InputStream in blocks of data, into an
+ * internal buffer. The internal buffer is a direct memory backed
+ * buffer. The implementation is similar to the 
BufferedInputStream
+ * class except that the internal buffer is a Drillbuf and
+ * not a byte array. The mark and reset methods of the underlying
+ * InputStream are not supported.
--- End diff --

Is this class entirely necessary? A buffered input stream uses heap memory, 
but does so in a known way. Buffers are of limited life and limited size. Is 
the benefit of using direct memory worth the cost of creating, testing and 
maintaining a separate class? If so, perhaps the comment can explain that 
benefit and why using heap memory here is not a good idea.
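
For context, a rough usage sketch of the stream classes quoted in this diff (the input stream, allocator, and chunk offset/length are assumed to come from the Parquet footer and fragment context; the helper method itself is illustrative, not part of the patch):

// Assumes an open FSDataInputStream `in`, a BufferAllocator `allocator`, and the
// column chunk offset/length taken from the Parquet footer.
DrillBuf readChunk(FSDataInputStream in, BufferAllocator allocator,
    long startOffset, long totalBytes) throws IOException {
  BufferedDirectBufInputStream reader =
      new BufferedDirectBufInputStream(in, allocator, "chunk-0",
          startOffset, totalBytes, true /* enableHints */);
  reader.init();                              // seeks to startOffset, issues fadvise hint if enabled
  try {
    return reader.getNext((int) totalBytes);  // direct-memory buffer; the caller must release it
  } finally {
    reader.close();
  }
}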


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86402453
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86404983
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86398411
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---

[GitHub] drill pull request #611: Drill-4800: Improve parquet reader performance

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/611#discussion_r86397371
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/filereader/BufferedDirectBufInputStream.java
 ---

Updated Documentation, DRILL-3423

2016-11-03 Thread Charles Givre
Devs,
What is the process for submitting updates to the Drill documentation for
this new feature? (HTTPD Log Parser)  I'd be happy to write it up.
-- C


Re: Updated Documentation, DRILL-3423

2016-11-03 Thread Parth Chandra
Here's what I do -
  Create a JIRA.
  Write documentation as a Gist.
  Add the link to the Gist in the JIRA.
  Change label in the JIRA to doc-impacting


On Thu, Nov 3, 2016 at 11:15 AM, Charles Givre  wrote:

> Devs,
> What is the process for submitting updates to the Drill documentation for
> this new feature? (HTTPD Log Parser)  I'd be happy to write it up.
> -- C
>


[jira] [Created] (DRILL-4996) Parquet Date auto-correction is not working in auto-partitioned parquet files generated by drill-1.6

2016-11-03 Thread Rahul Challapalli (JIRA)
Rahul Challapalli created DRILL-4996:


 Summary: Parquet Date auto-correction is not working in 
auto-partitioned parquet files generated by drill-1.6
 Key: DRILL-4996
 URL: https://issues.apache.org/jira/browse/DRILL-4996
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Parquet
Reporter: Rahul Challapalli
Priority: Critical


git.commit.id.abbrev=4ee1d4c

Below are the steps I followed to generate the data :
{code}
1. Generate a parquet file with date column using hive1.2
2. Use drill 1.6 to create auto-partitioned parquet files partitioned on the 
date column
{code}

Now the query below returns wrong results:
{code}
select i_rec_start_date, i_size from 
dfs.`/drill/testdata/parquet_date/auto_partition/item_multipart_autorefresh`  
group by i_rec_start_date, i_size;
+---+--+
| i_rec_start_date  |i_size|
+---+--+
| null  | large|
| 366-11-08| extra large  |
| 366-11-08| medium   |
| null  | medium   |
| 366-11-08| petite   |
| 364-11-07| medium   |
| null  | petite   |
| 365-11-07| medium   |
| 368-11-07| economy  |
| 365-11-07| large|
| 365-11-07| small|
| 366-11-08| small|
| 365-11-07| extra large  |
| 364-11-07| N/A  |
| 366-11-08| economy  |
| 366-11-08| large|
| 364-11-07| small|
| null  | small|
| 364-11-07| large|
| 364-11-07| extra large  |
| 368-11-07| N/A  |
| 368-11-07| extra large  |
| 368-11-07| large|
| 365-11-07| petite   |
| null  | N/A  |
| 365-11-07| economy  |
| 364-11-07| economy  |
| 364-11-07| petite   |
| 365-11-07| N/A  |
| 368-11-07| medium   |
| null  | extra large  |
| 368-11-07| small|
| 368-11-07| petite   |
| 366-11-08| N/A  |
+---+--+
34 rows selected (0.691 seconds)
{code}

However, when I generated the auto-partitioned parquet files using Drill 1.2, 
the above query returned the correct results.

I attached the required data sets.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] drill issue #581: DRILL-4864: Add ANSI format for date/time functions

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/581
  
Please see comments in the JIRA entry. IMHO, the approach taken can be 
improved. We should not attempt to change the format used by the existing 
to_date function. Instead, we should 1) define the new dialect (is there really 
an ANSI standard?) and 2) create a new function: sql_to_date that uses the new 
format.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #619: DRILL-4946: redirect System.err so users under embe...

2016-11-03 Thread chunhui-shi
Github user chunhui-shi commented on a diff in the pull request:

https://github.com/apache/drill/pull/619#discussion_r86431608
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/StdErrCapturer.java ---
@@ -0,0 +1,81 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util;
+
+
+import java.io.ByteArrayOutputStream;
+import java.io.OutputStream;
+import java.io.PrintStream;
+import java.nio.charset.StandardCharsets;
+
+/**
+ * The safe way of this class is to redirect System.err to nullOutputStream
+ * If
--- End diff --

The comment is now finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #619: DRILL-4946: redirect System.err so users under embe...

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/619#discussion_r86435395
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/StdErrCapturer.java ---
@@ -0,0 +1,83 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util;
+
+
+import java.io.ByteArrayOutputStream;
+import java.io.OutputStream;
+import java.io.PrintStream;
+import java.nio.charset.StandardCharsets;
+
+/**
+ * The safe way to use this class is to redirect System.err to 
nullOutputStream.
+ * If callers want to get the redirected output - to print or log 
elsewhere - as used in unit tests and they
+ * need the output to be neat, then there should be advanced ways to 
preserve the output to avoid concurrent 
+ * inputs messing up. But this class is not designed for that.
+ */
+public class StdErrCapturer {
+  static private PrintStream savedStdErr;
+  static private OutputStream buffer;
+  static private PrintStream capturingStream;
+  static Boolean redirected = false;
+
+  public static void redirect() {
--- End diff --

Do in constructor (so it can be reversed in close.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #619: DRILL-4946: redirect System.err so users under embe...

2016-11-03 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/619#discussion_r86435277
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/StdErrCapturer.java ---
@@ -0,0 +1,83 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.util;
+
+
+import java.io.ByteArrayOutputStream;
+import java.io.OutputStream;
+import java.io.PrintStream;
+import java.nio.charset.StandardCharsets;
+
+/**
+ * The safe way to use this class is to redirect System.err to 
nullOutputStream.
+ * If callers want to get the redirected output - to print or log 
elsewhere - as used in unit tests and they
+ * need the output to be neat, then there should be advanced ways to 
preserve the output to avoid concurrent 
+ * inputs messing up. But this class is not designed for that.
+ */
+public class StdErrCapturer {
--- End diff --

If an exception occurs, this class can leave System.err redirected,
causing mysterious errors. Suggestion: make this AutoCloseable:

try ( StdErrCapturer cap = new StdErrCapturer( ) ) {
  // Do your stuff here
}

The above will ensure that stderr is restored whether the enclosed code 
succeeds or fails.
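
An illustrative sketch of that AutoCloseable shape (the names mirror the class under review, but this is not the actual patch; it reuses the ByteArrayOutputStream, PrintStream, and StandardCharsets imports already in the file):

public class StdErrCapturer implements AutoCloseable {
  private final PrintStream savedStdErr = System.err;
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

  public StdErrCapturer() {
    System.setErr(new PrintStream(buffer, true));  // redirect in the constructor
  }

  public String getOutput() {
    return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
  }

  @Override
  public void close() {
    System.setErr(savedStdErr);                    // always restore the original stream
  }
}

try (StdErrCapturer cap = new StdErrCapturer()) {
  // code whose stderr output should be suppressed or captured
}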


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86444051
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -77,7 +78,7 @@
 public class Metadata {
   static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(Metadata.class);
 
-  public static final String[] OLD_METADATA_FILENAMES = 
{".drill.parquet_metadata.v2"};
+  public static final String[] OLD_METADATA_FILENAMES = 
{".drill.parquet_metadata.v3"};
--- End diff --

I reverted this change. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86444637
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetCompareFunctionProcessor.java
 ---
@@ -0,0 +1,280 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import org.apache.drill.common.expression.CastExpression;
+import org.apache.drill.common.expression.ConvertExpression;
+import org.apache.drill.common.expression.FunctionCall;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import 
org.apache.drill.common.expression.ValueExpressions.BooleanExpression;
+import org.apache.drill.common.expression.ValueExpressions.DateExpression;
+import 
org.apache.drill.common.expression.ValueExpressions.DoubleExpression;
+import org.apache.drill.common.expression.ValueExpressions.FloatExpression;
+import org.apache.drill.common.expression.ValueExpressions.IntExpression;
+import org.apache.drill.common.expression.ValueExpressions.LongExpression;
+import org.apache.drill.common.expression.ValueExpressions.QuotedString;
+import org.apache.drill.common.expression.ValueExpressions.TimeExpression;
+import 
org.apache.drill.common.expression.ValueExpressions.TimeStampExpression;
+import org.apache.drill.common.expression.visitors.AbstractExprVisitor;
+
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.ImmutableSet;
+import org.joda.time.DateTimeUtils;
+
+public class ParquetCompareFunctionProcessor extends
+AbstractExprVisitor {
+private Object value;
+private boolean success;
+private boolean isEqualityFn;
+private SchemaPath path;
+private String functionName;
+
+public static final long JULIAN_DAY_EPOC = 
DateTimeUtils.toJulianDayNumber(0);
--- End diff --

Right. 

ParquetCompareFunctionProcessor.java is actually a new class from Adam's 
original patch. It turns out that we do not use this class later on.

I'm going to remove this class entirely. 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86449453
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -1000,6 +1053,81 @@ public long getColumnValueCount(SchemaPath column) {
 
   @Override
   public List getPartitionColumns() {
-return new ArrayList<>(columnTypeMap.keySet());
+return new ArrayList<>(partitionColTypeMap.keySet());
   }
+
+  public GroupScan applyFilter(LogicalExpression filterExpr, UdfUtilities 
udfUtilities,
+  FunctionImplementationRegistry functionImplementationRegistry, 
OptionManager optionManager) {
+if (fileSet.size() == 1 || ! (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v3)) {
+  return null; // no pruning for 1 single parquet file or metadata is 
prior v3.
+}
+
+final Set schemaPathsInExpr = filterExpr.accept(new 
ParquetRGFilterEvaluator.FieldReferenceFinder(), null);
+
+final List qualifiedRGs = new 
ArrayList<>(parquetTableMetadata.getFiles().size());
+Set qualifiedFileNames = Sets.newHashSet(); // HashSet keeps a 
fileName unique.
+
+ParquetFilterPredicate filterPredicate = null;
+
+for (ParquetFileMetadata file : parquetTableMetadata.getFiles()) {
+  final ImplicitColumnExplorer columnExplorer = new 
ImplicitColumnExplorer(optionManager, this.columns);
+  Map implicitColValues = 
columnExplorer.populateImplicitColumns(file.getPath(), selectionRoot);
+
+  for (RowGroupMetadata rowGroup : file.getRowGroups()) {
+ParquetMetaStatCollector statCollector = new 
ParquetMetaStatCollector(
+parquetTableMetadata,
+rowGroup.getColumns(),
+implicitColValues);
+
+Map columnStatisticsMap = 
statCollector.collectColStat(schemaPathsInExpr);
--- End diff --

Right. The filter predicate should be built only once. It is inside the loop 
only because we need the column type information during filter expression 
materialization, for both regular columns and implicit columns. 

I put a check if (filterPredicate == null) inside the loop, so that the filter 
predicate is built only once. 
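
A sketch of that guard, with names following the quoted diff (statCollector stands for the per-row-group ParquetMetaStatCollector created above, and buildFilterPredicate stands in for the materialization step; neither name is claimed to be the actual patch):

ParquetFilterPredicate filterPredicate = null;
for (ParquetFileMetadata file : parquetTableMetadata.getFiles()) {
  for (RowGroupMetadata rowGroup : file.getRowGroups()) {
    Map<SchemaPath, ColumnStatistics> columnStatisticsMap =
        statCollector.collectColStat(schemaPathsInExpr);
    if (filterPredicate == null) {
      // Column types come from the first row group's statistics, so the
      // predicate is materialized lazily here, but only once.
      filterPredicate = buildFilterPredicate(filterExpr, columnStatisticsMap);
    }
    // ... evaluate filterPredicate against this row group's statistics ...
  }
}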


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86455871
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -1000,6 +1053,81 @@ public long getColumnValueCount(SchemaPath column) {
 
   @Override
   public List getPartitionColumns() {
-return new ArrayList<>(columnTypeMap.keySet());
+return new ArrayList<>(partitionColTypeMap.keySet());
   }
+
+  public GroupScan applyFilter(LogicalExpression filterExpr, UdfUtilities 
udfUtilities,
+  FunctionImplementationRegistry functionImplementationRegistry, 
OptionManager optionManager) {
+if (fileSet.size() == 1 || ! (parquetTableMetadata instanceof 
Metadata.ParquetTableMetadata_v3)) {
--- End diff --

I added an isRowGroupPrunable check to the Metadata class and removed the 
"instanceof" check. 
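
Roughly, the shape of that change (illustrative only; the actual member added to Metadata may differ):

// before
if (!(parquetTableMetadata instanceof Metadata.ParquetTableMetadata_v3)) {
  return null; // no row group pruning for pre-v3 metadata
}
// after
if (!parquetTableMetadata.isRowGroupPrunable()) {
  return null;
}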


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86455969
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRGFilterEvaluator.java
 ---
@@ -0,0 +1,252 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.collect.Sets;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.visitors.AbstractExprVisitor;
+import org.apache.drill.exec.compile.sig.ConstantExpressionIdentifier;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.fn.FunctionImplementationRegistry;
+import org.apache.drill.exec.expr.stat.ParquetFilterPredicate;
+import org.apache.drill.exec.expr.stat.RangeExprEvaluator;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.ops.UdfUtilities;
+import org.apache.drill.exec.server.options.OptionManager;
+import org.apache.drill.exec.store.parquet.stat.ColumnStatCollector;
+import org.apache.drill.exec.store.parquet.stat.ColumnStatistics;
+import org.apache.drill.exec.store.parquet.stat.ParquetFooterStatCollector;
+import org.apache.parquet.filter2.predicate.FilterPredicate;
+import org.apache.parquet.filter2.statisticslevel.StatisticsFilter;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+public class ParquetRGFilterEvaluator {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(ParquetRGFilterEvaluator.class);
+
+  public static boolean evalFilter(LogicalExpression expr, ParquetMetadata 
footer, int rowGroupIndex,
+  OptionManager options, FragmentContext fragmentContext) {
+final HashMap emptyMap = new HashMap();
+return evalFilter(expr, footer, rowGroupIndex, options, 
fragmentContext, emptyMap);
+  }
+
+  public static boolean evalFilter(LogicalExpression expr, ParquetMetadata 
footer, int rowGroupIndex,
+  OptionManager options, FragmentContext fragmentContext, Map implicitColValues) {
+// figure out the set of columns referenced in expression.
+final Set schemaPathsInExpr = expr.accept(new 
FieldReferenceFinder(), null);
+final ColumnStatCollector columnStatCollector = new 
ParquetFooterStatCollector(footer, rowGroupIndex, implicitColValues,true, 
options);
+
+Map columnStatisticsMap = 
columnStatCollector.collectColStat(schemaPathsInExpr);
+
+boolean canDrop = canDrop(expr, columnStatisticsMap, 
footer.getBlocks().get(rowGroupIndex).getRowCount(), fragmentContext, 
fragmentContext.getFunctionRegistry());
+return canDrop;
+  }
+
+
+  public static boolean canDrop(ParquetFilterPredicate parquetPredicate, 
Map columnStatisticsMap, long rowCount) {
+boolean canDrop = false;
+if (parquetPredicate != null) {
+  RangeExprEvaluator rangeExprEvaluator = new 
RangeExprEvaluator(columnStatisticsMap, rowCount);
+  canDrop = parquetPredicate.canDrop(rangeExprEvaluator);
+}
+return canDrop;
+  }
+
+
+  public static boolean canDrop(LogicalExpression expr, Map columnStatisticsMap,
+  long rowCount, UdfUtilities udfUtilities, 
FunctionImplementationRegistry functionImplementationRegistry) {
+ErrorCollector errorCollector = new ErrorCollectorImpl();
+LogicalExpression materializedFilter = 
ExpressionTreeMaterializer.materializeFilterExpr(
+expr, columnStatisticsMap, errorCollector, 
functionImplementationRegistry);
+
+if (errorCollector.hasErrors()) {
   

[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86459835
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/stat/ParquetFooterStatCollector.java
 ---
@@ -0,0 +1,176 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet.stat;
+
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.server.options.OptionManager;
+import org.apache.drill.exec.store.ParquetOutputRecordWriter;
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+import 
org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.LongStatistics;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.joda.time.DateTimeConstants;
+import org.joda.time.DateTimeUtils;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Set;
+
+public class ParquetFooterStatCollector implements ColumnStatCollector {
--- End diff --

Added a stopwatch to the stat collector.

Also, the pushdown rule has a stopwatch to log the elapsed time for the 
entire pruning process. That will give us some sense of how much overhead the 
filter pruning process introduces.
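(As a rough illustration of the kind of timing described above — a hypothetical 
sketch, not the PR's code; it assumes Guava's Stopwatch, which Drill already 
depends on, and a placeholder collectStats() method standing in for the real 
stat-collection call:)

{code}
// Hypothetical sketch: timing a per-row-group stat collection with Guava's Stopwatch.
import com.google.common.base.Stopwatch;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PruningTimingSketch {
  private static final Logger logger = LoggerFactory.getLogger(PruningTimingSketch.class);

  public static void main(String[] args) {
    final Stopwatch timer = Stopwatch.createStarted();
    collectStats();  // stand-in for the actual ColumnStatCollector.collectColStat(...) call
    logger.info("Stat collection took {} ms", timer.elapsed(TimeUnit.MILLISECONDS));
  }

  // Placeholder for reading min/max/null-count statistics from the parquet footer.
  private static void collectStats() { }
}
{code}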



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: Time for a 1.9 Release?

2016-11-03 Thread Sudheesh Katkam
Gentle reminder that all check-ins should be done by tomorrow. Please see
the latest statuses of commits that we are targeting:

https://docs.google.com/spreadsheets/d/1UJSXLrfUNZwUnx_
JzkwAcXSxmcbG7meBDad6ZTxlSmw

Thank you,
Sudheesh


On Tue, Nov 1, 2016 at 11:19 AM, Sudheesh Katkam 
wrote:

> The current list of candidate commits for the release is here:
>
> https://docs.google.com/spreadsheets/d/1UJSXLrfUNZwUnx_
> JzkwAcXSxmcbG7meBDad6ZTxlSmw
>
>
> On Mon, Oct 31, 2016 at 8:53 AM, Subbu Srinivasan  > wrote:
>
>> +1.
>>
>> On Sun, Oct 30, 2016 at 10:23 PM, Paul Rogers 
>> wrote:
>>
>> > For release numbers, 1.10 (then 1.11, 1.12, …) seems like a good idea.
>> >
>> > At first it may seem odd to go to 1.10 from 1.9. Might people get
>> confused
>> > between 1.10 and 1.1.0? But, there is precedence. Tomcat’s latest
>> 7-series
>> > release is 7.0.72. Java is on 8u112. And so on.
>> >
>> > I like the idea of moving to 2.0 later when the team introduces a major
>> > change, rather than by default just because the numbers roll around. For
>> > example, Hadoop went to 2.x when YARN was introduced. Impala appears to
>> > have moved to 2.0 when they added Spill to disk for some (all?)
>> operators.
>> >
>> > - Paul
>> >
>> > > On Oct 28, 2016, at 10:34 AM, Sudheesh Katkam 
>> > wrote:
>> > >
>> > > Hi Drillers,
>> > >
>> > > We have a reasonable number of fixes and features since the last
>> release
>> > > [1]. Releasing itself takes a while; so I propose we start the 1.9
>> > release
>> > > process.
>> > >
>> > > I volunteer as the release manager, unless there are objections.
>> > >
>> > > We should also discuss what the release version number should be after
>> > 1.9.
>> > >
>> > > Thank you,
>> > > Sudheesh
>> > >
>> > > [1] https://issues.apache.org/jira/browse/DRILL/fixforversion/
>> 12337861
>> >
>> >
>>
>
>


[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86465657
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/stat/ParquetMetaStatCollector.java
 ---
@@ -0,0 +1,146 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet.stat;
+
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.store.parquet.Metadata;
+import org.apache.drill.exec.store.parquet.ParquetGroupScan;
+import org.apache.parquet.column.statistics.BinaryStatistics;
+import org.apache.parquet.column.statistics.DoubleStatistics;
+import org.apache.parquet.column.statistics.FloatStatistics;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.LongStatistics;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.schema.OriginalType;
+import org.apache.parquet.schema.PrimitiveType;
+import org.joda.time.DateTimeConstants;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+public class ParquetMetaStatCollector implements ColumnStatCollector {
+  private final Metadata.ParquetTableMetadataBase parquetTableMetadata;
+  private final List<? extends Metadata.ColumnMetadata> 
columnMetadataList;
+  final Map<String, String> implicitColValues;
+
+  public ParquetMetaStatCollector(Metadata.ParquetTableMetadataBase 
parquetTableMetadata, List<? extends Metadata.ColumnMetadata> 
columnMetadataList, Map<String, String> implicitColValues) {
+this.parquetTableMetadata = parquetTableMetadata;
+this.columnMetadataList = columnMetadataList;
+this.implicitColValues = implicitColValues;
+  }
+
+  @Override
+  public Map<SchemaPath, ColumnStatistics> collectColStat(Set<SchemaPath> 
fields) {
+// map from column to ColumnMetadata
+final Map<SchemaPath, Metadata.ColumnMetadata> columnMetadataMap = new 
HashMap<>();
+
+// map from column name to column statistics.
+final Map<SchemaPath, ColumnStatistics> statMap = new HashMap<>();
+
+for (final Metadata.ColumnMetadata columnMetadata : 
columnMetadataList) {
+  SchemaPath schemaPath = 
SchemaPath.getCompoundPath(columnMetadata.getName());
+  columnMetadataMap.put(schemaPath, columnMetadata);
+}
+
+for (final SchemaPath schemaPath : fields) {
+  final PrimitiveType.PrimitiveTypeName primitiveType;
+  final OriginalType originalType;
+
+  final Metadata.ColumnMetadata columnMetadata = 
columnMetadataMap.get(schemaPath);
+
+  if (columnMetadata != null) {
+final Object min = columnMetadata.getMinValue();
+final Object max = columnMetadata.getMaxValue();
+final Long numNull = columnMetadata.getNulls();
+
+primitiveType = 
this.parquetTableMetadata.getPrimitiveType(columnMetadata.getName());
+originalType = 
this.parquetTableMetadata.getOriginalType(columnMetadata.getName());
+final Integer repetitionLevel = 
this.parquetTableMetadata.getRepetitionLevel(columnMetadata.getName());
+
+statMap.put(schemaPath, getStat(min, max, numNull, primitiveType, 
originalType, repetitionLevel));
+  } else {
+final String columnName = schemaPath.getRootSegment().getPath();
+if (implicitColValues.containsKey(columnName)) {
--- End diff --

Without knowledge of implicit columns, expression materialization will 
treat dir0 in dir0 = 1995 as a NullExpression and cannot differentiate it from 
a regular non-existent column. A condition on a regular non-existent column 
will always lead to canDrop = true.

That's the main reason we have to pass in the list of implicit columns when 
doing expression materialization.

In addition to the implicit column names, we also pass in the implicit column 
values and wrap them in a Statistics instance whose min and max are the same 
value. That is done to expand the possibility of pruning.
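(To illustrate that last point, a minimal hypothetical sketch — not the PR's 
code — of wrapping an implicit partition value such as dir0 = '1.9' in a 
parquet Statistics object whose min and max are identical, assuming parquet-mr's 
BinaryStatistics/Binary API:)

{code}
// Hypothetical sketch: an implicit (partition) column has a single constant value per
// row group, so its "statistics" collapse to min == max with zero nulls.
import org.apache.parquet.column.statistics.BinaryStatistics;
import org.apache.parquet.io.api.Binary;

public class ImplicitColumnStatSketch {
  public static BinaryStatistics statForImplicitValue(String value) {
    final BinaryStatistics stat = new BinaryStatistics();
    final Binary v = Binary.fromString(value);
    stat.setMinMax(v, v);   // min == max == the partition value, e.g. "1.9"
    stat.setNumNulls(0);    // every row in the row group carries this value
    return stat;
  }
}
{code}

(With such an entry in the statistics map, a predicate like dir0 = '1.9' is 
evaluated against a degenerate [v, v] range, so a row group whose partition 
value differs can be dropped.)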

[GitHub] drill pull request #644: DRILL-4980: Upgrading of the approach of parquet da...

2016-11-03 Thread vdiravka
Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86477598
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -927,15 +927,11 @@ public void setMax(Object max) {
 @JsonProperty List files;
 @JsonProperty List directories;
 @JsonProperty String drillVersion;
-@JsonProperty boolean isDateCorrect;
+@JsonProperty int writerVersion;
--- End diff --

`ParquetTableMetadata_v2` is used mostly when the parquet metadata cache file 
is created. 
For reading the metadata cache file or the metadata footer this class is not 
used, except for one case: when we read a single parquet file or a parquet 
folder, an empty instance of this class is created, but only its columnTypeInfo 
field is used. The drillVersion and writerVersion fields weren't used at all. 
That's why everything worked, but it was not right. I added an empty 
constructor (as I did earlier; essentially the master version) and added 
initialization of these fields through constructor parameters. I think this is 
more correct, and it looks like this is what you were talking about.
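(A minimal hypothetical sketch of the shape being described — illustrative 
class and field names only, not the actual ParquetTableMetadata_v2: the version 
fields are Jackson-annotated and populated through constructor parameters, 
while a no-arg constructor is kept for deserializing existing cache files:)

{code}
import com.fasterxml.jackson.annotation.JsonProperty;

public class TableMetadataSketch {
  @JsonProperty String drillVersion;
  @JsonProperty int writerVersion;

  // No-arg constructor kept so Jackson can still deserialize existing cache files.
  public TableMetadataSketch() {
  }

  // Fields are initialized through constructor parameters when the cache file is written.
  public TableMetadataSketch(String drillVersion, int writerVersion) {
    this.drillVersion = drillVersion;
    this.writerVersion = writerVersion;
  }
}
{code}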


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #644: DRILL-4980: Upgrading of the approach of parquet da...

2016-11-03 Thread vdiravka
Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86478118
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetWriter.java
 ---
@@ -40,6 +40,8 @@
 public class ParquetWriter extends AbstractWriter {
   static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(ParquetWriter.class);
 
+  public static final int WRITER_VERSION = 2;
--- End diff --

I copied this comment into the project.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #644: DRILL-4980: Upgrading of the approach of parquet da...

2016-11-03 Thread vdiravka
Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86477805
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
 ---
@@ -78,7 +78,7 @@
   private static final int MAXIMUM_RECORD_COUNT_FOR_CHECK = 1;
 
   public static final String DRILL_VERSION_PROPERTY = "drill.version";
-  public static final String IS_DATE_CORRECT_PROPERTY = "is.date.correct";
+  public static final String WRITER_VERSION_PROPERTY = 
"parquet-writer.version";
--- End diff --

Done. Also parquet files for unit tests were regenerated as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (DRILL-4997) Perform parquet filter pushdown in logical planning.

2016-11-03 Thread Jinfeng Ni (JIRA)
Jinfeng Ni created DRILL-4997:
-

 Summary: Perform parquet filter pushdown in logical planning. 
 Key: DRILL-4997
 URL: https://issues.apache.org/jira/browse/DRILL-4997
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning & Optimization
Reporter: Jinfeng Ni


DRILL-1950 put the parquet filter pushdown rule in the physical planning phase, 
after all join-order-related tasks have been done. The main reason for doing 
this is that we want to reduce the possibility of a performance regression 
caused by a different join order resulting from the reduced row count in the 
scan operator. 

Ideally this should be done in logical planning, before join order planning is 
done. Before we can make such a change, we have to figure out how to adjust the 
selectivity estimate of the filter operator after the filter is pushed down to 
the scan.
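To make the estimation issue concrete (numbers invented purely for 
illustration): if a scan is estimated at 1,000,000 rows and the filter above it 
has a selectivity of 0.1, the planner expects 100,000 rows out of the filter. 
Once the filter is pushed into the scan, the scan itself may already report 
only 100,000 rows; applying the 0.1 selectivity again to the residual filter 
would estimate 10,000 rows, discounting the same predicate twice and possibly 
steering join order planning toward a worse plan. Moving the rule into logical 
planning therefore requires adjusting the residual filter's selectivity so the 
row count is not reduced twice.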

 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-4998) Perform parquet filter pushdown in logical planning.

2016-11-03 Thread Jinfeng Ni (JIRA)
Jinfeng Ni created DRILL-4998:
-

 Summary: Perform parquet filter pushdown in logical planning.
 Key: DRILL-4998
 URL: https://issues.apache.org/jira/browse/DRILL-4998
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning & Optimization
Reporter: Jinfeng Ni


DRILL-1950 put the parquet filter pushdown rule in the physical planning phase, 
after all join-order-related tasks have been done. The main reason for doing 
this is that we want to reduce the possibility of a performance regression 
caused by a different join order resulting from the reduced row count in the 
scan operator. 

Ideally this should be done in logical planning, before join order planning is 
done. Before we can make such a change, we have to figure out how to adjust the 
selectivity estimate of the filter operator after the filter is pushed down to 
the scan.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86487867
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/PlannerPhase.java ---
@@ -346,6 +355,20 @@ static RuleSet getPruneScanRules(OptimizerRulesContext 
optimizerRulesContext) {
   }
 
   /**
+   *   Get an immutable list of partition pruning rules that will be used 
in physical planning.
+   */
+  static RuleSet getPhysicalPruneScanRules(OptimizerRulesContext 
optimizerRulesContext) {
+final ImmutableSet<RelOptRule> pruneRules = 
ImmutableSet.<RelOptRule>builder()
+.add(
+
ParquetPushDownFilter.getFilterOnProject(optimizerRulesContext),
--- End diff --

Opened DRILL-4998 and added comments to explain why we put this rule in the 
physical planning phase. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86488578
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/RangeExprEvaluator.java
 ---
@@ -0,0 +1,282 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.stat;
+
+import com.google.common.base.Preconditions;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.apache.drill.common.expression.FunctionHolderExpression;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.expression.fn.CastFunctions;
+import org.apache.drill.common.expression.fn.FuncHolder;
+import org.apache.drill.common.expression.visitors.AbstractExprVisitor;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.expr.DrillSimpleFunc;
+import org.apache.drill.exec.expr.fn.DrillSimpleFuncHolder;
+import org.apache.drill.exec.expr.fn.interpreter.InterpreterEvaluator;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.Float4Holder;
+import org.apache.drill.exec.expr.holders.Float8Holder;
+import org.apache.drill.exec.expr.holders.IntHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.store.parquet.stat.ColumnStatistics;
+import org.apache.drill.exec.vector.ValueHolderHelper;
+import org.apache.parquet.column.statistics.DoubleStatistics;
+import org.apache.parquet.column.statistics.FloatStatistics;
+import org.apache.parquet.column.statistics.IntStatistics;
+import org.apache.parquet.column.statistics.LongStatistics;
+import org.apache.parquet.column.statistics.Statistics;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+
+public class RangeExprEvaluator extends AbstractExprVisitor<Statistics, Void, RuntimeException> {
+  static final Logger logger = 
LoggerFactory.getLogger(RangeExprEvaluator.class);
+
+  private final Map<SchemaPath, ColumnStatistics> columnStatMap;
+  private final long rowCount;
+
+  public RangeExprEvaluator(final Map<SchemaPath, ColumnStatistics> 
columnStatMap, long rowCount) {
+this.columnStatMap = columnStatMap;
+this.rowCount = rowCount;
+  }
+
+  public long getRowCount() {
+return this.rowCount;
+  }
+
+  @Override
+  public Statistics visitUnknown(LogicalExpression e, Void value) throws 
RuntimeException {
+if (e instanceof TypedFieldExpr) {
+  TypedFieldExpr fieldExpr = (TypedFieldExpr) e;
+  final ColumnStatistics columnStatistics = 
columnStatMap.get(fieldExpr.getPath());
+  if (columnStatistics != null) {
+return columnStatistics.getStatistics();
+  } else {
+// field does not exist.
+
Preconditions.checkArgument(fieldExpr.getMajorType().equals(Types.OPTIONAL_INT));
+IntStatistics intStatistics = new IntStatistics();
+intStatistics.setNumNulls(rowCount); // all values are nulls
+return intStatistics;
+  }
+}
+return null;
+  }
+
+  @Override
+  public Statistics visitIntConstant(ValueExpressions.IntExpression expr, 
Void value) throws RuntimeException {
+return getStatistics(expr.getInt());
+  }
+
+  @Override
+  public Statistics visitLongConstant(ValueExpressions.LongExpression 
expr, Void value) throws RuntimeException {
+return getStatistics(expr.getLong());
+  }
+
+  @Override
+  public Statistics visitFloatConstant(ValueExpressions.FloatExpression 
expr, Void value) throws RuntimeException {
+return getStatistics(expr.getFloat());
+  }
+
+  @Override
+  public Statistics visitDoubleConstant(ValueExpre

[jira] [Created] (DRILL-4999) Metadata Cache Pruning not taking place when we have both directory partitioning and auto partitioned parquet files

2016-11-03 Thread Rahul Challapalli (JIRA)
Rahul Challapalli created DRILL-4999:


 Summary: Metadata Cache Pruning not taking place when we have both 
directory partitioning and auto partitioned parquet files
 Key: DRILL-4999
 URL: https://issues.apache.org/jira/browse/DRILL-4999
 Project: Apache Drill
  Issue Type: Bug
  Components: Metadata, Query Planning & Optimization
Reporter: Rahul Challapalli


git.commit.id.abbrev=17b9648

Metadata cache pruning is not taking place in the below scenario
{code}
0: jdbc:drill:zk=10.10.100.190:5181> explain plan for select distinct 
i_rec_end_date from 
dfs.`/drill/testdata/parquet_date/auto_partition/item_single` where dir0='1.9' 
and i_rec_start_date = date '1997-10-27';
+--+--+
| text | json |
+--+--+
| 00-00    Screen
00-01      Project(i_rec_end_date=[$0])
00-02        HashAgg(group=[{0}])
00-03          Project(i_rec_end_date=[$2])
00-04            SelectionVectorRemover
00-05              Filter(condition=[AND(=($0, '1.9'), =($1, 1997-10-27))])
00-06                Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath 
[path=/drill/testdata/parquet_date/auto_partition/item_single/1.9/0_0_1.parquet]],
 selectionRoot=/drill/testdata/parquet_date/auto_partition/item_single, 
numFiles=1, usedMetadataFile=true, 
cacheFileRoot=/drill/testdata/parquet_date/auto_partition/item_single, 
columns=[`dir0`, `i_rec_start_date`, `i_rec_end_date`]]])
{code}

I attached the data set and the log file



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] drill pull request #637: Drill 1950 : Parquet row group filter pushdown.

2016-11-03 Thread jinfengni
Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/637#discussion_r86489899
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetPredicates.java
 ---
@@ -0,0 +1,334 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.stat;
+
+import org.apache.drill.common.expression.BooleanOperator;
+import org.apache.drill.common.expression.ExpressionPosition;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.LogicalExpressionBase;
+import org.apache.drill.common.expression.visitors.ExprVisitor;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+public abstract  class ParquetPredicates {
+  public static abstract  class ParquetCompPredicate extends 
LogicalExpressionBase implements ParquetFilterPredicate {
+protected final LogicalExpression left;
+protected final LogicalExpression right;
+
+public ParquetCompPredicate(LogicalExpression left, LogicalExpression 
right) {
+  super(left.getPosition());
+  this.left = left;
+  this.right = right;
+}
+
+@Override
+public Iterator<LogicalExpression> iterator() {
+  final List<LogicalExpression> args = new ArrayList<>();
+  args.add(left);
+  args.add(right);
+  return args.iterator();
+}
+
+@Override
+public <T, V, E extends Exception> T accept(ExprVisitor<T, V, E> 
visitor, V value) throws E {
+  return visitor.visitUnknown(this, value);
+}
+
+  }
+
+  public static abstract class ParquetBooleanPredicate extends 
BooleanOperator implements ParquetFilterPredicate {
+public ParquetBooleanPredicate(String name, List<LogicalExpression> 
args, ExpressionPosition pos) {
+  super(name, args, pos);
+}
+
+@Override
+public <T, V, E extends Exception> T accept(ExprVisitor<T, V, E> 
visitor, V value) throws E {
+  return visitor.visitBooleanOperator(this, value);
+}
+  }
+
+  public static class AndPredicate extends ParquetBooleanPredicate {
+public AndPredicate(String name, List<LogicalExpression> args, 
ExpressionPosition pos) {
+  super(name, args, pos);
+}
+
+@Override
+public boolean canDrop(RangeExprEvaluator evaluator) {
+  // "and" : as long as one branch is OK to drop, we can drop it.
+  for (LogicalExpression child : this) {
+if (((ParquetFilterPredicate) child).canDrop(evaluator)) {
+  return true;
+}
+  }
+  return false;
+}
+  }
+
+  public static class OrPredicate extends ParquetBooleanPredicate {
+public OrPredicate(String name, List<LogicalExpression> args, 
ExpressionPosition pos) {
+  super(name, args, pos);
+}
+
+@Override
+public boolean canDrop(RangeExprEvaluator evaluator) {
+  for (LogicalExpression child : this) {
+// "long" : as long as one branch is NOT ok to drop, we can NOT 
drop it.
+if (! ((ParquetFilterPredicate) child).canDrop(evaluator)) {
+  return false;
+}
+  }
+
+  return true;
+}
+  }
+
+  // is this column chunk composed entirely of nulls?
+  // assumes the column chunk's statistics is not empty
+  protected static boolean isAllNulls(Statistics stat, long rowCount) {
+return stat.getNumNulls() == rowCount;
+  }
+
+  // are there any nulls in this column chunk?
+  // assumes the column chunk's statistics is not empty
+  protected static boolean hasNulls(Statistics stat) {
+return stat.getNumNulls() > 0;
+  }
+
+  /**
+   * EQ (=) predicate
+   */
+  public static class EqualPredicate extends ParquetCompPredicate {
+public EqualPredicate(LogicalExpression left, LogicalExpression right) 
{
+  super(left, righ

[jira] [Created] (DRILL-5000) Dir Pruning + Auto Partitioning : AssertionError in FindPartitionConditions

2016-11-03 Thread Rahul Challapalli (JIRA)
Rahul Challapalli created DRILL-5000:


 Summary: Dir Pruning + Auto Partitioning : AssertionError in 
FindPartitionConditions
 Key: DRILL-5000
 URL: https://issues.apache.org/jira/browse/DRILL-5000
 Project: Apache Drill
  Issue Type: Bug
  Components: Metadata, Query Planning & Optimization
Reporter: Rahul Challapalli


git.commit.id.abbrev=190d5d4

The below query fails with an Assertion Error
{code}
select i_rec_start_date, i_rec_end_date from item where dir0='1.9' or 
dir0='1.2' and i_rec_start_date = date '1997-10-27';
Error: SYSTEM ERROR: AssertionError


[Error Id: 4271ca4f-b7f1-40d9-b203-f53ff6de7e04 on qa-node191.qa.lab:31010] 
(state=,code=0)
{code}

I attached the log files and the data set



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-5001) Join only supports implicit casts error even when I have explicit cast

2016-11-03 Thread Rahul Challapalli (JIRA)
Rahul Challapalli created DRILL-5001:


 Summary: Join only supports implicit casts error even when I have 
explicit cast
 Key: DRILL-5001
 URL: https://issues.apache.org/jira/browse/DRILL-5001
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning & Optimization
Reporter: Rahul Challapalli


git.commit.id.abbrev=190d5d4

The below query fails even when I had an explicit cast on the right hand side 
of the join condition. The data also contains a metadata cache
{code}
select
  a.int_col,
  b.date_col 
from
  dfs.`/drill/testdata/parquet_date/metadata_cache/mixed/fewtypes_null_large` a 
  inner join
(
  select
* 
  from
    dfs.`/drill/testdata/parquet_date/metadata_cache/mixed/fewtypes_null_large` 
  where
dir0 = '1.2' 
and date_col > '1996-03-07' 
)
b 
on a.date_col = cast(date_add(b.date_col, 5) as date) 
where
  a.int_col = 7 
  and a.dir0 = '1.9' 
group by
  a.int_col,
  b.date_col;

Error: SYSTEM ERROR: DrillRuntimeException: Join only supports implicit casts 
between 1. Numeric data
 2. Varchar, Varbinary data 3. Date, Timestamp data Left type: DATE, Right 
type: VARCHAR. Add explicit casts to avoid this error

Fragment 2:0

[Error Id: a1b26420-af35-4892-9a87-d9b04e4423dc on qa-node190.qa.lab:31010] 
(state=,code=0)
{code}

I attached the data and the log file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)