Re: Handling schema change in blocking operators

2018-11-06 Thread Paul Rogers
Hi Boaz,

As noted earlier, it would be wonderful if Drill could handle schema changes on 
the fly, using only the information in the files as they are read, and with 
only a few code changes. Alas, such is not the case.

Question: is the goal to have schema-change failures somewhat less often (they 
would still occur in the cases we discussed)? Or is it to have rock solid, 
reliable results? IMHO, the team can produce rock solid results, with less work, 
using the schema solution. Everybody understands schemas and will appreciate 
being able to use a simple solution rather than working through complex 
schema-change rules.

Schema-free still exists: it is the reward for folks with clean data sets. 
Folks with dirty data have to pay the cost of a schema to clean up their mess 
(in lieu of ETL into Parquet, which is more costly.)

Let's again look at the core problem. Remember the time dimension: files are 
read in random order on random nodes. This means no two runs of a query will 
see data in the same order either in the scanner or in downstream operators.

Think through this scenario:

* 20 files, each with 100K rows (2 million rows total, at least two record 
batches per file)
* 2 nodes with 5 minor fragments each (10 scanners total)
* Files exhibit the schema changes you suggest
* Files are read in random order on random nodes
* xDBC clients demand that the schema delivered on the first row is the same as 
that delivered on all rows up to the 2 millionth.
* Running the same query (same data) over and over produces the same schema on 
each run.

Thanks much for the use cases. Let's look at them in this context:

   * Column added

   * Fields added in a JSON

   * Numeric "enlargement", like INT --> BIGINT, or INT --> DECIMAL, etc.

   * Non-Nullable to Nullable.


All of these appear to have a simple solution if we know the sequence in which 
data arrives.

* If a column is added, then if we read the new files first, we'll know the 
type for the older files that don't have the column.

* If there is numeric enlargement and we read the larger size first, we know to 
read the older, smaller values at the larger size.

* If we read the non-nullable values first, we know to treat the older, 
non-nullable values as nullable.

But, let's play the story the other way around. Remember, xDBC clients can't 
handle a schema change: whatever schema we deliver on the first row we must 
continue to use to the last row.

* So, suppose we read the old files first. If a column is missing, but the user 
asked for it, what type is the column? How does the first row know what type 
will appear once Drill gets to the newer files? Or, do we pick a type and force 
the column, when it does appear, into that type?

* The same is true of type enlargement: how will Drill know, when it reads and 
delivers the first rows with a narrow type, that a wider type is coming 100 
files from now? Or, if Drill picked the narrow type, should Drill try to force 
the wider values into that initial narrow type?

There are two choices. First, buffer all the rows before delivering the first 
so that type changes can be made on the buffered data before it is delivered. 
Second, know the final type info up front, in the form of a schema (or hint or 
...)
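
To make the trade-off concrete, here is a minimal Java sketch (purely 
illustrative, not Drill's vector code): widening older INT values to a BIGINT 
type declared up front is lossless and can happen as the data streams by, while 
squeezing later, wider values into an already-delivered narrow type can fail.

public class TypeWidening {

  /** Safe: the final BIGINT type is known up front, so older INT values are promoted on read. */
  static long widen(int oldValue) {
    return oldValue;                           // lossless INT -> BIGINT
  }

  /** Risky: a narrow schema was already delivered, and a later, wider value may not fit it. */
  static int narrow(long newValue) {
    if (newValue > Integer.MAX_VALUE || newValue < Integer.MIN_VALUE) {
      throw new IllegalStateException("value does not fit the schema already delivered: " + newValue);
    }
    return (int) newValue;
  }

  public static void main(String[] args) {
    System.out.println(widen(567));            // 567, now a BIGINT
    System.out.println(narrow(567L));          // happens to fit
    try {
      narrow(10_000_000_000L);                 // a wider value arrives 100 files later...
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());      // ...and the narrow choice can no longer be honored
    }
  }
}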

Thanks for enumerating the many limitations.


* Schema change for xDBC only works if there is a buffering operator that sees 
all rows (such as a sort), but not one that sees only some rows (such as a 
grouped aggregation).

* Schema change would be very hard on key fields, so a query can still fail 
sometimes. We've not solved the problem, just made it somewhat more obscure.

* Resulting schema order may depend on file read order.

And, I'll throw in one I mentioned before: an arbitrary operator may not have 
the needed context, such as date formats, whether to convert to VARCHAR vs. a 
numeric type, etc.

I understand the case for "fail somewhat less often." I am simply suggesting the 
team can achieve rock solid results -- and do so at a lower cost/risk than the 
partial solution. (I really want to have to throw away that "Data Engineering" 
chapter in the Drill book that explains all these limitations.)


Thanks,
- Paul

 

On Tuesday, November 6, 2018, 5:50:52 PM PST, Boaz Ben-Zvi 
 wrote:  
 
   Hi Paul,

(_a_)  Having a "schema file" sounds like a contradiction to calling Drill 
"schema free"; maybe we could "sweep it under the mat" by creating a new 
convention for scanners, such that if a scanner has multiple files to 
read (e.g. f1.csv, f2.csv, ...), then if there's some file named 
"MeFirst.csv", it would always be read first !! (With some option to 
skip some of the rows there, like "MeFirst0.csv" means skip all the rows).

(_b_) If the schema (hint) is kept somewhere, could it be updated 
automatically by the executing query? If so, running again a query that 
failed with "schema change" may succeed the second time. If there is an issue 
with permissions, maybe each user can keep such a cache in their ~/.drill ...

(_c_) Indeed we 

[GitHub] lushuifeng commented on issue #1524: DRILL-6830: Remove Hook.REL_BUILDER_SIMPLIFY handler after use

2018-11-06 Thread GitBox
lushuifeng commented on issue #1524: DRILL-6830: Remove 
Hook.REL_BUILDER_SIMPLIFY handler after use
URL: https://github.com/apache/drill/pull/1524#issuecomment-436478375
 
 
   @vvysotskyi There are some type mismatch errors in the tests, since 
RelBuilder.simplify is not always 
false, so simplifier.simplifyPreservingType will be invoked in some tests.
These errors are reproducible by running one test at a time: for example, 
running `TestCaseNullableTypes#testCaseNullableTypesVarchar` alone on master 
fails, while the tests pass if all tests in TestCaseNullableTypes are run at once.
   Should I fix these errors in this PR or create another ticket? 
   I have to say that I am not quite familiar with Calcite.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Re: [DISCUSS] 1.15.0 release

2018-11-06 Thread Gautam Parai
Hi Vitalii,

Thanks for volunteering to be the release manager!

I am currently working on DRILL-6770(
https://issues.apache.org/jira/browse/DRILL-6770) which is upgrading the
MapR/OJAI versions. We are running into some wrong results issues and I am
actively working on it.

I think we should definitely try to get it in this release. I think I
should be done by the end of next week (11/16).

Thanks,
Gautam

On Tue, Nov 6, 2018 at 12:28 PM Karthikeyan Manivannan 
wrote:

> We should try to get "DRILL-5671: Set secure ACLs (Access Control List) for
> Drill ZK nodes in a secure cluster" into 1.15.
> Already +1ed by a committer but waiting for a +1 from another committer who
> had also participated in the review.
>
> On Tue, Nov 6, 2018 at 9:46 AM Vitalii Diravka  wrote:
>
> > Hi Drillers,
> >
> > It's been 3 months since the last release and it is time to do the next
> > one.
> >
> > I'll volunteer to manage the release :)
> >
> > There are 32 open tickets that are still intended to be included in
> 1.15.0
> > release [1].
> > What do you guys think which tickets do we want to include and what time
> > will it take?
> > If there are any other issues on which work is in progress, that you feel
> > we *must* include in the release, please post in reply to this thread.
> >
> > Based on your input we'll define release cut off date.
> >
> > [1]
> >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_issues_-3Fjql-3Dproject-2520-253D-2520DRILL-2520AND-2520status-2520in-2520-28Open-252C-2520-2522In-2520Progress-2522-252C-2520Reopened-252C-2520Reviewable-252C-2520Accepted-29-2520AND-2520fixVersion-2520-253D-25201.15.0-2520AND-2520-28component-2520-21-253D-2520Documentation-2520OR-2520component-2520is-2520null-29-2520-2520AND-2520-28labels-2520-21-253D-2520ready-2Dto-2Dcommit-2520OR-2520labels-2520is-2520null-29-2520ORDER-2520BY-2520status-2520DESC-252C-2520updated-2520DESC&d=DwIFaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=HlugibuI4IVjs-VMnFvNTcaBtEaDDqE4Ya96cugWqJ8&m=akL-98gphoiivbhujdlv6Tf87UzIw9APJX71G9BDSFw&s=A8aoDFlrsbwTNj6NZPNVHYYMIyqVmfKEeDSM0Jb_ZZc&e=
> >
> > Kind regards
> > Vitalii
> >
>


Re: Handling schema change in blocking operators

2018-11-06 Thread Boaz Ben-Zvi

 Hi Paul,

(_a_)  Having a "schema file" sounds like a contradiction to calling Drill 
"schema free"; maybe we could "sweep it under the mat" by creating a new 
convention for scanners, such that if a scanner has multiple files to 
read (e.g. f1.csv, f2.csv, ...), then if there's some file named 
"MeFirst.csv", it would always be read first !! (With some option to 
skip some of the rows there, like "MeFirst0.csv" means skip all the rows).


(_b_) If the schema (hint) is kept somewhere, could it be updated 
automatically by the executing query? If so, running again a query that 
failed with "schema change" may succeed the second time. If there is an issue 
with permissions, maybe each user can keep such a cache in their ~/.drill ...


(_c_) Indeed we can't have a general "schema change" solution; however 
we can focus on the low-hanging fruit, namely "schema evolution". In 
many cases, the change in the schema is "natural", and we could easily 
adapt the blocking operators. Cases like:


   * Column added

   * Fields added in a JSON

   * Numeric "enlargement", like INT --> BIGINT, or INT --> DECIMAL, etc.

   * Non-Nullable to Nullable.

Further ideas:

- A blocking operator has a notion of the current schema; once the 
schema "evolves", it can either "pause and convert all the old ones", 
or work lazily -- just track the old ones and make changes as needed 
(e.g., work with two sets of generated code).


- As these changes are rare, we could restrict ourselves to handling only "one 
active change at a time".


- Memory management could be an issue (with "pause and convert"), but 
may be simple if the computation starts using the newer bigger batch 
size (for "lazy").


- We should distinguish between "key" columns, and "non-key" columns 
(for Sort / Hash-Join) or "value" columns in the Hash-Agg. One 
possibility for the Hash operators is to have some hash function 
compatibility, like  HashFunc( INT 567 ) == HashFunc( BIGINT 567 ), to 
simplify (and avoid rehashing).
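
For illustration, here is a small, self-contained Java sketch (not Drill's 
actual hash code) of one way to get the HashFunc( INT 567 ) == HashFunc( BIGINT 
567 ) property: always hash the 64-bit widening of an integer key, so a key 
read as INT in one file and BIGINT in another lands in the same hash partition 
and no rehashing is needed.

public class TypeStableHash {

  /** Shared entry point: INT and BIGINT keys are hashed through the same 64-bit path. */
  static int hash(long value) {
    long h = value * 0x9E3779B97F4A7C15L;      // simple 64-bit Fibonacci-style mix
    return (int) (h ^ (h >>> 32));
  }

  static int hash(int value) {
    return hash((long) value);                 // widen first, so INT and BIGINT hashes agree
  }

  public static void main(String[] args) {
    System.out.println(hash(567) == hash(567L));  // true: same partition, no rehash after widening
  }
}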


    Thanks,

 Boaz

On 11/6/18 12:25 PM, Paul Rogers wrote:

Hi Aman,

I would completely agree with the analysis -- except for the fact that we can't 
create a general solution, only a patchwork of incomplete ad-hoc solutions. The 
question is not whether it would be useful to have a general solution (it 
would), rather whether it is technically possible without some help from the 
user (it is not, IMHO.)

I like the scenario presented; it gives us a concrete example. Let's say an IoT 
device produced files with an evolving schema. A field in a JSON file started 
as BIGINT, later became DOUBLE, and finally became VARCHAR. What should Drill 
do? Maybe the values are:
1
1.1
1.33

The change of types might represent the idea that the above are money amounts, 
and the only way to represent values exactly is with a string (in JSON) and 
with a DECIMAL in Drill.

Or, maybe the values are:
1
1.1
1.1rev3

This shows that the value is a version string. Early developers thought to 
use an integer, later they wanted minor versions, and even later they realized 
they needed a patch value. The correct value type is VARCHAR.

One can also invent a scenario in which the proper type is BIGINT, DOUBLE or 
even TIMESTAMP.

Since Drill can't know the user's intention, we can invest quite a bit of 
effort and still not solve the problem.

What is the alternative?

Suppose we simply let the query fail when we see a schema change, but we point 
the user to a solution:

Query failed: Schema conflict on column `foo`: BIGINT and DOUBLE.
Use a schema file to resolve the ambiguity.
See http://drill.apache.org/docs/schema-file for more information.

Now, the user is in control: we stated what we can and cannot do and gave the 
user the option to decide on the data type.

This is a special case of other use cases: it works just as well for specifying 
CSV types, refining JSON types and so on. A single solution that solves 
multiple problems.

This approach also solves the problem that the JDBC and ODBC clients can't 
handle a schema that changes during processing. (The native Drill client can, 
which is a rather cool feature. xDBC hasn't caught up, so we have to deal with 
them as they are.)

In fact, Drill could then say: if your data is nice and clean, query it without 
a schema since the data speaks for itself. If, however, your data is messy (as 
real-world data tends to be), just provide a schema to explain the intent and 
Drill will do the right thing.

And, again, if the team tried the schema solution first, you'd be in a much 
better position to see what additional benefits could be had by trying to guess 
the type (and solving the time-travel issue.) (This is the lazy approach: do 
the least amount of work...)

In fact, it may turn out th

[GitHub] jcmcote commented on a change in pull request #1500: DRILL-6820: Msgpack format reader

2018-11-06 Thread GitBox
jcmcote commented on a change in pull request #1500: DRILL-6820: Msgpack format 
reader
URL: https://github.com/apache/drill/pull/1500#discussion_r231341908
 
 

 ##
 File path: 
contrib/format-msgpack/src/main/java/org/apache/drill/exec/store/msgpack/MsgpackSchema.java
 ##
 @@ -0,0 +1,114 @@
+package org.apache.drill.exec.store.msgpack;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.exception.SchemaChangeRuntimeException;
+import org.apache.drill.exec.proto.UserBitShared.SerializedField;
+import org.apache.drill.exec.proto.UserBitShared.SerializedField.Builder;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.shaded.guava.com.google.common.base.Preconditions;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.security.AccessControlException;
+
+import com.google.protobuf.TextFormat;
+import com.google.protobuf.TextFormat.ParseException;
+
+public class MsgpackSchema {
+  public static final String SCHEMA_FILE_NAME = ".schema.proto";
+
+  @SuppressWarnings("unused")
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(MsgpackSchema.class);
+
+  private DrillFileSystem fileSystem;
+
+  public MsgpackSchema(DrillFileSystem fileSystem) {
+this.fileSystem = fileSystem;
+  }
+
+  public MaterializedField load(Path schemaLocation) throws 
AccessControlException, FileNotFoundException, IOException {
+MaterializedField previousMapField = null;
+if (schemaLocation != null && fileSystem.exists(schemaLocation)) {
+  try (FSDataInputStream in = fileSystem.open(schemaLocation)) {
+String schemaData = IOUtils.toString(in);
+Builder newBuilder = SerializedField.newBuilder();
+try {
+  TextFormat.merge(schemaData, newBuilder);
+} catch (ParseException e) {
+  throw new DrillRuntimeException("Failed to read schema file: " + 
schemaLocation, e);
+}
+SerializedField read = newBuilder.build();
+previousMapField = MaterializedField.create(read);
+  }
+}
+return previousMapField;
+  }
+
+  public void save(MaterializedField mapField, Path schemaLocation) throws 
IOException {
+try (FSDataOutputStream out = fileSystem.create(schemaLocation, true)) {
+  SerializedField serializedMapField = mapField.getSerializedField();
+  String data = TextFormat.printToString(serializedMapField);
+  IOUtils.write(data, out);
+}
+  }
+
+  public MaterializedField merge(MaterializedField existingField, 
MaterializedField newField) {
+Preconditions.checkArgument(existingField.getType().getMinorType() == 
MinorType.MAP,
+"Field " + existingField + " is not a MAP type.");
+Preconditions.checkArgument(existingField.hasSameTypeAndMode(newField),
+"Field " + existingField + " and " + newField + " not same.");
+MaterializedField merged = existingField.clone();
+privateMerge(merged, newField);
+return merged;
+  }
+
+  private void privateMerge(MaterializedField existingField, MaterializedField 
newField) {
+Preconditions.checkArgument(existingField.getType().getMinorType() == 
MinorType.MAP,
+"Field " + existingField + " is not a MAP type.");
+for (MaterializedField newChild : newField.getChildren()) {
+  String newChildName = newChild.getName();
+  MaterializedField foundExistingChild = getFieldByName(newChildName, 
existingField);
+  if (foundExistingChild != null) {
+if (foundExistingChild.hasSameTypeAndMode(newChild)) {
+  if (foundExistingChild.getType().getMinorType() == MinorType.MAP) {
+privateMerge(foundExistingChild, newChild);
+  } // else we already have it
+} else {
+  // error
+  throw new SchemaChangeRuntimeException("Not the same schema for " + 
foundExistingChild + " and " + newChild);
+}
+  } else {
+existingField.addChild(newChild.clone());
+  }
+}
+  }
+
+  private MaterializedField getFieldByName(String newChildName, 
MaterializedField existingField) {
+for (MaterializedField f : existingField.getChildren()) {
+  if (newChildName.equalsIgnoreCase(f.getName())) {
+return f;
+  }
+}
+return null;
+  }
+
+  public Path findSchemaFile(Path dir) throws IOException {
+int MAX_DEPTH = 5;
+int depth = 0;
+while (dir != null && depth < MAX_DEPTH) {
 
 Review comment:
   yes, I've already changed that. Thanks.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub an

[GitHub] jcmcote commented on a change in pull request #1500: DRILL-6820: Msgpack format reader

2018-11-06 Thread GitBox
jcmcote commented on a change in pull request #1500: DRILL-6820: Msgpack format 
reader
URL: https://github.com/apache/drill/pull/1500#discussion_r231341649
 
 

 ##
 File path: 
contrib/format-msgpack/src/main/java/org/apache/drill/exec/store/msgpack/MsgpackSchema.java
 ##
 @@ -0,0 +1,114 @@
+package org.apache.drill.exec.store.msgpack;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.exception.SchemaChangeRuntimeException;
+import org.apache.drill.exec.proto.UserBitShared.SerializedField;
+import org.apache.drill.exec.proto.UserBitShared.SerializedField.Builder;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.shaded.guava.com.google.common.base.Preconditions;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.security.AccessControlException;
+
+import com.google.protobuf.TextFormat;
+import com.google.protobuf.TextFormat.ParseException;
+
+public class MsgpackSchema {
+  public static final String SCHEMA_FILE_NAME = ".schema.proto";
+
+  @SuppressWarnings("unused")
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(MsgpackSchema.class);
+
+  private DrillFileSystem fileSystem;
+
+  public MsgpackSchema(DrillFileSystem fileSystem) {
+this.fileSystem = fileSystem;
+  }
+
+  public MaterializedField load(Path schemaLocation) throws 
AccessControlException, FileNotFoundException, IOException {
+MaterializedField previousMapField = null;
+if (schemaLocation != null && fileSystem.exists(schemaLocation)) {
+  try (FSDataInputStream in = fileSystem.open(schemaLocation)) {
+String schemaData = IOUtils.toString(in);
+Builder newBuilder = SerializedField.newBuilder();
+try {
+  TextFormat.merge(schemaData, newBuilder);
+} catch (ParseException e) {
+  throw new DrillRuntimeException("Failed to read schema file: " + 
schemaLocation, e);
+}
+SerializedField read = newBuilder.build();
+previousMapField = MaterializedField.create(read);
+  }
+}
+return previousMapField;
+  }
+
+  public void save(MaterializedField mapField, Path schemaLocation) throws 
IOException {
+try (FSDataOutputStream out = fileSystem.create(schemaLocation, true)) {
+  SerializedField serializedMapField = mapField.getSerializedField();
+  String data = TextFormat.printToString(serializedMapField);
+  IOUtils.write(data, out);
+}
+  }
+
+  public MaterializedField merge(MaterializedField existingField, 
MaterializedField newField) {
+Preconditions.checkArgument(existingField.getType().getMinorType() == 
MinorType.MAP,
+"Field " + existingField + " is not a MAP type.");
+Preconditions.checkArgument(existingField.hasSameTypeAndMode(newField),
+"Field " + existingField + " and " + newField + " not same.");
+MaterializedField merged = existingField.clone();
+privateMerge(merged, newField);
+return merged;
+  }
+
+  private void privateMerge(MaterializedField existingField, MaterializedField 
newField) {
+Preconditions.checkArgument(existingField.getType().getMinorType() == 
MinorType.MAP,
+"Field " + existingField + " is not a MAP type.");
+for (MaterializedField newChild : newField.getChildren()) {
+  String newChildName = newChild.getName();
+  MaterializedField foundExistingChild = getFieldByName(newChildName, 
existingField);
+  if (foundExistingChild != null) {
+if (foundExistingChild.hasSameTypeAndMode(newChild)) {
+  if (foundExistingChild.getType().getMinorType() == MinorType.MAP) {
+privateMerge(foundExistingChild, newChild);
+  } // else we already have it
+} else {
+  // error
 
 Review comment:
   The fields are all nullable, correct. It supports many "sizes" for Number, 
but the Java library I'm using rolls all those up into Number. So yes, there 
are just a few types to handle. But there is a distinction between Number 
and Float, so it's not as difficult as in JSON, where you might see "1", "2" 
and then "3.1".
   
   Again I will add more comments and documentation.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] jcmcote commented on a change in pull request #1500: DRILL-6820: Msgpack format reader

2018-11-06 Thread GitBox
jcmcote commented on a change in pull request #1500: DRILL-6820: Msgpack format 
reader
URL: https://github.com/apache/drill/pull/1500#discussion_r231340882
 
 

 ##
 File path: 
contrib/format-msgpack/src/main/java/org/apache/drill/exec/store/msgpack/MsgpackSchema.java
 ##
 @@ -0,0 +1,114 @@
+package org.apache.drill.exec.store.msgpack;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.exception.SchemaChangeRuntimeException;
+import org.apache.drill.exec.proto.UserBitShared.SerializedField;
+import org.apache.drill.exec.proto.UserBitShared.SerializedField.Builder;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.shaded.guava.com.google.common.base.Preconditions;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.security.AccessControlException;
+
+import com.google.protobuf.TextFormat;
+import com.google.protobuf.TextFormat.ParseException;
+
+public class MsgpackSchema {
+  public static final String SCHEMA_FILE_NAME = ".schema.proto";
+
+  @SuppressWarnings("unused")
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(MsgpackSchema.class);
+
+  private DrillFileSystem fileSystem;
+
+  public MsgpackSchema(DrillFileSystem fileSystem) {
+this.fileSystem = fileSystem;
+  }
+
+  public MaterializedField load(Path schemaLocation) throws 
AccessControlException, FileNotFoundException, IOException {
+MaterializedField previousMapField = null;
+if (schemaLocation != null && fileSystem.exists(schemaLocation)) {
+  try (FSDataInputStream in = fileSystem.open(schemaLocation)) {
+String schemaData = IOUtils.toString(in);
+Builder newBuilder = SerializedField.newBuilder();
+try {
+  TextFormat.merge(schemaData, newBuilder);
+} catch (ParseException e) {
+  throw new DrillRuntimeException("Failed to read schema file: " + 
schemaLocation, e);
+}
+SerializedField read = newBuilder.build();
+previousMapField = MaterializedField.create(read);
+  }
+}
+return previousMapField;
+  }
+
+  public void save(MaterializedField mapField, Path schemaLocation) throws 
IOException {
+try (FSDataOutputStream out = fileSystem.create(schemaLocation, true)) {
+  SerializedField serializedMapField = mapField.getSerializedField();
+  String data = TextFormat.printToString(serializedMapField);
+  IOUtils.write(data, out);
+}
+  }
+
+  public MaterializedField merge(MaterializedField existingField, 
MaterializedField newField) {
 
 Review comment:
   I don't handle multiple files in parallel right now.
   
   Schema evolution is also not handled. As you will recall from my questions on 
the dev mailing list, I have come across cases where an array would have both 
VARCHAR and VARBINARY. In that case I would get a schema with an array of type 
VARCHAR (having skipped over the error of not being able to write VARBINARY 
into that array).
   
   By looking at the logs I can see these values are being skipped. I then 
manually update the schema to tell it I want that array to be VARBINARY. From 
then on, when I read, the VARCHAR values are coerced into VARBINARY.
   
   So for now I use the schema learning to discover the bulk of the structure 
and fix any edge cases manually.
   
   But I think the basis is there to build a smarter discovery mechanism that 
could evolve the schema as it sees more of the data.
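
   A rough sketch of that coercion in plain Java (not the PR's actual writer 
code; the names below are illustrative): once the schema file says the column 
is VARBINARY, later VARCHAR values can be written as their UTF-8 bytes instead 
of failing the batch.

import java.nio.charset.StandardCharsets;

public class CoerceToVarbinary {

  /** Coerce a value into the VARBINARY type declared by the schema file. */
  static byte[] coerce(Object value) {
    if (value instanceof byte[]) {
      return (byte[]) value;                                    // already binary
    }
    return value.toString().getBytes(StandardCharsets.UTF_8);   // VARCHAR -> VARBINARY via UTF-8 bytes
  }

  public static void main(String[] args) {
    System.out.println(coerce("abc").length);               // 3
    System.out.println(coerce(new byte[] {1, 2}).length);   // 2
  }
}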
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] jcmcote commented on a change in pull request #1500: DRILL-6820: Msgpack format reader

2018-11-06 Thread GitBox
jcmcote commented on a change in pull request #1500: DRILL-6820: Msgpack format 
reader
URL: https://github.com/apache/drill/pull/1500#discussion_r231339350
 
 

 ##
 File path: 
contrib/format-msgpack/src/main/java/org/apache/drill/exec/store/msgpack/MsgpackSchema.java
 ##
 @@ -0,0 +1,114 @@
+package org.apache.drill.exec.store.msgpack;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+
+import org.apache.commons.io.IOUtils;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.exception.SchemaChangeRuntimeException;
+import org.apache.drill.exec.proto.UserBitShared.SerializedField;
+import org.apache.drill.exec.proto.UserBitShared.SerializedField.Builder;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.shaded.guava.com.google.common.base.Preconditions;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.security.AccessControlException;
+
+import com.google.protobuf.TextFormat;
+import com.google.protobuf.TextFormat.ParseException;
+
+public class MsgpackSchema {
+  public static final String SCHEMA_FILE_NAME = ".schema.proto";
+
+  @SuppressWarnings("unused")
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(MsgpackSchema.class);
+
+  private DrillFileSystem fileSystem;
+
+  public MsgpackSchema(DrillFileSystem fileSystem) {
+this.fileSystem = fileSystem;
+  }
+
+  public MaterializedField load(Path schemaLocation) throws 
AccessControlException, FileNotFoundException, IOException {
+MaterializedField previousMapField = null;
+if (schemaLocation != null && fileSystem.exists(schemaLocation)) {
+  try (FSDataInputStream in = fileSystem.open(schemaLocation)) {
+String schemaData = IOUtils.toString(in);
+Builder newBuilder = SerializedField.newBuilder();
+try {
+  TextFormat.merge(schemaData, newBuilder);
+} catch (ParseException e) {
+  throw new DrillRuntimeException("Failed to read schema file: " + 
schemaLocation, e);
+}
+SerializedField read = newBuilder.build();
+previousMapField = MaterializedField.create(read);
+  }
+}
+return previousMapField;
+  }
+
+  public void save(MaterializedField mapField, Path schemaLocation) throws 
IOException {
 
 Review comment:
   No, it does not handle multi-threaded cases. Right now I turn on learning mode 
and submit queries of the form
   select * from dfs.root.`dir/aSingleFile.mp`
   I can scan more files, but only one at a time. I then copy that schema file 
into all directories containing that type of data. That's fine for my purposes 
right now. It's a bit manual and not ready for prime time as it is.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] jcmcote commented on a change in pull request #1500: DRILL-6820: Msgpack format reader

2018-11-06 Thread GitBox
jcmcote commented on a change in pull request #1500: DRILL-6820: Msgpack format 
reader
URL: https://github.com/apache/drill/pull/1500#discussion_r231338575
 
 

 ##
 File path: 
contrib/format-msgpack/src/main/java/org/apache/drill/exec/store/msgpack/MsgpackReader.java
 ##
 @@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.msgpack;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.BitSet;
+import java.util.EnumMap;
+import java.util.List;
+
+import org.apache.drill.common.expression.PathSegment;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.msgpack.valuewriter.AbstractValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.ArrayValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.BinaryValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.BooleanValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.ExtensionValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.FloatValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.IntegerValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.MapValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.StringValueWriter;
+import org.apache.drill.exec.vector.complex.fn.FieldSelection;
+import org.apache.drill.exec.vector.complex.writer.BaseWriter;
+import org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter;
+import org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter;
+import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
+import org.msgpack.core.MessageInsufficientBufferException;
+import org.msgpack.core.MessagePack;
+import org.msgpack.core.MessageUnpacker;
+import org.msgpack.value.MapValue;
+import org.msgpack.value.Value;
+import org.msgpack.value.ValueType;
+
+import io.netty.buffer.DrillBuf;
+
+public class MsgpackReader {
+
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(MsgpackReader.class);
+  private final List<SchemaPath> columns;
+  private final FieldSelection rootSelection;
+  protected MessageUnpacker unpacker;
+  protected MsgpackReaderContext context;
+  private final boolean skipQuery;
+
+  /**
+   * Collection for tracking empty array writers during reading and storing 
them
+   * for initializing empty arrays
+   */
+  private final List<ListWriter> emptyArrayWriters = Lists.newArrayList();
+  private boolean allTextMode = false;
+  private MapValueWriter mapValueWriter;
+
+  public MsgpackReader(InputStream stream, MsgpackReaderContext context, 
DrillBuf managedBuf, List<SchemaPath> columns,
+  boolean skipQuery) {
+
+this.context = context;
+this.context.workBuf = managedBuf;
+this.unpacker = MessagePack.newDefaultUnpacker(stream);
+this.columns = columns;
+this.skipQuery = skipQuery;
+rootSelection = FieldSelection.getFieldSelection(columns);
+EnumMap<ValueType, AbstractValueWriter> valueWriterMap = new 
EnumMap<>(ValueType.class);
+valueWriterMap.put(ValueType.ARRAY, new ArrayValueWriter(valueWriterMap, 
emptyArrayWriters));
+valueWriterMap.put(ValueType.FLOAT, new FloatValueWriter());
+valueWriterMap.put(ValueType.INTEGER, new IntegerValueWriter());
+valueWriterMap.put(ValueType.BOOLEAN, new BooleanValueWriter());
+valueWriterMap.put(ValueType.STRING, new StringValueWriter());
+valueWriterMap.put(ValueType.BINARY, new BinaryValueWriter());
+valueWriterMap.put(ValueType.EXTENSION, new ExtensionValueWriter());
 
 Review comment:
   I'll put more comments in the code. Basically this is a switch implemented 
using an EnumMap. In the ComplexValueWriter I use this switch to look up which 
class will handle writing a value type. Here's the line of code from the 
writeElement method:
   
 valueWriterMap.get(value.getValueType()).write(value, mapWriter, 
fieldName, listWriter, selection, schema);
   So based on the value type I get the corresponding writer class to use.
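
   A simplified, generic sketch of that EnumMap-as-switch pattern (the enum 
constants and writer types below are illustrative, not the PR's actual classes):

import java.util.EnumMap;

public class EnumMapDispatch {

  enum ValueType { INTEGER, STRING }

  interface ValueWriter {
    void write(Object value);
  }

  public static void main(String[] args) {
    // The map plays the role of a switch: one handler per enum constant.
    EnumMap<ValueType, ValueWriter> writers = new EnumMap<>(ValueType.class);
    writers.put(ValueType.INTEGER, v -> System.out.println("write int: " + v));
    writers.put(ValueType.STRING, v -> System.out.println("write string: " + v));

    ValueType incoming = ValueType.STRING;
    writers.get(incoming).write("hello");   // one lookup instead of a switch statement
  }
}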
   
   


This is an automated message from the Apache Git Service.
To respond to the 

[GitHub] bbevens closed pull request #1526: Md 4946: Added Developer Day and User Meet-up links

2018-11-06 Thread GitBox
bbevens closed pull request #1526: Md 4946: Added Developer Day and User 
Meet-up links
URL: https://github.com/apache/drill/pull/1526
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/blog/_posts/2018-10-16-drill-user-meetup.md 
b/blog/_posts/2018-10-16-drill-user-meetup.md
new file mode 100644
index 000..7faae8dce0a
--- /dev/null
+++ b/blog/_posts/2018-10-16-drill-user-meetup.md
@@ -0,0 +1,8 @@
+---
+layout: post
+title: "Drill User Meetup 2018"
+code:
+excerpt: Drill User Meetup coming soon.
+authors: ["bbevens"]
+date: 2018-10-16 19:18:04 UTC
+---
diff --git a/index.html b/index.html
index e0b90d4bd20..f831e8cdb2c 100755
--- a/index.html
+++ b/index.html
@@ -72,9 +72,11 @@ Schema-free SQL Query Engine fo
   {% assign post = site.categories.blog[0] %}
   https://t.co/9752ikEZsO";>{% if post.news_title %}{{ 
post.news_title }}{% else %}{{ post.title }}{% endif %}({% 
include authors.html %})
   {% assign post = site.categories.blog[1] %}
-  {% if post.news_title 
%}{{ post.news_title }}{% else %}{{ post.title }}{% endif %}({% 
include authors.html %})
+  https://www.eventbrite.com/e/apache-drill-developer-day-tickets-52121673328";>{%
 if post.news_title %}{{ post.news_title }}{% else %}{{ post.title }}{% endif 
%}({% include authors.html %})
   {% assign post = site.categories.blog[2] %}
   {% if post.news_title 
%}{{ post.news_title }}{% else %}{{ post.title }}{% endif %}({% 
include authors.html %})
+  {% assign post = site.categories.blog[3] %}
+  {% if post.news_title 
%}{{ post.news_title }}{% else %}{{ post.title }}{% endif %}({% 
include authors.html %})
 
 
   


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (DRILL-6833) MapRDB queries with Run Time Filters with row_key/Secondary Index Should Support Pushdown

2018-11-06 Thread Gautam Parai (JIRA)
Gautam Parai created DRILL-6833:
---

 Summary: MapRDB queries with Run Time Filters with 
row_key/Secondary Index Should Support Pushdown
 Key: DRILL-6833
 URL: https://issues.apache.org/jira/browse/DRILL-6833
 Project: Apache Drill
  Issue Type: New Feature
Affects Versions: 1.15.0
Reporter: Gautam Parai
Assignee: Gautam Parai
 Fix For: 1.15.0


Drill should push down all row key filters to MapRDB for queries that only have 
WHERE conditions on row_keys. In the following example, the query only has a 
WHERE clause on row_keys:

select t.mscIdentities from dfs.root.`/user/mapr/MixTable` t where t.row_key=
(select max(convert_fromutf8(i.KeyA.ENTRY_KEY)) from 
dfs.root.`/user/mapr/TableIMSI` i where i.row_key='460021050005636')

A row_key lookup can return at most 1 row. So the physical planning must 
leverage MapRDB row_key pushdown to execute the subquery and, with its results, 
execute the outer query. Currently only the inner query is pushed down; the 
outer query requires a table scan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6832) Remove old "unmanaged" sort implementation

2018-11-06 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6832:
--

 Summary: Remove old "unmanaged" sort implementation
 Key: DRILL-6832
 URL: https://issues.apache.org/jira/browse/DRILL-6832
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.14.0
Reporter: Paul Rogers


Several releases back Drill introduced a new "managed" external sort that 
enhanced the sort operator's memory management. To be safe, at the time, the 
new version was controlled by an option, with the ability to revert to the old 
version.

The new version has proven to be stable. The time has come to remove the old 
version.

* Remove the implementation in {{physical.impl.xsort}}.
* Move the implementation from {{physical.impl.xsort.managed}} to the parent 
package.
* Remove the conditional code in the batch creator.
* Remove the option that allowed disabling the new version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] jcmcote commented on issue #1500: DRILL-6820: Msgpack format reader

2018-11-06 Thread GitBox
jcmcote commented on issue #1500: DRILL-6820: Msgpack format reader
URL: https://github.com/apache/drill/pull/1500#issuecomment-436417739
 
 
   > @jcmcote, in 
[HADOOP-13578](https://github.com/apache/hadoop/commit/a0a276162147e843a5a4e028abdca5b66f5118da#diff-8983b1157165c3e54c37877bc01da2ec)
 was added ZStandard Compression to the hadoop library. I think it would be 
better to collaborate with existing well-tested implementation instead of 
introducing the custom one.
   
   Agreed. When will Drill pick up the new version of Hadoop? Is it a big deal 
to upgrade the version of Hadoop used?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] jcmcote commented on a change in pull request #1500: DRILL-6820: Msgpack format reader

2018-11-06 Thread GitBox
jcmcote commented on a change in pull request #1500: DRILL-6820: Msgpack format 
reader
URL: https://github.com/apache/drill/pull/1500#discussion_r231300996
 
 

 ##
 File path: 
contrib/format-msgpack/src/main/java/org/apache/drill/exec/store/msgpack/MsgpackReader.java
 ##
 @@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.msgpack;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.BitSet;
+import java.util.EnumMap;
+import java.util.List;
+
+import org.apache.drill.common.expression.PathSegment;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.store.msgpack.valuewriter.AbstractValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.ArrayValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.BinaryValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.BooleanValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.ExtensionValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.FloatValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.IntegerValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.MapValueWriter;
+import org.apache.drill.exec.store.msgpack.valuewriter.StringValueWriter;
+import org.apache.drill.exec.vector.complex.fn.FieldSelection;
+import org.apache.drill.exec.vector.complex.writer.BaseWriter;
+import org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter;
+import org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter;
+import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
+import org.msgpack.core.MessageInsufficientBufferException;
+import org.msgpack.core.MessagePack;
+import org.msgpack.core.MessageUnpacker;
+import org.msgpack.value.MapValue;
+import org.msgpack.value.Value;
+import org.msgpack.value.ValueType;
+
+import io.netty.buffer.DrillBuf;
+
+public class MsgpackReader {
+
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(MsgpackReader.class);
+  private final List<SchemaPath> columns;
+  private final FieldSelection rootSelection;
+  protected MessageUnpacker unpacker;
+  protected MsgpackReaderContext context;
+  private final boolean skipQuery;
+
+  /**
+   * Collection for tracking empty array writers during reading and storing 
them
+   * for initializing empty arrays
+   */
+  private final List<ListWriter> emptyArrayWriters = Lists.newArrayList();
+  private boolean allTextMode = false;
+  private MapValueWriter mapValueWriter;
+
+  public MsgpackReader(InputStream stream, MsgpackReaderContext context, 
DrillBuf managedBuf, List<SchemaPath> columns,
+  boolean skipQuery) {
 
 Review comment:
   msgpack files are a sequence of messages. In theory they can be split, but you 
would need to scan the file using the msgpack library to know where you could 
split them. How would a CSV file be split? Is there code that scans the file 
and determines where the offset should be?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Re: [DISCUSS] 1.15.0 release

2018-11-06 Thread Karthikeyan Manivannan
We should try to get "DRILL-5671: Set secure ACLs (Access Control List) for
Drill ZK nodes in a secure cluster" into 1.15.
Already +1ed by a committer but waiting for a +1 from another committer who
had also participated in the review.

On Tue, Nov 6, 2018 at 9:46 AM Vitalii Diravka  wrote:

> Hi Drillers,
>
> It's been 3 months since the last release and it is time to do the next
> one.
>
> I'll volunteer to manage the release :)
>
> There are 32 open tickets that are still intended to be included in 1.15.0
> release [1].
> What do you guys think which tickets do we want to include and what time
> will it take?
> If there are any other issues on which work is in progress, that you feel
> we *must* include in the release, please post in reply to this thread.
>
> Based on your input we'll define release cut off date.
>
> [1]
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_issues_-3Fjql-3Dproject-2520-253D-2520DRILL-2520AND-2520status-2520in-2520-28Open-252C-2520-2522In-2520Progress-2522-252C-2520Reopened-252C-2520Reviewable-252C-2520Accepted-29-2520AND-2520fixVersion-2520-253D-25201.15.0-2520AND-2520-28component-2520-21-253D-2520Documentation-2520OR-2520component-2520is-2520null-29-2520-2520AND-2520-28labels-2520-21-253D-2520ready-2Dto-2Dcommit-2520OR-2520labels-2520is-2520null-29-2520ORDER-2520BY-2520status-2520DESC-252C-2520updated-2520DESC&d=DwIFaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=HlugibuI4IVjs-VMnFvNTcaBtEaDDqE4Ya96cugWqJ8&m=akL-98gphoiivbhujdlv6Tf87UzIw9APJX71G9BDSFw&s=A8aoDFlrsbwTNj6NZPNVHYYMIyqVmfKEeDSM0Jb_ZZc&e=
>
> Kind regards
> Vitalii
>


Re: Handling schema change in blocking operators

2018-11-06 Thread Paul Rogers
Hi Aman,

I would completely agree with the analysis -- except for the fact that we can't 
create a general solution, only a patchwork of incomplete ad-hoc solutions. The 
question is not whether it would be useful to have a general solution (it 
would), rather whether it is technically possible without some help from the 
user (it is not, IMHO.)

I like the scenario presented; it gives us a concrete example. Let's say an IoT 
device produced files with an evolving schema. A field in a JSON file started 
as BIGINT, later became DOUBLE, and finally became VARCHAR. What should Drill 
do? Maybe the values are:
1
1.1
1.33

The change of types might represent the idea that the above are money amounts, 
and the only way to represent values exactly is with a string (in JSON) and 
with a DECIMAL in Drill.

Or, maybe the values are:
1
1.1
1.1rev3

This shows that the value is a version string. Early developers thought to 
use an integer, later they wanted minor versions, and even later they realized 
they needed a patch value. The correct value type is VARCHAR.

One can also invent a scenario in which the proper type is BIGINT, DOUBLE or 
even TIMESTAMP.

Since Drill can't know the user's intention, we can invest quite a bit of 
effort and still not solve the problem.

What is the alternative?

Suppose we simply let the query fail when we see a schema change, but we point 
the user to a solution:

Query failed: Schema conflict on column `foo`: BIGINT and DOUBLE.
Use a schema file to resolve the ambiguity.
See http://drill.apache.org/docs/schema-file for more information.

Now, the user is in control: we stated what we can and cannot do and gave the 
user the option to decide on the data type.

This is a special case of other use cases: it works just as well for specifying 
CSV types, refining JSON types and so on. A single solution that solves 
multiple problems.

This approach also solves the problem that the JDBC and ODBC clients can't 
handle a schema that changes during processing. (The native Drill client can, 
which is a rather cool feature. xDBC hasn't caught up, so we have to deal with 
them as they are.)

In fact, Drill could then say: if your data is nice and clean, query it without 
a schema since the data speaks for itself. If, however, your data is messy (as 
real-world data tends to be), just provide a schema to explain the intent and 
Drill will do the right thing.

And, again, if the team tried the schema solution first, you'd be in a much 
better position to see what additional benefits could be had by trying to guess 
the type (and solving the time-travel issue.) (This is the lazy approach: do 
the least amount of work...)

In fact, it may turn out that schema change as an issue disappears once users 
have a nice, clean, well-understood solution -- the schema file.

Thanks,
- Paul

 

On Tuesday, November 6, 2018, 9:41:21 AM PST, Aman Sinha 
 wrote:  
 
 Hi Paul,
Thanks for the feedback! I am completely in favor of doing the schema
discovery and schema hinting.  But even on this list in the past we have
discussed other use cases such as IoT devices where the schema-on-read is
needed (I think it was in the context of the 'death of schema-on-read'
email thread).  As I mentioned in my prior email, JSON document databases
don't have pre-defined schema and even if one does schema discovery, it
will have to be continuously updated given that these DBs are used in
operational applications where data is streaming in at a fast rate.

I think we should try for a complementary approach - wherever schema
discovery or hinting is feasible, Drill would use it. For other
scenarios, can we do a best effort and not fail the query?

Note that I don't want to backtrack and revise the data types of the rows
already sent to the client. In fact, today, if you have 2 files with
different schemas and the columns are projected as below, the query will
return data to the client in separate batches. This is a common way for
Drill users to do data exploration (with a LIMIT clause).
(file 1: {a: 10, b: 20.5}  file 2: {a: "cat", b: "dog"} )

0: jdbc:drill:zk=local> select a, b from dfs.`/tmp/table2` ;

+------+-------+
|  a   |   b   |
+------+-------+
| 10   | 20.5  |
| cat  | dog   |
+------+-------+

You mention 'Drill can't predict the future', which is true, and I am saying
we don't need to predict the future. If all operators did what the Scan
readers do, which is emit a new record batch when they encounter a new
schema, then conceptually it would get us much farther along.
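
A toy sketch of that behavior in plain Java (not Drill's operator API; names 
are illustrative): whenever an incoming row carries a different schema than the 
batch being built, flush the current batch and start a new one.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SchemaAwareBatcher {

  /**
   * Splits an incoming row stream into output batches, starting a new batch
   * whenever the row's schema differs from the batch currently being built.
   */
  static List<List<String>> toBatches(List<String> rowSchemas, List<String> rows) {
    List<List<String>> batches = new ArrayList<>();
    String currentSchema = null;
    List<String> current = null;
    for (int i = 0; i < rows.size(); i++) {
      if (current == null || !rowSchemas.get(i).equals(currentSchema)) {
        currentSchema = rowSchemas.get(i);   // schema change: close the old batch, start a new one
        current = new ArrayList<>();
        batches.add(current);
      }
      current.add(rows.get(i));
    }
    return batches;
  }

  public static void main(String[] args) {
    List<String> schemas = Arrays.asList("ints", "ints", "strings");  // third row has a new schema
    List<String> rows = Arrays.asList("10|20.5", "11|30.5", "cat|dog");
    System.out.println(toBatches(schemas, rows));  // [[10|20.5, 11|30.5], [cat|dog]]
  }
}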

The point is: let's assume the client side is able to handle 2 different
schemas; how can Drill internally handle that in the execution plan? For
the non-blocking operators it means that as soon as the schema changes, the
operator emits the previous record batch and starts a new output batch. For the
blocking operators, there are more things to take care of, and I created
DRILL-6829 

Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-06 Thread Paul Rogers
Hi All,

Stats would be a great addition. Here are a couple of issues that came up in 
the earlier code review, revisited in light of recent proposed work.

First, the code to gather the stats is rather complex; it is the evolution of 
some work an intern did way back when. We'd be advised to find a simpler 
implementation, ideally one that uses mechanisms we already have.

Second, at present, we have no good story for storing the stats. The file-based 
approach is similar to that used for Parquet metadata, and there are many known 
concurrency issues with that approach -- it is not something to emulate.

One possible approach is to convert metadata gathering to a plain old query. 
That is, rather than having a special mechanism to gather stats, just add 
functions in Drill. Maybe we want NDV and a histogram. (Can't recall all the 
stats that Gautam implemented.) Just implement them as new functions:

SELECT ndv(foo), histogram(foo, 10), ndv(bar), histogram(bar, 10) FROM myTable;

The above would simply display the stats (with the histogram presented as a 
Drill array with 10 buckets.)

Such an approach could build on the aggregation mechanism that already exists, 
and would avoid the use of the complex map structure in the current PR. It 
would also give QA and users an easy way to check the stats values.
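
For illustration, a rough sketch in plain Java (not Drill's UDF framework) of 
what an ndv() and a 10-bucket histogram() aggregate would compute; a production 
ndv() would likely use an approximate structure such as HyperLogLog rather than 
an exact set.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ColumnStatsSketch {

  /** Exact distinct-value count (a real ndv() would likely approximate this). */
  static long ndv(double[] column) {
    Set<Double> distinct = new HashSet<>();
    for (double v : column) {
      distinct.add(v);
    }
    return distinct.size();
  }

  /** Equi-width histogram with the requested number of buckets over [min, max]. */
  static long[] histogram(double[] column, int buckets, double min, double max) {
    long[] counts = new long[buckets];
    double width = (max - min) / buckets;
    for (double v : column) {
      int b = (int) Math.min(buckets - 1, (v - min) / width);  // clamp the max value into the last bucket
      counts[b]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    double[] foo = {1, 2, 2, 3, 10};
    System.out.println(ndv(foo));                                    // 4
    System.out.println(Arrays.toString(histogram(foo, 10, 1, 10)));  // counts per bucket
  }
}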

Later, when the file problem is solved, or the metastore is available, some 
process can kick off a query of the appropriate form and write the results to 
the metastore in a concurrency-safe way. And, a COMPUTE STATS command would 
just be a wrapper around the above query along with writing the stats to some 
location.

Just my two cents...

Thanks,
- Paul

 

On Tuesday, November 6, 2018, 2:51:35 AM PST, Vitalii Diravka 
 wrote:  
 
 +1
It will help to rely on that code in the process of implementing Drill
Metastore, DRILL-6552.

@Gautam Please address all current commits and rebase onto latest master,
then Vova and I will do an additional review of it.
Just for clarification, am I right that the state of the changes is the same as
in the last comment in DRILL-1328 [1]
(they will not include histograms and will cause some regressions for the TPC-H
and TPC-DS benchmarks)?

[1]
https://issues.apache.org/jira/browse/DRILL-1328?focusedCommentId=16061374&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16061374


Kind regards
Vitalii


On Tue, Nov 6, 2018 at 1:47 AM Parth Chandra  wrote:

> +1
> I'd say go for it.
> If the option to use enhanced stats can be turned on per session, then users
> can experiment and choose to turn it on for queries where they do not
> experience performance degradation.
>
>
> On Fri, Nov 2, 2018 at 3:25 PM Gautam Parai  wrote:
>
> > Hi all,
> >
> > I had an initial implementation for statistics support for Drill
> > [DRILL-1328] . This
> JIRA
> > has links to the design spec as well as the PR. Unfortunately, because of
> > some regressions on performance benchmarks (TPCH/TPCDS) we decided to
> > temporarily shelve the implementation. I would like to resolve the
> pending
> > issues and get the changes in.
> >
> > Hopefully, it will be okay to merge it in as an experimental feature
> since
> > in order to resolve these issues we may need to change the existing join
> > ordering algorithm in Drill, add support for Histograms and a few other
> > planning related issues. Moreover, the community is adding a meta-store
> for
> > Drill [DRILL-6552] .
> > Statistics should also be able to leverage the brand new meta-store
> instead
> > of/in addition to having a custom store implementation.
> >
> > My plan is to address the most critical review comments and get the
> initial
> > version in as an experimental feature. Some other good-to-have aspects
> like
> > handling schema changes during the statistics collection process may be
> > deferred to the next iteration. Subsequently, I will improve these
> > good-to-have features and additional performance improvements. It would
> be
> > great to get the initial implementation in to avoid the rebase issues and
> > allow other community members to use and contribute to the feature.
> >
> > Please take a look at the design doc and the PR and provide suggestions
> and
> > feedback on the JIRA. Also I will try to present the current state of
> > statistics and the feature in one of the bi-weekly Drill Community
> > Hangouts.
> >
> > Thanks,
> > Gautam
> >
>
  

Re: [DISCUSS] 1.15.0 release

2018-11-06 Thread Khurram Faraaz
Hi Vitalii

We should investigate and fix this issue.
https://issues.apache.org/jira/browse/DRILL-6816

Thanks,
Khurram

On Tue, Nov 6, 2018 at 9:46 AM Vitalii Diravka  wrote:

> Hi Drillers,
>
> It's been 3 months since the last release and it is time to do the next
> one.
>
> I'll volunteer to manage the release :)
>
> There are 32 open tickets that are still intended to be included in 1.15.0
> release [1].
> What do you guys think which tickets do we want to include and what time
> will it take?
> If there are any other issues on which work is in progress, that you feel
> we *must* include in the release, please post in reply to this thread.
>
> Based on your input we'll define release cut off date.
>
> [1]
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_issues_-3Fjql-3Dproject-2520-253D-2520DRILL-2520AND-2520status-2520in-2520-28Open-252C-2520-2522In-2520Progress-2522-252C-2520Reopened-252C-2520Reviewable-252C-2520Accepted-29-2520AND-2520fixVersion-2520-253D-25201.15.0-2520AND-2520-28component-2520-21-253D-2520Documentation-2520OR-2520component-2520is-2520null-29-2520-2520AND-2520-28labels-2520-21-253D-2520ready-2Dto-2Dcommit-2520OR-2520labels-2520is-2520null-29-2520ORDER-2520BY-2520status-2520DESC-252C-2520updated-2520DESC&d=DwIFaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=H5JEl9vb-mBIjic10QAbDD2vkUUKAxjO6wZO322RtdI&m=Y3YHmNkTAlAyrEa41_zIzzO0Zar0B7i9XwXs2aBEIKc&s=nZrfAS48g1mSos0XSURcgZ2Btz1TfV0GQLj8Wpob-Do&e=
>
> Kind regards
> Vitalii
>


[GitHub] dvjyothsna opened a new pull request #1526: Md 4946: Added Developer Day and User Meet-up links

2018-11-06 Thread GitBox
dvjyothsna opened a new pull request #1526: Md 4946: Added Developer Day and 
User Meet-up links
URL: https://github.com/apache/drill/pull/1526
 
 
   @aashreya, @bbevens Please review


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] jcmcote commented on a change in pull request #1500: DRILL-6820: Msgpack format reader

2018-11-06 Thread GitBox
jcmcote commented on a change in pull request #1500: DRILL-6820: Msgpack format 
reader
URL: https://github.com/apache/drill/pull/1500#discussion_r231264607
 
 

 ##
 File path: 
contrib/codec-zstd/src/main/java/org/apache/hadoop/io/compress/zstd/ZstdDecompressor.java
 ##
 @@ -0,0 +1,270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.io.compress.zstd;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.io.compress.Decompressor;
+
+import com.github.luben.zstd.ZstdDirectBufferDecompressingStream;
+import com.github.luben.zstd.util.Native;
+
+/**
+ * A {@link Decompressor} based on the zstandard compression algorithm.
+ */
+public class ZstdDecompressor implements Decompressor {
+  private static final Log LOG = LogFactory.getLog(ZstdDecompressor.class.getName());
 
 Review comment:
   These classes are an implementation of the Hadoop codec interface. For 
example, when Drill reads .gz files it relies on the Hadoop library to handle 
gzip decompression.
   
   The way you contribute a new codec is via the Java service provider 
mechanism, that is, the META-INF/services folder.
   
   I've implemented the interface 
org.apache.hadoop.io.compress.CompressionCodec using the zstandard Java library.
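   
   For anyone following along, the wiring looks roughly like this. It is only an 
illustrative sketch, not the PR's actual code: the ZstdCodec class name, the 
ZstdCompressor counterpart and the ".zst" extension are assumptions on my part; 
only ZstdDecompressor appears in the diff above. Registration is a plain text 
file on the classpath, 
META-INF/services/org.apache.hadoop.io.compress.CompressionCodec, containing the 
fully qualified codec class name (e.g. 
org.apache.hadoop.io.compress.zstd.ZstdCodec). The codec class then ties the 
compressor and decompressor together:

package org.apache.hadoop.io.compress.zstd;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.io.compress.*;

// Illustrative sketch of a Hadoop CompressionCodec wrapper; not the PR's code.
public class ZstdCodec implements CompressionCodec {

  @Override
  public CompressionOutputStream createOutputStream(OutputStream out) throws IOException {
    return createOutputStream(out, createCompressor());
  }

  @Override
  public CompressionOutputStream createOutputStream(OutputStream out, Compressor c) throws IOException {
    return new CompressorStream(out, c, 64 * 1024);    // Hadoop drives the Compressor
  }

  @Override
  public Class<? extends Compressor> getCompressorType() {
    return ZstdCompressor.class;    // hypothetical counterpart of ZstdDecompressor
  }

  @Override
  public Compressor createCompressor() {
    return new ZstdCompressor();
  }

  @Override
  public CompressionInputStream createInputStream(InputStream in) throws IOException {
    return createInputStream(in, createDecompressor());
  }

  @Override
  public CompressionInputStream createInputStream(InputStream in, Decompressor d) throws IOException {
    return new DecompressorStream(in, d, 64 * 1024);   // Hadoop drives the Decompressor shown in this diff
  }

  @Override
  public Class<? extends Decompressor> getDecompressorType() {
    return ZstdDecompressor.class;
  }

  @Override
  public Decompressor createDecompressor() {
    return new ZstdDecompressor();
  }

  @Override
  public String getDefaultExtension() {
    return ".zst";    // files with this suffix are routed to this codec
  }
}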


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Re: [DISCUSS] 1.15.0 release

2018-11-06 Thread Charles Givre
Hi Vitalii, 
I have a JIRA which I’ve been working on for a format plugin for Syslog 
formatted data.  It’s basically done and, if I can get the PR submitted in the 
next few days or so, could we try to get that into 1.15?

> On Nov 6, 2018, at 12:45, Vitalii Diravka  wrote:
> 
> Hi Drillers,
> 
> It's been 3 months since the last release and it is time to do the next one.
> 
> I'll volunteer to manage the release :)
> 
> There are 32 open tickets that are still intended to be included in 1.15.0
> release [1].
> Which tickets do you think we should include, and how much time will they
> take?
> If there are any other issues on which work is in progress, that you feel
> we *must* include in the release, please post in reply to this thread.
> 
> Based on your input we'll define the release cut-off date.
> 
> [1]
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20DRILL%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20Reviewable%2C%20Accepted)%20AND%20fixVersion%20%3D%201.15.0%20AND%20(component%20!%3D%20Documentation%20OR%20component%20is%20null)%20%20AND%20(labels%20!%3D%20ready-to-commit%20OR%20labels%20is%20null)%20ORDER%20BY%20status%20DESC%2C%20updated%20DESC
> 
> Kind regards
> Vitalii



[DISCUSS] 1.15.0 release

2018-11-06 Thread Vitalii Diravka
Hi Drillers,

It's been 3 months since the last release and it is time to do the next one.

I'll volunteer to manage the release :)

There are 32 open tickets that are still intended to be included in 1.15.0
release [1].
Which tickets do you think we should include, and how much time will they
take?
If there are any other issues on which work is in progress, that you feel
we *must* include in the release, please post in reply to this thread.

Based on your input we'll define the release cut-off date.

[1]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20DRILL%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20Reviewable%2C%20Accepted)%20AND%20fixVersion%20%3D%201.15.0%20AND%20(component%20!%3D%20Documentation%20OR%20component%20is%20null)%20%20AND%20(labels%20!%3D%20ready-to-commit%20OR%20labels%20is%20null)%20ORDER%20BY%20status%20DESC%2C%20updated%20DESC

Kind regards
Vitalii


Re: Handling schema change in blocking operators

2018-11-06 Thread Aman Sinha
Hi Paul,
Thanks for the feedback!  I am in complete favor of doing the schema
discovery and schema hinting.  But even on this list in the past we have
discussed other use cases such as IoT devices where the schema-on-read is
needed (I think it was in the context of the 'death of schema-on-read'
email thread).   As I mentioned in my prior email, JSON document databases
don't have pre-defined schema and even if one does schema discovery, it
will have to be continuously updated given that these DBs are used in
operational applications where data is streaming in at a fast rate.

I think we should try for a complementary approach - wherever schema
discovery or hinting is feasible, Drill would use it.  For other
scenarios, can we do a best effort and not fail the query?

Note that I don't want to backtrack and revise the data types of the rows
already sent to the client.  In fact, today, if you have 2 files with
different schemas and the columns are projected as below, the query will
return data to the client in separate batches.  This is a common way for
Drill users to do data exploration (with a LIMIT clause).
(file 1: {a: 10, b: 20.5}   file 2: {a: "cat", b: "dog"} )

0: jdbc:drill:zk=local> select a, b from dfs.`/tmp/table2` ;

+------+-------+
|  a   |   b   |
+------+-------+
| 10   | 20.5  |
| cat  | dog   |
+------+-------+

You mention 'Drill can't predict the future', which is true, and I am saying
we don't need to predict the future.  If all operators did what the Scan
readers do, which is emit a new record batch when they encounter a new
schema, then conceptually it would get us much farther along.

The point is: let's assume the client side is able to handle 2 different
schemas; how can Drill internally handle that in the execution plan?  For
the non-blocking operators it means that as soon as the schema changes, the
operator emits the previous Record Batch and starts a new output batch.  For the
blocking operators there are more things to take care of, and I created
DRILL-6829 to capture that.
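
To make the non-blocking case concrete, here is a rough conceptual sketch (pure
pseudocode: names such as currentSchema, outputContainer and the helper methods
are made up for illustration; this is not Drill's actual operator code):

  // Sketch: flush the batch built under the old schema, start a new one for
  // the new schema, and propagate OK_NEW_SCHEMA downstream.
  public IterOutcome innerNext() {
    IterOutcome upstream = incoming.next();

    if (upstream == IterOutcome.OK_NEW_SCHEMA
        && !incoming.getSchema().equals(currentSchema)) {
      if (bufferedRowCount > 0) {
        sendDownstream(outputContainer);                  // emit the previous Record Batch
      }
      currentSchema = incoming.getSchema();
      outputContainer = allocateOutputFor(currentSchema); // fresh batch for the new schema
      bufferedRowCount = 0;
      return IterOutcome.OK_NEW_SCHEMA;
    }

    bufferedRowCount += copyRows(incoming, outputContainer); // same schema: keep filling
    return upstream;
  }

Blocking operators additionally have to reconcile the schemas of everything they
have already buffered, which is exactly what DRILL-6829 is meant to capture.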

Aman

On Mon, Nov 5, 2018 at 8:50 PM Paul Rogers 
wrote:

> Hi Aman,
>
> Thanks much for the write-up. My two cents, FWIW.
>
> As the history of this list has shown, I've fought with the schema change
> issue multiple times: in sort, in JSON, in the row set loader framework,
> and in writing the "Data Engineering" chapter in the Learning Drill book.
>
> What I have come to realize is that there is no general solution to the
> schema change problem. Yes, there are clever things to do in special cases.
> But the general problem is unsolvable.
>
> Look at the open PR for the projection framework. There is an
> implementation of a "schema smoother." It tries really hard, but it
> highlights the inherent limitations of such an effort.
>
> The key reason is that, to do a good job, rows processed now must know the
> types of rows seen 100 million rows from now. Since Drill does not have a
> time machine, that is not possible.
>
> The easiest way to visualize this is with a single fragment that reads two
> files. File A has 100K rows with column C as a Varchar. File B has 100K
> rows with column C as an Int. There is no sort, so all rows are returned
> directly to the client as, say, four 50K batches.
>
> The client will encounter a schema with C as Varchar. Later, it will see C as
> Int. But, since the client already told the JDBC consumer that the type is
> Varchar, the JDBC client is stuck. It could convert the Int to Varchar
> behind the scenes.
>
> Now, run the query again. The order in which Drill reads files is random.
> Second time, the client sees C as an Int. Now, JDBC must convert the later
> Varchar columns to Int. That works if the Varchar are numbers, but not if
> the Ints should have been Varchar.
>
> The general problem as I put it in the book, is that "Drill can't predict
> the future" but that is precisely what is needed for a general solution.
>
> However, if the user sets a policy (treat column C as a DECIMAL, even if
> you read it as an Int or Varchar), then time travel is not necessary.
>
> My humble suggestion is to focus on the schema effort: give the user a way
> to define the resolution to the issue that is right for their data. See how
> that works out for users. Then, with that extra information, go back and
> see what other features might be useful.
>
> The proposed schema support (at least as hints, preferably as a schema
> file, full blown with a metastore) is a much better, easier to understand,
> easier to explain solution that is familiar to anyone coming from a DB
> background.
>
>
> My suggestion: to understand the challenges and limitations, think through
> many different scenarios: look at the history of this list for some, see
> the notes in the Result Set Loader wiki and code for more. Work out how
> they could be resolved. You may see something I've missed, or you may
> realize that the problem is j

[GitHub] gfilipiak commented on issue #1525: DRILL-6831: Adding information about possible authentication settings in connection URL for mongodb

2018-11-06 Thread GitBox
gfilipiak commented on issue #1525: DRILL-6831: Adding information about 
possible authentication settings in connection URL for mongodb
URL: https://github.com/apache/drill/pull/1525#issuecomment-436320863
 
 
   @vdiravka done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (DRILL-6831) Adding information about possible authentication settings in connection URL for mongodb

2018-11-06 Thread Gabriel Filipiak (JIRA)
Gabriel Filipiak created DRILL-6831:
---

 Summary: Adding information about possible authentication settings 
in connection URL for mongodb
 Key: DRILL-6831
 URL: https://issues.apache.org/jira/browse/DRILL-6831
 Project: Apache Drill
  Issue Type: Improvement
  Components: Documentation
Reporter: Gabriel Filipiak


Adding information about possible authentication settings in connection URL for 
mongodb



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] vdiravka commented on issue #1525: Update 090-mongodb-storage-plugin.md

2018-11-06 Thread GitBox
vdiravka commented on issue #1525: Update 090-mongodb-storage-plugin.md
URL: https://github.com/apache/drill/pull/1525#issuecomment-436317464
 
 
   Thanks for the contribution, @gfilipiak
   Could you also create a JIRA and add it to the title of the PR similar to 
this one #1469


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] gfilipiak opened a new pull request #1525: Update 090-mongodb-storage-plugin.md

2018-11-06 Thread GitBox
gfilipiak opened a new pull request #1525: Update 090-mongodb-storage-plugin.md
URL: https://github.com/apache/drill/pull/1525
 
 
   Adding information about possible authentication settings in connection URL 
for mongodb


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-06 Thread Vitalii Diravka
+1
It will help to rely on that code in the process of implementing Drill
Metastore, DRILL-6552.

@Gautam Please address all current comments and rebase onto the latest master;
then Vova and I will do an additional review of it.
Just for clarification, am I right that the state of the changes is the same as
in the last comment in DRILL-1328 [1]
(it will not include histograms and will cause some regressions for the TPC-H
and TPC-DS benchmarks)?

[1]
https://issues.apache.org/jira/browse/DRILL-1328?focusedCommentId=16061374&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16061374


Kind regards
Vitalii


On Tue, Nov 6, 2018 at 1:47 AM Parth Chandra  wrote:

> +1
> I'd say go for it.
> If the option to use enhanced stats can be turned on per session, then users
> can experiment and choose to turn it on for queries where they do not
> experience performance degradation.
>
>
> On Fri, Nov 2, 2018 at 3:25 PM Gautam Parai  wrote:
>
> > Hi all,
> >
> > I had an initial implementation for statistics support for Drill
> > [DRILL-1328] . This
> JIRA
> > has links to the design spec as well as the PR. Unfortunately, because of
> > some regressions on performance benchmarks (TPCH/TPCDS) we decided to
> > temporarily shelve the implementation. I would like to resolve the
> pending
> > issues and get the changes in.
> >
> > Hopefully, it will be okay to merge it in as an experimental feature
> since
> > in order to resolve these issues we may need to change the existing join
> > ordering algorithm in Drill, add support for Histograms and a few other
> > planning related issues. Moreover, the community is adding a meta-store
> for
> > Drill [DRILL-6552] .
> > Statistics should also be able to leverage the brand new meta-store
> instead
> > of/in addition to having a custom store implementation.
> >
> > My plan is to address the most critical review comments and get the
> initial
> > version in as an experimental feature. Some other good-to-have aspects
> like
> > handling schema changes during the statistics collection process may be
> > deferred to the next iteration. Subsequently, I will improve these
> > good-to-have features and additional performance improvements. It would
> be
> > great to get the initial implementation in to avoid the rebase issues and
> > allow other community members to use and contribute to the feature.
> >
> > Please take a look at the design doc and the PR and provide suggestions
> and
> > feedback on the JIRA. Also I will try to present the current state of
> > statistics and the feature in one of the bi-weekly Drill Community
> > Hangouts.
> >
> > Thanks,
> > Gautam
> >
>


[GitHub] oleg-zinovev edited a comment on issue #1446: DRILL-6349: Drill JDBC driver fails on Java 1.9+ with NoClassDefFoundError: sun/misc/VM

2018-11-06 Thread GitBox
oleg-zinovev edited a comment on issue #1446: DRILL-6349: Drill JDBC driver 
fails on Java 1.9+ with NoClassDefFoundError: sun/misc/VM
URL: https://github.com/apache/drill/pull/1446#issuecomment-436205818
 
 
   @vvysotskyi review, please.
   Also, can you validate the `sqlline.bat` on Windows? I have no Windows PC 
available


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] oleg-zinovev commented on issue #1446: DRILL-6349: Drill JDBC driver fails on Java 1.9+ with NoClassDefFoundError: sun/misc/VM

2018-11-06 Thread GitBox
oleg-zinovev commented on issue #1446: DRILL-6349: Drill JDBC driver fails on 
Java 1.9+ with NoClassDefFoundError: sun/misc/VM
URL: https://github.com/apache/drill/pull/1446#issuecomment-436205818
 
 
   @vvysotskyi review, please.
   Also, can you validate the sqlline.bat on Windows? I have no Windows PC 
available


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] oleg-zinovev commented on issue #1446: DRILL-6349: Drill JDBC driver fails on Java 1.9+ with NoClassDefFoundError: sun/misc/VM

2018-11-06 Thread GitBox
oleg-zinovev commented on issue #1446: DRILL-6349: Drill JDBC driver fails on 
Java 1.9+ with NoClassDefFoundError: sun/misc/VM
URL: https://github.com/apache/drill/pull/1446#issuecomment-436203549
 
 
   @vvysotskyi 
   >  Let's set false for 
maven-surefire-plugin
   
   Looks like this problem occurs only in the Debian OpenJDK build: 
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=911925
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] lushuifeng commented on a change in pull request #1524: DRILL-6830: Remove Hook.REL_BUILDER_SIMPLIFY handler after use

2018-11-06 Thread GitBox
lushuifeng commented on a change in pull request #1524: DRILL-6830: Remove 
Hook.REL_BUILDER_SIMPLIFY handler after use
URL: https://github.com/apache/drill/pull/1524#discussion_r231061174
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/SqlConverter.java
 ##
 @@ -377,11 +377,15 @@ public RelRoot toRel(final SqlNode validatedNode) {
  * during creating new projects since it may cause changing data mode
  * which causes to assertion errors during type validation
  */
-Hook.REL_BUILDER_SIMPLIFY.add(Hook.propertyJ(false));
+Hook.Closeable closeable = Hook.REL_BUILDER_SIMPLIFY.add(Hook.propertyJ(false));
 
-//To avoid unexpected column errors set a value of top to false
-final RelRoot rel = sqlToRelConverter.convertQuery(validatedNode, false, false);
-return rel.withRel(sqlToRelConverter.flattenTypes(rel.rel, true));
+try {
+  //To avoid unexpected column errors set a value of top to false
+  final RelRoot rel = sqlToRelConverter.convertQuery(validatedNode, false, false);
+  return rel.withRel(sqlToRelConverter.flattenTypes(rel.rel, true));
+} finally {
+  closeable.close();
 
 Review comment:
   fixed and squashed


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] vvysotskyi commented on issue #1446: DRILL-6349: Drill JDBC driver fails on Java 1.9+ with NoClassDefFoundError: sun/misc/VM

2018-11-06 Thread GitBox
vvysotskyi commented on issue #1446: DRILL-6349: Drill JDBC driver fails on 
Java 1.9+ with NoClassDefFoundError: sun/misc/VM
URL: https://github.com/apache/drill/pull/1446#issuecomment-436195360
 
 
   @oleg-zinovev, now `javassist 3.24.0-GA` available from the maven central 
repo: https://mvnrepository.com/artifact/org.javassist/javassist/3.24.0-GA


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] vvysotskyi commented on a change in pull request #1524: DRILL-6830: Remove Hook.REL_BUILDER_SIMPLIFY handler after use

2018-11-06 Thread GitBox
vvysotskyi commented on a change in pull request #1524: DRILL-6830: Remove 
Hook.REL_BUILDER_SIMPLIFY handler after use
URL: https://github.com/apache/drill/pull/1524#discussion_r231056622
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/SqlConverter.java
 ##
 @@ -377,11 +377,15 @@ public RelRoot toRel(final SqlNode validatedNode) {
  * during creating new projects since it may cause changing data mode
  * which causes to assertion errors during type validation
  */
-Hook.REL_BUILDER_SIMPLIFY.add(Hook.propertyJ(false));
+Hook.Closeable closeable = Hook.REL_BUILDER_SIMPLIFY.add(Hook.propertyJ(false));
 
-//To avoid unexpected column errors set a value of top to false
-final RelRoot rel = sqlToRelConverter.convertQuery(validatedNode, false, false);
-return rel.withRel(sqlToRelConverter.flattenTypes(rel.rel, true));
+try {
+  //To avoid unexpected column errors set a value of top to false
+  final RelRoot rel = sqlToRelConverter.convertQuery(validatedNode, false, false);
+  return rel.withRel(sqlToRelConverter.flattenTypes(rel.rel, true));
+} finally {
+  closeable.close();
 
 Review comment:
   Could you please replace it with a `try-with-resources` block?
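   Something along these lines (just a sketch of the suggested form based on the
diff above; Calcite's Hook.Closeable extends AutoCloseable, so the handler is
removed even if convertQuery throws):
   
       try (Hook.Closeable ignored = Hook.REL_BUILDER_SIMPLIFY.add(Hook.propertyJ(false))) {
         // To avoid unexpected column errors set a value of top to false
         final RelRoot rel = sqlToRelConverter.convertQuery(validatedNode, false, false);
         return rel.withRel(sqlToRelConverter.flattenTypes(rel.rel, true));
       }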


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] lushuifeng edited a comment on issue #1524: DRILL-6830: Remove Hook.REL_BUILDER_SIMPLIFY handler after use

2018-11-06 Thread GitBox
lushuifeng edited a comment on issue #1524: DRILL-6830: Remove 
Hook.REL_BUILDER_SIMPLIFY handler after use
URL: https://github.com/apache/drill/pull/1524#issuecomment-436191071
 
 
   @vvysotskyi Could you please review this?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] lushuifeng commented on issue #1524: DRILL-6830: Remove Hook.REL_BUILDER_SIMPLIFY handler after use

2018-11-06 Thread GitBox
lushuifeng commented on issue #1524: DRILL-6830: Remove 
Hook.REL_BUILDER_SIMPLIFY handler after use
URL: https://github.com/apache/drill/pull/1524#issuecomment-436191071
 
 
   @vvysotskyi Could you please review this?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] lushuifeng opened a new pull request #1524: DRILL-6830: Remove Hook.REL_BUILDER_SIMPLIFY handler after use

2018-11-06 Thread GitBox
lushuifeng opened a new pull request #1524: DRILL-6830: Remove 
Hook.REL_BUILDER_SIMPLIFY handler after use
URL: https://github.com/apache/drill/pull/1524
 
 
   Please see 
[DRILL-6830](https://issues.apache.org/jira/projects/DRILL/issues/DRILL-6830?filter=allopenissues&orderby=created+DESC%2C+priority+DESC%2C+updated+DESC).
   Other Hook handlers have been checked; no such case was found.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (DRILL-6830) Hook.REL_BUILDER_SIMPLIFY handler didn't removed cause performance degression

2018-11-06 Thread shuifeng lu (JIRA)
shuifeng lu created DRILL-6830:
--

 Summary: Hook.REL_BUILDER_SIMPLIFY handler didn't removed cause 
performance degression
 Key: DRILL-6830
 URL: https://issues.apache.org/jira/browse/DRILL-6830
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning & Optimization
Affects Versions: 1.14.0
Reporter: shuifeng lu
Assignee: shuifeng lu
 Attachments: Screen Shot 2018-11-06 at 16.14.16.png

Planning performance degradation has been observed: the duration of planning 
increased from 30 ms to 160 ms after running Drill for a long period of time 
(say, a month).

RelBuilder.simplify never becomes true if Hook.REL_BUILDER_SIMPLIFY handlers 
are not removed.

Here is a clue (after running for 40 days):

Hook.get takes 8 ms per invocation, and it may be called several times per query.
 ---[8.816063ms] org.apache.calcite.tools.RelBuilder:<init>()
   +---[0.020218ms] java.util.ArrayDeque:<init>()
   +---[0.018493ms] java.lang.Boolean:valueOf()
   +---[8.341566ms] org.apache.calcite.runtime.Hook:get()
   +---[0.008489ms] java.lang.Boolean:booleanValue()
   +---[min=5.21E-4ms,max=0.015832ms,total=0.025233ms,count=12] org.apache.calcite.plan.Context:unwrap()
   +---[min=3.83E-4ms,max=0.009494ms,total=0.014516ms,count=13] org.apache.calcite.util.Util:first()
   +---[0.006892ms] org.apache.calcite.plan.RelOptCluster:getPlanner()
   +---[0.009104ms] org.apache.calcite.plan.RelOptPlanner:getExecutor()
   +---[min=4.8E-4ms,max=0.002277ms,total=0.002757ms,count=2] org.apache.calcite.plan.RelOptCluster:getRexBuilder()
   ---[min=4.91E-4ms,max=0.004586ms,total=0.005077ms,count=2] org.apache.calcite.rex.RexSimplify:<init>()

The top instances in JVM
 num     #instances          #bytes  class name
------------------------------------------------
   1:        116333       116250440  [B
   2:        890126       105084536  [C
   3:        338062        37415944  [Ljava.lang.Object;
   4:       1715004        27440064  org.apache.calcite.runtime.Hook$4
   5:        803909        19293816  java.lang.String

!Screen Shot 2018-11-06 at 16.14.16.png!  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] oleg-zinovev commented on a change in pull request #1450: DRILL-6717: lower and upper functions not works with national characters

2018-11-06 Thread GitBox
oleg-zinovev commented on a change in pull request #1450: DRILL-6717: lower and 
upper functions not works with national characters
URL: https://github.com/apache/drill/pull/1450#discussion_r231030533
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctions.java
 ##
 @@ -786,10 +784,11 @@ public void setup() {
 
 @Override
 public void eval() {
-  out.buffer = buffer = buffer.reallocIfNeeded(input.end - input.start);
+  byte[] result = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.initCap(input.start, input.end, input.buffer);
 
 Review comment:
   @paul-rogers review, please


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services