Extended Vector Framework ready for use

2019-06-13 Thread Paul Rogers
Hi All,

A previous note explained how Drill has added the "Extended Vector Framework" 
(also called the "Row Set Framework") to improve the user's experience with 
Drill. One of Drill's key contributions is "schema-on-read": Drill can make 
sense of many kinds of data files without the hassle of setting up the Hive 
Meta Store (HMS). While Drill can use HMS, it is often more convenient to 
just query a table (directory of files) without first defining a schema in HMS.

The EVF helps to solve two problems that crop up with the schema-on-read 
approach:

* Drill does not know the size of the data to be read, yet each reader must 
limit record batch sizes to a configured maximum.

* File schemas can be ambiguous, resulting in two scan fragments picking 
different column types, which can lead to query failures when Drill tries to 
combine the results.

For users, EVF simply makes Drill work better, especially if they use CREATE 
SCHEMA to tell Drill how to resolve schema ambiguities.
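As an illustration, a provided schema can pin down ambiguous column types ahead 
of time. The statement below is a sketch based on the CREATE SCHEMA 
documentation [1]; the table path and columns are hypothetical:

```sql
CREATE OR REPLACE SCHEMA (
    `event_time` TIMESTAMP,  -- resolved the same way in every scan fragment
    `status`     INT,
    `message`    VARCHAR
)
FOR TABLE dfs.tmp.`web_logs`
```

With a schema like this in place, two scan fragments can no longer pick 
different types for the same column, avoiding the query failures described 
above.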

To achieve our goals, storage and format plugins must change (or be created) to 
use EVF. This is where you come in if you create or maintain plugins.

We've prepared multiple ways for you to learn how to use the EVF:

* The documentation of the CREATE SCHEMA statement. [1]

* The text format plugin now uses EVF. This is, however, not the best example 
because the plugin itself is rather complex.

* Chapter 12 of the Learning Apache Drill book explains how to create a format 
plugin, using the log format plugin as an example. We've converted the log 
format plugin to use EVF (pull request pending at the moment).

* We've created an EVF tutorial that shows how to convert the log plugin to use 
EVF. This connects up Chapter 12 of the Drill book with the recent EVF work. [2]


Please use this mailing list to share questions, comments and suggestions as 
you tackle your own plugins. Each plugin has its own unique quirks and issues 
which we can discuss here.


Thanks,
- Paul

[1] https://drill.apache.org/docs/create-or-replace-schema/


[2] 
https://github.com/paul-rogers/drill/wiki/Developer%27s-Guide-to-the-Enhanced-Vector-Framework





[GitHub] [drill] paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") 
plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-501956556
 
 
   Added the ability to specify the regex (and column schema) in the provided 
schema. Defined a table property for the regex. Although we can't (yet) use 
table properties to define the schema, we can now use `CREATE SCHEMA` to define 
both the regex and the schema.
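
   As a sketch of the idea (the property key and regex below are hypothetical; 
the actual key is whatever the PR defines):

```sql
CREATE OR REPLACE SCHEMA (
    `year`  INT,
    `month` INT
)
FOR TABLE dfs.`logs`
PROPERTIES ('drill.regex' = '(\d\d\d\d)-(\d\d)')
```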


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") 
plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-501956415
 
 
   This PR is now failing due to the Protobuf errors. Thanks @vvysotskyi for 
fixing them. I'll rebase onto that fix once it is committed and the reviewers 
have approved the commits.




[GitHub] [drill] paul-rogers commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
paul-rogers commented on a change in pull request #1807: DRILL-7293: Convert 
the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293633490
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/log/TestLogReader.java
 ##
 @@ -389,4 +416,56 @@ public void testRawUMNoSchema() throws RpcException {
 
 RowSetUtilities.verify(expected, results);
   }
+
+  @Test
+  public void testProvidedSchema() throws Exception {
 
 Review comment:
   Short answer: it does not seem to work. I tried this in the past and found 
that table functions take only simple values (numbers, strings), not lists. 
Since this plugin uses a list, I never could figure out how to use it with 
table functions. In particular, how would the table function know how to create 
the instance of `LogFormatField` within the list? Am I missing something?
   
   This plugin, in particular, would very much benefit from the use of a table 
function so that the user does not have to define a new plugin config for each 
new file type.
   
   If there is a way to make this work, we can add the test and describe the 
answer in the README file.
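
   For reference, the table-function form that does work passes only simple 
scalar properties; here is a hypothetical example for the text plugin:

```sql
-- Scalar properties only: strings, numbers, booleans.
SELECT * FROM table(dfs.`example/data.csv`(
    type => 'text', fieldDelimiter => '|', extractHeader => true))
```

A list-valued property such as this plugin's `schema` (a list of 
`LogFormatField`) has no equivalent syntax, which is the limitation described 
above.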




[GitHub] [drill] paul-rogers commented on issue #1806: DRILL-7292: Remove V1 and V2 text readers

2019-06-13 Thread GitBox
paul-rogers commented on issue #1806: DRILL-7292: Remove V1 and V2 text readers
URL: https://github.com/apache/drill/pull/1806#issuecomment-501926628
 
 
   @arina-ielchiieva, thank you for the review. Please check if the revised 
option messages are clear.
   
   The test failures may be due to bug fixes in the text reader. @vvysotskyi, 
please check whether the files in question use the backslash as the escape 
character. Drill's text readers are configured to use the quote character as 
the escape by default, but the old code did not handle this case correctly. If 
a file contains the pattern `\"` or `""`, the text reader will now produce a 
different result than before.
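
   To illustrate the difference between the two escape conventions (this is a 
simplified standalone sketch, not Drill's actual reader code; it assumes the 
input string is one complete quoted field):

```java
public class QuoteEscapeDemo {

  // Quote-as-escape convention: a doubled quote ("") inside a quoted
  // field represents one literal quote; backslash has no special meaning.
  static String quoteAsEscape(String field) {
    String inner = field.substring(1, field.length() - 1); // drop outer quotes
    return inner.replace("\"\"", "\"");
  }

  // Backslash-escape convention: \" inside a quoted field represents
  // one literal quote.
  static String backslashEscape(String field) {
    String inner = field.substring(1, field.length() - 1); // drop outer quotes
    return inner.replace("\\\"", "\"");
  }

  public static void main(String[] args) {
    String raw = "\"say \\\"hi\\\"\"";        // file bytes: "say \"hi\""
    System.out.println(quoteAsEscape(raw));   // backslashes stay literal
    System.out.println(backslashEscape(raw)); // quotes unescaped: say "hi"
  }
}
```

   The same bytes yield different field values under the two conventions, 
which is why a change in the default escape handling can silently change query 
results.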
   
   Otherwise, please do point me to the errors and I'll track down the issues. 




[GitHub] [drill] arina-ielchiieva commented on issue #1806: DRILL-7292: Remove V1 and V2 text readers

2019-06-13 Thread GitBox
arina-ielchiieva commented on issue #1806: DRILL-7292: Remove V1 and V2 text 
readers
URL: https://github.com/apache/drill/pull/1806#issuecomment-501769723
 
 
   Issue with protobufs is a bug, will be fixed in 
https://issues.apache.org/jira/browse/DRILL-7294.




[GitHub] [drill] arina-ielchiieva commented on issue #1808: DRILL-7294: Prevent generating java beans using protostuff to avoid overriding classes with the same simple name declared as nested in the p

2019-06-13 Thread GitBox
arina-ielchiieva commented on issue #1808: DRILL-7294: Prevent generating java 
beans using protostuff to avoid overriding classes with the same simple name 
declared as nested in the proto files
URL: https://github.com/apache/drill/pull/1808#issuecomment-501764354
 
 
   +1, thanks for fixing this issue!




[GitHub] [drill] vvysotskyi opened a new pull request #1808: DRILL-7294: Prevent generating java beans using protostuff to avoid overriding classes with the same simple name declared as nested in the

2019-06-13 Thread GitBox
vvysotskyi opened a new pull request #1808: DRILL-7294: Prevent generating java 
beans using protostuff to avoid overriding classes with the same simple name 
declared as nested in the proto files
URL: https://github.com/apache/drill/pull/1808
 
 
   This PR contains two commits:
   - the first commit contains changes produced manually:
   -- removed generating beans using protostuff;
   -- updated protostuff version;
   -- replaced usage of beans generated with protostuff
   - the second commit contains changes caused by regenerating protobufs.
   
   For problem description, please see 
[DRILL-7294](https://issues.apache.org/jira/browse/DRILL-7294).




[jira] [Created] (DRILL-7294) Prevent generating java beans using protostuff to avoid overriding classes with the same simple name declared as nested in the proto files

2019-06-13 Thread Volodymyr Vysotskyi (JIRA)
Volodymyr Vysotskyi created DRILL-7294:
--

 Summary: Prevent generating java beans using protostuff to avoid 
overriding classes with the same simple name declared as nested in the proto 
files
 Key: DRILL-7294
 URL: https://issues.apache.org/jira/browse/DRILL-7294
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Volodymyr Vysotskyi
Assignee: Volodymyr Vysotskyi
 Fix For: 1.17.0


Currently, {{protostuff-maven-plugin}} generates java-bean classes from proto 
files. But these classes are already generated by protobuf; the only 
differences are that they are placed in a different package and preserve the 
nesting of the classes as declared in the proto files.

protostuff creates a new top-level file for each nested class, which causes 
problems when several nested classes have the same name: they override each 
other. For example, here is a Travis failure caused by this problem: 
https://travis-ci.org/apache/drill/jobs/545013395
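
A standalone sketch of the underlying Java issue (class names loosely modeled 
on Drill's protos, not the actual generated code):

```java
// Two outer classes, each with a nested class of the same simple name.
// As nested classes their binary names differ, so both coexist. A code
// generator that flattens each nested class into its own top-level
// .java file would emit "QueryContextInformation.java" twice, and the
// second file overwrites the first.
public class NestedNameDemo {

  static class BitControl {
    static class QueryContextInformation { }
  }

  static class ExecProtos {
    static class QueryContextInformation { }
  }

  public static void main(String[] args) {
    String a = BitControl.QueryContextInformation.class.getName();
    String b = ExecProtos.QueryContextInformation.class.getName();
    System.out.println(a);           // ...BitControl$QueryContextInformation
    System.out.println(b);           // ...ExecProtos$QueryContextInformation
    System.out.println(a.equals(b)); // false: distinct binary names
  }
}
```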



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] [drill] arina-ielchiieva commented on issue #1806: DRILL-7292: Remove V1 and V2 text readers

2019-06-13 Thread GitBox
arina-ielchiieva commented on issue #1806: DRILL-7292: Remove V1 and V2 text 
readers
URL: https://github.com/apache/drill/pull/1806#issuecomment-501675713
 
 
   Ran the unit tests, Functional and Advanced; there are four expected 
failures in the Functional tests (asked @agozhiy to take a look).
   Regarding the protobuf job failures, @vvysotskyi will take a look. Most 
likely they are not related to your changes.




[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293323024
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogBatchReader.java
 ##
 @@ -0,0 +1,210 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.log;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
+import org.apache.drill.common.exceptions.UserException;
+import 
org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
+import org.apache.drill.exec.physical.rowSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.vector.accessor.ScalarWriter;
+import org.apache.drill.shaded.guava.com.google.common.base.Charsets;
+import org.apache.hadoop.mapred.FileSplit;
+
+public class LogBatchReader implements ManagedReader {
+
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(LogBatchReader.class);
 
 Review comment:
   No need to use full imports.




[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293324471
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/log/README.md
 ##
 @@ -11,26 +18,50 @@ If you wanted to analyze log files such as the MySQL log 
sample shown below usin
 070917 16:29:01  21 Query   select * from location
 070917 16:29:12  21 Query   select * from location where id = 1 LIMIT 1
 ```
-This plugin will allow you to configure Drill to directly query logfiles of 
any configuration.
+
+Using this plugin, you can configure Drill to directly query log files of
+any configuration.
 
 ## Configuration Options
-* **`type`**:  This tells Drill which extension to use.  In this case, it must 
be `logRegex`.  This field is mandatory.
-* **`regex`**:  This is the regular expression which defines how the log file 
lines will be split.  You must enclose the parts of the regex in grouping 
parentheses that you wish to extract.  Note that this plugin uses Java regular 
expressions and requires that shortcuts such as `\d` have an additional slash:  
ie `\\d`.  This field is mandatory.
-* **`extension`**:  This option tells Drill which file extensions should be 
mapped to this configuration.  Note that you can have multiple configurations 
of this plugin to allow you to query various log files.  This field is 
mandatory.
-* **`maxErrors`**:  Log files can be inconsistent and messy.  The `maxErrors` 
variable allows you to set how many errors the reader will ignore before 
halting execution and throwing an error.  Defaults to 10.
-* **`schema`**:  The `schema` field is where you define the structure of the 
log file.  This section is optional.  If you do not define a schema, all fields 
will be assigned a column name of `field_n` where `n` is the index of the 
field. The undefined fields will be assigned a default data type of `VARCHAR`.
+
+* **`type`**:  This tells Drill which extension to use.  In this case, it must
 
 Review comment:
   ```suggestion
   * **`type`**: This tells Drill which extension to use. In this case, it must
   ```




[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293324364
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/log/README.md
 ##
 @@ -1,8 +1,15 @@
 # Drill Regex/Logfile Plugin
-Plugin for Apache Drill that allows Drill to read and query arbitrary files 
where the schema can be defined by a regex.  The original intent was for this 
to be used for log files, however, it can be used for any structured data.
+
+Plugin for Apache Drill that allows Drill to read and query arbitrary files
+where the schema can be defined by a regex.  The original intent was for this
 
 Review comment:
   ```suggestion
   where the schema can be defined by a regex. The original intent was for this
   ```




[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293326614
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogBatchReader.java
 ##
 @@ -0,0 +1,210 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.log;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
+import org.apache.drill.common.exceptions.UserException;
+import 
org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
+import org.apache.drill.exec.physical.rowSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.vector.accessor.ScalarWriter;
+import org.apache.drill.shaded.guava.com.google.common.base.Charsets;
+import org.apache.hadoop.mapred.FileSplit;
+
+public class LogBatchReader implements ManagedReader {
+
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(LogBatchReader.class);
+  public static final String RAW_LINE_COL_NAME = "_raw";
+  public static final String UNMATCHED_LINE_COL_NAME = "_unmatched_rows";
+
+  private FileSplit split;
+  private final LogFormatConfig formatConfig;
+  private final Pattern pattern;
+  private final TupleMetadata schema;
+  private BufferedReader reader;
+  private int capturingGroups;
+  private ResultSetLoader loader;
+  private ScalarWriter rawColWriter;
+  private ScalarWriter unmatchedColWriter;
+  private boolean saveMatchedRows;
+  private int maxErrors;
+  private int lineNumber;
+  private int errorCount;
+
+  public LogBatchReader(LogFormatConfig formatConfig, Pattern pattern, 
TupleMetadata schema) {
+this.formatConfig = formatConfig;
+this.maxErrors = Math.max(0, formatConfig.getMaxErrors());
+this.pattern = pattern;
+this.schema = schema;
+  }
+
+  @Override
+  public boolean open(FileSchemaNegotiator negotiator) {
+split = negotiator.split();
+setupPattern();
+negotiator.setTableSchema(schema, true);
+loader = negotiator.build();
+bindColumns(loader.writer());
+openFile(negotiator);
+return true;
+  }
+
+  private void setupPattern() {
+try {
+  Matcher m = pattern.matcher("test");
+  capturingGroups = m.groupCount();
+} catch (PatternSyntaxException e) {
+  throw UserException
+  .validationError(e)
+  .message("Failed to parse regex: \"%s\"", formatConfig.getRegex())
+  .build(logger);
+}
+  }
+
+  private void bindColumns(RowSetLoader writer) {
+for (int i = 0; i < capturingGroups; i++) {
+  saveMatchedRows |= writer.scalar(i).isProjected();
+}
+rawColWriter = writer.scalar(RAW_LINE_COL_NAME);
+saveMatchedRows |= rawColWriter.isProjected();
+unmatchedColWriter = writer.scalar(UNMATCHED_LINE_COL_NAME);
+
+// If no match-case columns are projected, and the unmatched
+// column is unprojected, then we want to count (matched)
+// rows.
+
+saveMatchedRows |= !unmatchedColWriter.isProjected();
+  }
+
+  private void openFile(FileSchemaNegotiator negotiator) {
+InputStream in;
+try {
+  in = negotiator.fileSystem().open(split.getPath());
+} catch (Exception e) {
+  throw UserException
+  .dataReadError(e)
+  .message("Failed to open input file: %s", split.getPath())
+  .addContext("User name", negotiator.userName())
+  .build(logger);
+}
+reader = new BufferedReader(new InputStreamReader(in, Charsets.UTF_8));
+  }
+
+  @Override
+  public boolean next() {
+RowSetLoader rowWriter = loader.writer();
+while (! rowWriter.isFull()) {
+  if (! nextLine(rowWriter)) {
+return false;
+  }
+}
+return true;
+  }
+
+  private boolean nextLine(RowSetLoader rowWriter) {
+S

[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293323517
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogFormatPlugin.java
 ##
 @@ -18,86 +18,224 @@
 
 package org.apache.drill.exec.store.log;
 
-import java.io.IOException;
-import org.apache.drill.exec.planner.common.DrillStatsTable;
-import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
 import org.apache.drill.common.exceptions.ExecutionSetupException;
-import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.exceptions.UserException;
 import org.apache.drill.common.logical.StoragePluginConfig;
-import org.apache.drill.exec.ops.FragmentContext;
-import org.apache.drill.exec.proto.UserBitShared;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import 
org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileReaderFactory;
+import 
org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileScanBuilder;
+import 
org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType;
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+import org.apache.drill.exec.record.metadata.SchemaBuilder;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
 import org.apache.drill.exec.server.DrillbitContext;
-import org.apache.drill.exec.store.RecordReader;
-import org.apache.drill.exec.store.RecordWriter;
-import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.server.options.OptionManager;
 import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
-import org.apache.drill.exec.store.dfs.easy.EasyWriter;
-import org.apache.drill.exec.store.dfs.easy.FileWork;
+import org.apache.drill.exec.store.dfs.easy.EasySubScan;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
 import org.apache.hadoop.conf.Configuration;
 
-import java.util.List;
-import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.fs.Path;
-
 public class LogFormatPlugin extends EasyFormatPlugin {
+  public static final String PLUGIN_NAME = "logRegex";
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(LogFormatPlugin.class);
+
+  private static class LogReaderFactory extends FileReaderFactory {
+private final LogFormatPlugin plugin;
+private final Pattern pattern;
+private final TupleMetadata schema;
+
+public LogReaderFactory(LogFormatPlugin plugin, Pattern pattern, 
TupleMetadata schema) {
+  this.plugin = plugin;
+  this.pattern = pattern;
+  this.schema = schema;
+}
 
-  public static final String DEFAULT_NAME = "logRegex";
-  private final LogFormatConfig formatConfig;
+@Override
+public ManagedReader newReader() {
+   return new LogBatchReader(plugin.getConfig(), pattern, schema);
+}
+  }
 
   public LogFormatPlugin(String name, DrillbitContext context,
  Configuration fsConf, StoragePluginConfig 
storageConfig,
  LogFormatConfig formatConfig) {
-super(name, context, fsConf, storageConfig, formatConfig,
-true,  // readable
-false, // writable
-true, // blockSplittable
-true,  // compressible
-Lists.newArrayList(formatConfig.getExtension()),
-DEFAULT_NAME);
-this.formatConfig = formatConfig;
+super(name, easyConfig(fsConf, formatConfig), context, storageConfig, 
formatConfig);
   }
 
-  @Override
-  public RecordReader getRecordReader(FragmentContext context,
-  DrillFileSystem dfs, FileWork fileWork, 
List columns,
-  String userName) throws 
ExecutionSetupException {
-return new LogRecordReader(context, dfs, fileWork,
-columns, userName, formatConfig);
+  private static EasyFormatConfig easyConfig(Configuration fsConf, 
LogFormatConfig pluginConfig) {
+EasyFormatConfig config = new EasyFormatConfig();
+config.readable = true;
+config.writable = false;
+// Should be block splittable, but the logic is not yet implemented.
+config.blockSplittable = false;
+config.compressible = true;
+config.supportsProjectPushdown = true;
+config.extensions = Lists.newArrayList(pluginConfig.getExtension());
 
 Review comment:
   Is it possible not to use guava list?



[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293327227
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogFormatPlugin.java
 ##
 @@ -18,86 +18,224 @@
 
 package org.apache.drill.exec.store.log;
 
-import java.io.IOException;
-import org.apache.drill.exec.planner.common.DrillStatsTable;
-import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
 import org.apache.drill.common.exceptions.ExecutionSetupException;
-import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.exceptions.UserException;
 import org.apache.drill.common.logical.StoragePluginConfig;
-import org.apache.drill.exec.ops.FragmentContext;
-import org.apache.drill.exec.proto.UserBitShared;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import 
org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileReaderFactory;
+import 
org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileScanBuilder;
+import 
org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType;
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+import org.apache.drill.exec.record.metadata.SchemaBuilder;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
 import org.apache.drill.exec.server.DrillbitContext;
-import org.apache.drill.exec.store.RecordReader;
-import org.apache.drill.exec.store.RecordWriter;
-import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.server.options.OptionManager;
 import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
-import org.apache.drill.exec.store.dfs.easy.EasyWriter;
-import org.apache.drill.exec.store.dfs.easy.FileWork;
+import org.apache.drill.exec.store.dfs.easy.EasySubScan;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
 import org.apache.hadoop.conf.Configuration;
 
-import java.util.List;
-import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.fs.Path;
-
 public class LogFormatPlugin extends EasyFormatPlugin {
+  public static final String PLUGIN_NAME = "logRegex";
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(LogFormatPlugin.class);
+
+  private static class LogReaderFactory extends FileReaderFactory {
+private final LogFormatPlugin plugin;
+private final Pattern pattern;
+private final TupleMetadata schema;
+
+public LogReaderFactory(LogFormatPlugin plugin, Pattern pattern, 
TupleMetadata schema) {
+  this.plugin = plugin;
+  this.pattern = pattern;
+  this.schema = schema;
+}
 
-  public static final String DEFAULT_NAME = "logRegex";
-  private final LogFormatConfig formatConfig;
+@Override
+public ManagedReader newReader() {
+   return new LogBatchReader(plugin.getConfig(), pattern, schema);
+}
+  }
 
   public LogFormatPlugin(String name, DrillbitContext context,
  Configuration fsConf, StoragePluginConfig 
storageConfig,
  LogFormatConfig formatConfig) {
-super(name, context, fsConf, storageConfig, formatConfig,
-true,  // readable
-false, // writable
-true, // blockSplittable
-true,  // compressible
-Lists.newArrayList(formatConfig.getExtension()),
-DEFAULT_NAME);
-this.formatConfig = formatConfig;
+super(name, easyConfig(fsConf, formatConfig), context, storageConfig, 
formatConfig);
   }
 
-  @Override
-  public RecordReader getRecordReader(FragmentContext context,
-  DrillFileSystem dfs, FileWork fileWork, 
List columns,
-  String userName) throws 
ExecutionSetupException {
-return new LogRecordReader(context, dfs, fileWork,
-columns, userName, formatConfig);
+  private static EasyFormatConfig easyConfig(Configuration fsConf, 
LogFormatConfig pluginConfig) {
+EasyFormatConfig config = new EasyFormatConfig();
+config.readable = true;
+config.writable = false;
+// Should be block splittable, but the logic is not yet implemented.
+config.blockSplittable = false;
+config.compressible = true;
+config.supportsProjectPushdown = true;
+config.extensions = Lists.newArrayList(pluginConfig.getExtension());
+config.fsConf = fsConf;
+config.defaultName = PLUGIN_NAME;
+config.readerOperatorType = CoreOperatorType.REGEX_SUB_SCAN_VALUE;
+config.useEnhancedScan = true;
+ 

[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r29332
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogBatchReader.java
 ##
 @@ -0,0 +1,210 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.log;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
+import org.apache.drill.common.exceptions.UserException;
+import 
org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
+import org.apache.drill.exec.physical.rowSet.RowSetLoader;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.vector.accessor.ScalarWriter;
+import org.apache.drill.shaded.guava.com.google.common.base.Charsets;
+import org.apache.hadoop.mapred.FileSplit;
+
+public class LogBatchReader implements ManagedReader<FileSchemaNegotiator> {
+
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(LogBatchReader.class);
+  public static final String RAW_LINE_COL_NAME = "_raw";
+  public static final String UNMATCHED_LINE_COL_NAME = "_unmatched_rows";
+
+  private FileSplit split;
+  private final LogFormatConfig formatConfig;
+  private final Pattern pattern;
+  private final TupleMetadata schema;
+  private BufferedReader reader;
+  private int capturingGroups;
+  private ResultSetLoader loader;
+  private ScalarWriter rawColWriter;
+  private ScalarWriter unmatchedColWriter;
+  private boolean saveMatchedRows;
+  private int maxErrors;
+  private int lineNumber;
+  private int errorCount;
+
+  public LogBatchReader(LogFormatConfig formatConfig, Pattern pattern, 
TupleMetadata schema) {
+this.formatConfig = formatConfig;
+this.maxErrors = Math.max(0, formatConfig.getMaxErrors());
+this.pattern = pattern;
+this.schema = schema;
+  }
+
+  @Override
+  public boolean open(FileSchemaNegotiator negotiator) {
+split = negotiator.split();
+setupPattern();
+negotiator.setTableSchema(schema, true);
+loader = negotiator.build();
+bindColumns(loader.writer());
+openFile(negotiator);
+return true;
+  }
+
+  private void setupPattern() {
+try {
+  Matcher m = pattern.matcher("test");
+  capturingGroups = m.groupCount();
+} catch (PatternSyntaxException e) {
+  throw UserException
+  .validationError(e)
+  .message("Failed to parse regex: \"%s\"", formatConfig.getRegex())
+  .build(logger);
+}
+  }
+
+  private void bindColumns(RowSetLoader writer) {
+for (int i = 0; i < capturingGroups; i++) {
+  saveMatchedRows |= writer.scalar(i).isProjected();
+}
+rawColWriter = writer.scalar(RAW_LINE_COL_NAME);
+saveMatchedRows |= rawColWriter.isProjected();
+unmatchedColWriter = writer.scalar(UNMATCHED_LINE_COL_NAME);
+
+// If no match-case columns are projected, and the unmatched
+// column is unprojected, then we want to count (matched)
+// rows.
+
+saveMatchedRows |= !unmatchedColWriter.isProjected();
+  }
+
+  private void openFile(FileSchemaNegotiator negotiator) {
+InputStream in;
+try {
+  in = negotiator.fileSystem().open(split.getPath());
+} catch (Exception e) {
+  throw UserException
+  .dataReadError(e)
+  .message("Failed to open open input file: %s", split.getPath())
+  .addContext("User name", negotiator.userName())
+  .build(logger);
+}
+reader = new BufferedReader(new InputStreamReader(in, Charsets.UTF_8));
+  }
+
+  @Override
+  public boolean next() {
+RowSetLoader rowWriter = loader.writer();
+while (! rowWriter.isFull()) {
+  if (! nextLine(rowWriter)) {
+return false;
+  }
+}
+return true;
+  }
+
+  private boolean nextLine(RowSetLoader rowWriter) {
+S
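For plugin authors following along, the `next()` method in the `LogBatchReader` diff above captures EVF's core reader contract: write rows until the batch fills (return `true`, more data may remain) or the input runs out (return `false`). A self-contained toy sketch of that control flow, using hypothetical stand-in types rather than Drill's real `RowSetLoader`/`ManagedReader` interfaces:

```java
import java.util.Iterator;
import java.util.List;

// Toy stand-ins for the EVF reader loop; the real Drill interfaces
// (ManagedReader, RowSetLoader) differ, this only shows the control flow.
public class BatchLoopDemo {

  static class ToyRowWriter {
    private final int batchSize;
    private int rowsInBatch;

    ToyRowWriter(int batchSize) { this.batchSize = batchSize; }

    boolean isFull() { return rowsInBatch >= batchSize; }
    void save(String row) { rowsInBatch++; }

    // Report the finished batch size and start a new batch.
    int harvest() { int n = rowsInBatch; rowsInBatch = 0; return n; }
  }

  private final Iterator<String> lines;
  final ToyRowWriter writer;

  BatchLoopDemo(List<String> lines, int batchSize) {
    this.lines = lines.iterator();
    this.writer = new ToyRowWriter(batchSize);
  }

  // Mirrors LogBatchReader.next(): fill the current batch until the
  // writer reports full; return false only when the input is exhausted.
  boolean next() {
    while (!writer.isFull()) {
      if (!lines.hasNext()) {
        return false;
      }
      writer.save(lines.next());
    }
    return true;
  }

  public static void main(String[] args) {
    BatchLoopDemo reader = new BatchLoopDemo(List.of("a", "b", "c", "d", "e"), 2);
    while (reader.next()) {
      System.out.println("full batch: " + reader.writer.harvest());  // 2, then 2
    }
    System.out.println("final batch: " + reader.writer.harvest());   // 1
  }
}
```

Drill's real writer enforces the configured maximum batch size, which is how EVF addresses the batch-sizing problem; the toy writer above simply counts rows.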

[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293325049
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/log/README.md
 ##
 @@ -11,26 +18,50 @@ If you wanted to analyze log files such as the MySQL log 
sample shown below usin
 070917 16:29:01  21 Query   select * from location
 070917 16:29:12  21 Query   select * from location where id = 1 LIMIT 1
 ```
-This plugin will allow you to configure Drill to directly query logfiles of 
any configuration.
+
+Using this plugin, you can configure Drill to directly query log files of
+any configuration.
 
 ## Configuration Options
-* **`type`**:  This tells Drill which extension to use.  In this case, it must 
be `logRegex`.  This field is mandatory.
-* **`regex`**:  This is the regular expression which defines how the log file 
lines will be split.  You must enclose the parts of the regex in grouping 
parentheses that you wish to extract.  Note that this plugin uses Java regular 
expressions and requires that shortcuts such as `\d` have an additional slash:  
ie `\\d`.  This field is mandatory.
-* **`extension`**:  This option tells Drill which file extensions should be 
mapped to this configuration.  Note that you can have multiple configurations 
of this plugin to allow you to query various log files.  This field is 
mandatory.
-* **`maxErrors`**:  Log files can be inconsistent and messy.  The `maxErrors` 
variable allows you to set how many errors the reader will ignore before 
halting execution and throwing an error.  Defaults to 10.
-* **`schema`**:  The `schema` field is where you define the structure of the 
log file.  This section is optional.  If you do not define a schema, all fields 
will be assigned a column name of `field_n` where `n` is the index of the 
field. The undefined fields will be assigned a default data type of `VARCHAR`.
+
+* **`type`**:  This tells Drill which extension to use.  In this case, it must
+be `logRegex`.  This field is mandatory.
+* **`regex`**:  This is the regular expression which defines how the log file
+lines will be split.  You must enclose the parts of the regex in grouping
 
 Review comment:
   Looks like everywhere in the doc there are two spaces before sentences 
instead of one. Could you please check and fix?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
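As a concrete illustration of the `regex` option described in the README hunk above: each capturing group in the (Java-flavored) pattern becomes one extracted column, and backslash shortcuts such as `\d` need the extra escape only in the JSON config, not in Java source. A plain-JDK sketch against the MySQL log sample from the README (the pattern here is illustrative, not the plugin's shipped default):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogRegexDemo {
  public static void main(String[] args) {
    // In the plugin's JSON config this pattern would be written with doubled
    // backslashes ("\\d"); in Java source one escape level suffices.
    Pattern p = Pattern.compile(
        "(\\d{6}) (\\d{2}:\\d{2}:\\d{2})\\s+(\\d+)\\s+(\\w+)\\s+(.*)");

    // Each parenthesized group becomes one extracted column;
    // groupCount() works without matching any input.
    System.out.println(p.matcher("").groupCount());  // prints 5

    String line = "070917 16:29:01  21 Query   select * from location";
    Matcher m = p.matcher(line);
    if (m.matches()) {
      System.out.println(m.group(1));  // prints 070917
      System.out.println(m.group(4));  // prints Query
      System.out.println(m.group(5));  // prints select * from location
    }
  }
}
```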


[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293326946
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogBatchReader.java
 ##
 @@ -0,0 +1,210 @@
+  private void openFile(FileSchemaNegotiator negotiator) {
+InputStream in;
+try {
+  in = negotiator.fileSystem().open(split.getPath());
+} catch (Exception e) {
+  throw UserException
+  .dataReadError(e)
+  .message("Failed to open open input file: %s", split.getPath())
 
 Review comment:
   ```suggestion
 .message("Failed to open input file: %s", split.getPath())
   ```



[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293327844
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/log/TestLogReader.java
 ##
 @@ -389,4 +416,56 @@ public void testRawUMNoSchema() throws RpcException {
 
 RowSetUtilities.verify(expected, results);
   }
+
+  @Test
+  public void testProvidedSchema() throws Exception {
 
 Review comment:
   Could you please add unit tests to check how this format plugin works with 
schema parameter in table function?
   Example: `org.apache.drill.TestSchemaWithTableFunction`
   We might need to check two cases:
   `select * from table(t(schema=>'inline=(col1 varchar)'))`
   `select * from table(t(type=>'logRegex', schema=>'inline=(col1 varchar)'))`




[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293326805
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogBatchReader.java
 ##

[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293327274
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogFormatPlugin.java
 ##
 @@ -18,86 +18,224 @@
 
 package org.apache.drill.exec.store.log;
 
-import java.io.IOException;
-import org.apache.drill.exec.planner.common.DrillStatsTable;
-import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
 import org.apache.drill.common.exceptions.ExecutionSetupException;
-import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.exceptions.UserException;
 import org.apache.drill.common.logical.StoragePluginConfig;
-import org.apache.drill.exec.ops.FragmentContext;
-import org.apache.drill.exec.proto.UserBitShared;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import 
org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileReaderFactory;
+import 
org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileScanBuilder;
+import 
org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType;
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+import org.apache.drill.exec.record.metadata.SchemaBuilder;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
 import org.apache.drill.exec.server.DrillbitContext;
-import org.apache.drill.exec.store.RecordReader;
-import org.apache.drill.exec.store.RecordWriter;
-import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.server.options.OptionManager;
 import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
-import org.apache.drill.exec.store.dfs.easy.EasyWriter;
-import org.apache.drill.exec.store.dfs.easy.FileWork;
+import org.apache.drill.exec.store.dfs.easy.EasySubScan;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
 import org.apache.hadoop.conf.Configuration;
 
-import java.util.List;
-import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.fs.Path;
-
 public class LogFormatPlugin extends EasyFormatPlugin<LogFormatConfig> {
+  public static final String PLUGIN_NAME = "logRegex";
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(LogFormatPlugin.class);
+
+  private static class LogReaderFactory extends FileReaderFactory {
+private final LogFormatPlugin plugin;
+private final Pattern pattern;
+private final TupleMetadata schema;
+
+public LogReaderFactory(LogFormatPlugin plugin, Pattern pattern, 
TupleMetadata schema) {
+  this.plugin = plugin;
+  this.pattern = pattern;
+  this.schema = schema;
+}
 
-  public static final String DEFAULT_NAME = "logRegex";
-  private final LogFormatConfig formatConfig;
+@Override
+public ManagedReader<? extends FileSchemaNegotiator> newReader() {
+   return new LogBatchReader(plugin.getConfig(), pattern, schema);
+}
+  }
 
   public LogFormatPlugin(String name, DrillbitContext context,
  Configuration fsConf, StoragePluginConfig 
storageConfig,
  LogFormatConfig formatConfig) {
-super(name, context, fsConf, storageConfig, formatConfig,
-true,  // readable
-false, // writable
-true, // blockSplittable
-true,  // compressible
-Lists.newArrayList(formatConfig.getExtension()),
-DEFAULT_NAME);
-this.formatConfig = formatConfig;
+super(name, easyConfig(fsConf, formatConfig), context, storageConfig, 
formatConfig);
   }
 
-  @Override
-  public RecordReader getRecordReader(FragmentContext context,
-  DrillFileSystem dfs, FileWork fileWork, 
List<SchemaPath> columns,
-  String userName) throws 
ExecutionSetupException {
-return new LogRecordReader(context, dfs, fileWork,
-columns, userName, formatConfig);
+  private static EasyFormatConfig easyConfig(Configuration fsConf, 
LogFormatConfig pluginConfig) {
+EasyFormatConfig config = new EasyFormatConfig();
+config.readable = true;
+config.writable = false;
+// Should be block splittable, but the logic is not yet implemented.
+config.blockSplittable = false;
+config.compressible = true;
+config.supportsProjectPushdown = true;
+config.extensions = Lists.newArrayList(pluginConfig.getExtension());
+config.fsConf = fsConf;
+config.defaultName = PLUGIN_NAME;
+config.readerOperatorType = CoreOperatorType.REGEX_SUB_SCAN_VALUE;
+config.useEnhancedScan = true;
+ 

[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293324511
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/log/README.md
 ##
 @@ -11,26 +18,50 @@ If you wanted to analyze log files such as the MySQL log 
sample shown below usin
 070917 16:29:01  21 Query   select * from location
 070917 16:29:12  21 Query   select * from location where id = 1 LIMIT 1
 ```
+
+* **`type`**:  This tells Drill which extension to use.  In this case, it must
+be `logRegex`.  This field is mandatory.
 
 Review comment:
   ```suggestion
   be `logRegex`. This field is mandatory.
   ```




[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293323694
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogFormatPlugin.java
 ##

[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: Convert the regex ("log") plugin to use EVF

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1807: DRILL-7293: 
Convert the regex ("log") plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#discussion_r293323284
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/log/LogFormatPlugin.java
 ##
 @@ -18,86 +18,224 @@
 
 package org.apache.drill.exec.store.log;
 
-import java.io.IOException;
-import org.apache.drill.exec.planner.common.DrillStatsTable;
-import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
 import org.apache.drill.common.exceptions.ExecutionSetupException;
-import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.exceptions.UserException;
 import org.apache.drill.common.logical.StoragePluginConfig;
-import org.apache.drill.exec.ops.FragmentContext;
-import org.apache.drill.exec.proto.UserBitShared;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileReaderFactory;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileScanBuilder;
+import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator;
+import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader;
+import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType;
+import org.apache.drill.exec.record.metadata.ColumnMetadata;
+import org.apache.drill.exec.record.metadata.SchemaBuilder;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
 import org.apache.drill.exec.server.DrillbitContext;
-import org.apache.drill.exec.store.RecordReader;
-import org.apache.drill.exec.store.RecordWriter;
-import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.server.options.OptionManager;
 import org.apache.drill.exec.store.dfs.easy.EasyFormatPlugin;
-import org.apache.drill.exec.store.dfs.easy.EasyWriter;
-import org.apache.drill.exec.store.dfs.easy.FileWork;
+import org.apache.drill.exec.store.dfs.easy.EasySubScan;
+import org.apache.drill.shaded.guava.com.google.common.base.Strings;
+import org.apache.drill.shaded.guava.com.google.common.collect.Lists;
 import org.apache.hadoop.conf.Configuration;
 
-import java.util.List;
-import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.fs.Path;
-
public class LogFormatPlugin extends EasyFormatPlugin<LogFormatConfig> {
+  public static final String PLUGIN_NAME = "logRegex";
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(LogFormatPlugin.class);
 
 Review comment:
   Same here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1806: DRILL-7292: Remove V1 and V2 text readers

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1806: DRILL-7292: 
Remove V1 and V2 text readers
URL: https://github.com/apache/drill/pull/1806#discussion_r293320197
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
 ##
 @@ -713,11 +713,26 @@ private ExecConstants() {
   public static final OptionValidator ENABLE_VERBOSE_ERRORS = new BooleanValidator(ENABLE_VERBOSE_ERRORS_KEY,
       new OptionDescription("Toggles verbose output of executable error messages"));
 
+  /**
+   * Key used in earlier versions to select the original ("V1") text reader. Since at least
+   * Drill 1.8, users have used the "compliant" ("V2") version. Deprecated in Drill 1.17;
+   * the "V3" reader with schema support is always used. Retained for backward
+   * compatibility, but does nothing.
+   */
+  @Deprecated
   public static final String ENABLE_NEW_TEXT_READER_KEY = "exec.storage.enable_new_text_reader";
+  @Deprecated
   public static final OptionValidator ENABLE_NEW_TEXT_READER = new BooleanValidator(ENABLE_NEW_TEXT_READER_KEY,
       new OptionDescription("Enables the text reader that complies with the RFC 4180 standard for text/csv files."));
 
 Review comment:
   Please update the option description in `new OptionDescription("Enables the text reader that complies with the RFC 4180 standard for text/csv files."));`, since this is what users see in the Web UI and in the options table.




[GitHub] [drill] arina-ielchiieva commented on a change in pull request #1806: DRILL-7292: Remove V1 and V2 text readers

2019-06-13 Thread GitBox
arina-ielchiieva commented on a change in pull request #1806: DRILL-7292: 
Remove V1 and V2 text readers
URL: https://github.com/apache/drill/pull/1806#discussion_r293320242
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
 ##
 @@ -713,11 +713,26 @@ private ExecConstants() {
   public static final OptionValidator ENABLE_VERBOSE_ERRORS = new BooleanValidator(ENABLE_VERBOSE_ERRORS_KEY,
       new OptionDescription("Toggles verbose output of executable error messages"));
 
+  /**
+   * Key used in earlier versions to select the original ("V1") text reader. Since at least
+   * Drill 1.8, users have used the "compliant" ("V2") version. Deprecated in Drill 1.17;
+   * the "V3" reader with schema support is always used. Retained for backward
+   * compatibility, but does nothing.
+   */
+  @Deprecated
   public static final String ENABLE_NEW_TEXT_READER_KEY = "exec.storage.enable_new_text_reader";
+  @Deprecated
   public static final OptionValidator ENABLE_NEW_TEXT_READER = new BooleanValidator(ENABLE_NEW_TEXT_READER_KEY,
       new OptionDescription("Enables the text reader that complies with the RFC 4180 standard for text/csv files."));
 
+  /**
+   * Flag used in Drill 1.16 to select the row-set based ("V3") or the original
+   * "compliant" ("V2") text reader. In Drill 1.17, the "V3" version is always
+   * used. Retained for backward compatibility, but does nothing.
+   */
+  @Deprecated
   public static final String ENABLE_V3_TEXT_READER_KEY = "exec.storage.enable_v3_text_reader";
+  @Deprecated
   public static final OptionValidator ENABLE_V3_TEXT_READER = new BooleanValidator(ENABLE_V3_TEXT_READER_KEY,
       new OptionDescription("Enables the row set based version of the text/csv reader."));
 
 Review comment:
   Same here.

