[jira] [Commented] (DRILL-6096) Provide mechanisms to specify field delimiters and quoted text for TextRecordWriter
[ https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956671#comment-16956671 ] ASF GitHub Bot commented on DRILL-6096: --- paul-rogers commented on issue #1873: DRILL-6096: Provide mechanism to configure text writer configuration URL: https://github.com/apache/drill/pull/1873#issuecomment-544797392 To answer my other comment about the lack of ease-of-use in the current session options related to file formats: the answer is the schema mechanism you defined a while back. Even if no schema is provided, we should allow the user to override formatting properties using the table options encoded in that schema. Then, I can write a file using, say, pipe delimiters, have that be recorded, and read the file back automatically using those delimiters. Probably work to be done, but it would be a nice solution that does not require HMS. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Provide mechanisms to specify field delimiters and quoted text for > TextRecordWriter > --- > > Key: DRILL-6096 > URL: https://issues.apache.org/jira/browse/DRILL-6096 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.12.0 >Reporter: Kunal Khatua >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.17.0 > > > Currently, there is no way for a user to specify the field delimiter when > writing records as text output. Furthermore, if the fields contain the > delimiter, we have no mechanism for specifying quotes. > By default, quotes should be used to enclose non-numeric fields being written. 
> *Description of the implemented changes:* > 2 options are added to control text writer output: > {{store.text.writer.add_header}} - indicates whether a header should be added to the > created text file. Default is true. > {{store.text.writer.force_quotes}} - indicates whether all values should be quoted. > Default is false, meaning only values that contain special characters (line > / field separators) will be quoted. > Line / field separators and quote / escape characters can be configured in the text > format configuration via the Web UI. A user can create a special format just for > writing data and then use it when creating files; such a format can always be used > to read back the written data. > {noformat} > "formats": { > "write_text": { > "type": "text", > "extensions": [ > "txt" > ], > "lineDelimiter": "\n", > "fieldDelimiter": "!", > "quote": "^", > "escape": "^" > } > }, > ... > {noformat} > Next, set the specified format and create a text file: > {noformat} > alter session set `store.format` = 'write_text'; > create table dfs.tmp.t as select 1 as id from (values(1)); > {noformat} > Notes: > 1. univocity-parsers are used to write the data; they limit the line separator > length to at most 2 characters. Drill allows setting a line separator longer than > 2 characters, since Drill can read data split by a line separator of any length, > but an exception will be thrown during data write. > 2. {{extractHeader}} in the text format configuration does not affect whether a > header is written to the text file; only {{store.text.writer.add_header}} controls > this. {{extractHeader}} is used only when reading data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7352) Introduce new checkstyle rules to make code style more consistent
[ https://issues.apache.org/jira/browse/DRILL-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956669#comment-16956669 ] Paul Rogers commented on DRILL-7352: Start with the [existing set of rules|http://drill.apache.org/docs/apache-drill-contribution-guidelines/]. * Import order. Typical order: `java`, `javax`, `org`, `com`. Static imports at the top. * Use `final` aggressively on fields; do not use it on local variables or parameters. Once decisions are finalized, update the format files for Eclipse and IntelliJ. > Introduce new checkstyle rules to make code style more consistent > - > > Key: DRILL-7352 > URL: https://issues.apache.org/jira/browse/DRILL-7352 > Project: Apache Drill > Issue Type: Task >Reporter: Vova Vysotskyi >Priority: Major > Fix For: 1.17.0 > > > Source - https://checkstyle.sourceforge.io/checks.html > List of rules to be enabled: > * [LeftCurly|https://checkstyle.sourceforge.io/config_blocks.html#LeftCurly] > - force placement of a left curly brace at the end of the line. 
> * > [RightCurly|https://checkstyle.sourceforge.io/config_blocks.html#RightCurly] > - force placement of a right curly brace > * > [NewlineAtEndOfFile|https://checkstyle.sourceforge.io/config_misc.html#NewlineAtEndOfFile] > * > [UnnecessaryParentheses|https://checkstyle.sourceforge.io/config_coding.html#UnnecessaryParentheses] > * > [MethodParamPad|https://checkstyle.sourceforge.io/config_whitespace.html#MethodParamPad] > * [InnerTypeLast|https://checkstyle.sourceforge.io/config_design.html#InnerTypeLast] > * > [MissingOverride|https://checkstyle.sourceforge.io/config_annotation.html#MissingOverride] > * > [InvalidJavadocPosition|https://checkstyle.sourceforge.io/config_javadoc.html#InvalidJavadocPosition] > * > [ArrayTypeStyle|https://checkstyle.sourceforge.io/config_misc.html#ArrayTypeStyle] > * [UpperEll|https://checkstyle.sourceforge.io/config_misc.html#UpperEll] > and others
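Several of the rules under discussion can be seen together in one short, purely illustrative class (hypothetical names, not from the Drill codebase). It follows the proposed `java`/`javax`/`org`/`com` import order, uses `final` on fields but not on locals or parameters, and satisfies LeftCurly, MissingOverride, ArrayTypeStyle, and UpperEll:

```java
// Illustrative sketch only. Import order: java, javax, org, com;
// static imports at the top (none needed here).
import java.util.ArrayList;
import java.util.List;

public class StyleExample {

  // `final` used aggressively on fields, per the proposal.
  private final List<String> names = new ArrayList<>();

  // LeftCurly: brace at the end of the line.
  // MissingOverride: the annotation is required on overriding methods.
  @Override
  public String toString() {
    return String.join(",", names);
  }

  // ArrayTypeStyle: `String[] items`, never `String items[]`.
  public long count(String[] items) {
    // UpperEll: long literals use `L`, not lowercase `l`.
    long total = 0L;
    // No `final` on locals or parameters, per the proposal.
    for (String item : items) {
      names.add(item);
      total++;
    }
    return total;
  }
}
```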
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956659#comment-16956659 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337321627 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FormatSelection.java ## @@ -63,6 +60,6 @@ public FileSelection getSelection(){ @JsonIgnore public boolean supportDirPruning() { Review comment: As above, `support` --> `supports`. It is safe because this value is not serialized. > Drill should support .zip compression > - > > Key: DRILL-5674 > URL: https://issues.apache.org/jira/browse/DRILL-5674 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting > Fix For: 1.17.0 > > > Zip is a very common compression format. Create a compressed CSV file with > column headers: data.csv.zip. > Define a storage plugin config for the file, call it "dfs.myws", set > delimiter = ",", extract header = true, skip header = false. > Run a simple query: > SELECT * FROM dfs.myws.`data.csv.zip` > The result is garbage as the CSV reader is trying to parse Zipped data as if > it were text. > DRILL-5506 asks how to do this; the responder said to add a library to the > path. Better would be to simply support zip out-of-the-box as a default > format.
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956661#comment-16956661 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337322883 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcapng/PcapngFormatPlugin.java ## @@ -47,7 +47,7 @@ public PcapngFormatPlugin(String name, DrillbitContext context, Configuration fs public PcapngFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, PcapngFormatConfig formatPluginConfig) { super(name, context, fsConf, config, formatPluginConfig, true, -false, true, false, +false, true, true, Review comment: Isn't the middle `true` wrong? It is for `blockSplittable`, which means we'll start reading at an arbitrary block boundary. Since this is a binary format, it is not clear that we can scan forward to the beginning of the next record as can be done in Sequence File and (restricted) CSV. This also creates an issue: the block-splittable attribute is currently a constant, but if a file is zip-encoded it is never block splittable, since Zip files cannot be read at an arbitrary offset. Any way to handle this fact? And any way to test this behaviour?
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956657#comment-16956657 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337321291 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java ## @@ -386,17 +387,16 @@ public static void checkBackPaths(String parent, String combinedPath, String sub Preconditions.checkArgument(!combinedPath.isEmpty(), "Empty path (" + combinedPath + "( in file selection path."); if (!combinedPath.startsWith(parent)) { - StringBuilder msg = new StringBuilder(); - msg.append("Invalid path : ").append(subpath).append(" takes you outside the workspace."); - throw new IllegalArgumentException(msg.toString()); + throw new IllegalArgumentException( +String.format("Invalid path [%s] takes you outside the workspace.", subpath)); } } public List getFileStatuses() { return statuses; } - public boolean supportDirPrunig() { + public boolean supportDirPruning() { Review comment: Good catch. How about `supportsDirPruning` (with an s)? The `support` form is imperative; it tells this object to support dir pruning. The `supports` form asks whether this object does or does not support dir pruning.
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956662#comment-16956662 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337322300 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java ## @@ -59,33 +51,44 @@ import org.apache.drill.exec.store.dfs.FormatSelection; import org.apache.drill.exec.store.dfs.MagicString; import org.apache.drill.exec.store.dfs.MetadataContext; -import org.apache.drill.exec.store.mock.MockStorageEngine; import org.apache.drill.exec.store.parquet.metadata.Metadata; import org.apache.drill.exec.store.parquet.metadata.ParquetTableMetadataDirs; import org.apache.drill.exec.util.DrillFileSystemUtil; import org.apache.drill.shaded.guava.com.google.common.base.Stopwatch; import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableSet; -import org.apache.drill.shaded.guava.com.google.common.collect.Lists; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FSDataInputStream; import org.apache.hadoop.fs.FileStatus; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.parquet.format.converter.ParquetMetadataConverter; import org.apache.parquet.hadoop.ParquetFileWriter; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.io.InputStream; +import java.util.Arrays; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.TimeUnit; +import java.util.regex.Pattern; Review comment: Maybe change your IDE import order to put java above org? That way, there won't be constant import shuffling each time your IDE touches a file. 
(Yes, we should decide on a preferred order and document it somewhere...)
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956660#comment-16956660 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337321551 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSystemPlugin.java ## @@ -57,7 +61,9 @@ */ public class FileSystemPlugin extends AbstractStoragePlugin { - private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(FileSystemPlugin.class); + private static final Logger logger = LoggerFactory.getLogger(FileSystemPlugin.class); + + private static final List BUILT_IN_CODECS = Collections.singletonList(ZipCodec.class.getCanonicalName()); Review comment: Are no other codecs provided "out of the box"? For the others, do I need to provide a jar and set a config option? Or should we move the other built-in ones here, out of the config file?
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956664#comment-16956664 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337321942 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/ZipCodec.java ## @@ -0,0 +1,141 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.store.dfs; + +import org.apache.hadoop.io.compress.CompressionInputStream; +import org.apache.hadoop.io.compress.CompressionOutputStream; +import org.apache.hadoop.io.compress.DefaultCodec; + +import java.io.IOException; +import java.io.InputStream; +import java.io.OutputStream; +import java.util.zip.ZipEntry; +import java.util.zip.ZipInputStream; +import java.util.zip.ZipOutputStream; + +/** + * ZIP codec implementation which can read or create a single entry. + * + * Note: Do not rename this class. Class naming must be 'ZipCodec' so it can be mapped by + * {@link org.apache.hadoop.io.compress.CompressionCodecFactory} to the 'zip' extension. 
+ */ +public class ZipCodec extends DefaultCodec { + + private static final String EXTENSION = ".zip"; + + @Override + public CompressionOutputStream createOutputStream(OutputStream out) throws IOException { +return new ZipCompressionOutputStream(new ResetableZipOutputStream(out)); + } + + @Override + public CompressionInputStream createInputStream(InputStream in) throws IOException { +return new ZipCompressionInputStream(new ZipInputStream(in)); + } + + @Override + public String getDefaultExtension() { +return EXTENSION; + } + + /** + * Reads only first entry from {@link ZipInputStream}, + * other entries if present will be ignored. + */ + private static class ZipCompressionInputStream extends CompressionInputStream { + +ZipCompressionInputStream(ZipInputStream in) throws IOException { + super(in); + // positions stream at the beginning of the first entry data + in.getNextEntry(); +} + +@Override +public int read() throws IOException { + return in.read(); +} + +@Override +public int read(byte[] b, int off, int len) throws IOException { + return in.read(b, off, len); +} + +@Override +public void resetState() throws IOException { + in.reset(); +} + +@Override +public void close() throws IOException { + try { +((ZipInputStream) in).closeEntry(); + } finally { +super.close(); + } +} + } + + /** + * Extends {@link ZipOutputStream} to allow resetting compressor stream, + * required by {@link CompressionOutputStream} implementation. + */ + private static class ResetableZipOutputStream extends ZipOutputStream { + +ResetableZipOutputStream(OutputStream out) { + super(out); +} + +void resetState() { + def.reset(); +} + } + + /** + * Writes given data into ZIP archive by placing all data in one entry with default naming. 
+ */ + private static class ZipCompressionOutputStream extends CompressionOutputStream { + +private static final String DEFAULT_ENTRY_NAME = "entry.out"; Review comment: Should the entry name be the same as the file name so it is sensible if someone unzips the file?
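The single-entry behavior under review can be sketched with plain `java.util.zip`, without Hadoop's codec interfaces (class, method, and entry names here are illustrative, not Drill's): all data is written as one entry, and on read only the first entry is consumed, mirroring what `ZipCodec`'s streams do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipRoundTrip {

  // Write all data as a single entry, as ZipCodec's output stream does.
  public static byte[] zip(String entryName, byte[] data) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ZipOutputStream zos = new ZipOutputStream(bos)) {
      zos.putNextEntry(new ZipEntry(entryName));
      zos.write(data);
      zos.closeEntry();
    }
    return bos.toByteArray();
  }

  // Read only the first entry; any further entries are ignored,
  // as in ZipCodec's input stream.
  public static byte[] unzipFirstEntry(byte[] zipped) throws IOException {
    try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipped))) {
      zis.getNextEntry(); // position the stream at the first entry's data
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      int n;
      while ((n = zis.read(buf)) != -1) {
        bos.write(buf, 0, n);
      }
      return bos.toByteArray();
    }
  }
}
```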
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956663#comment-16956663 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337322969 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcapng/package-info.java ## @@ -16,7 +16,7 @@ * limitations under the License. */ /** - * For comments on realization of this format plugin look at : + * For comments on implementation of this format plugin look at: Review comment: "look at" --> "see"
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956658#comment-16956658 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337321743 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/ZipCodec.java ## @@ -0,0 +1,141 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.store.dfs; + +import org.apache.hadoop.io.compress.CompressionInputStream; +import org.apache.hadoop.io.compress.CompressionOutputStream; +import org.apache.hadoop.io.compress.DefaultCodec; + +import java.io.IOException; +import java.io.InputStream; +import java.io.OutputStream; +import java.util.zip.ZipEntry; +import java.util.zip.ZipInputStream; +import java.util.zip.ZipOutputStream; + +/** + * ZIP codec implementation which can read or create a single entry. + * + * Note: Do not rename this class. Class naming must be 'ZipCodec' so it can be mapped by + * {@link org.apache.hadoop.io.compress.CompressionCodecFactory} to the 'zip' extension. + */ +public class ZipCodec extends DefaultCodec { + + private static final String EXTENSION = ".zip"; Review comment: Any need to support gzip (`.gz`) or tar/gzip (`.tar.gz`)?
[jira] [Commented] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956646#comment-16956646 ] ASF GitHub Bot commented on DRILL-7414: --- paul-rogers commented on pull request #1878: DRILL-7414: EVF incorrectly sets buffer writer index after rollover URL: https://github.com/apache/drill/pull/1878#discussion_r337311938 ## File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/resultSet/impl/TestResultSetLoaderOverflow.java ## @@ -59,55 +61,64 @@ @Test public void testVectorSizeLimit() { -TupleMetadata schema = new SchemaBuilder() +final TupleMetadata schema = new SchemaBuilder() Review comment: Sorry, too much functional programming with Scala in my "real job"; I have gotten used to marking variables `val` rather than `var`, and `final` is the Java equivalent. But since Drill does not normally use this convention, I removed the unneeded `final` keywords. You are right; if there is a performance benefit, the compiler will figure out that the variable is never modified. > EVF incorrectly sets buffer writer index after rollover > --- > > Key: DRILL-7414 > URL: https://issues.apache.org/jira/browse/DRILL-7414 > Project: Apache Drill > Issue Type: Bug >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > A full test run, with vector validation enabled and with the "new" scan > enabled, revealed the following in {{TestMockPlugin.testSizeLimit()}}: > {noformat} > comments_s2 - VarCharVector: Row count = 838, but value count = 839 > {noformat} > Adding vector validation to the result set loader overflow tests reveals that > the problem is in overflow. 
In > {{TestResultSetLoaderOverflow.testOverflowWithNullables()}}: > {noformat} > a - RepeatedIntVector: Row count = 2952, but value count = 2953 > b - RepeatedVarCharVector: Row count = 2952, but value count = 2953 > b - RepeatedVarCharVector: Vector has 2953 values, but offset vector labels > 32472 values > c - RepeatedIntVector: Row count = 2952, but value count = 2953 > d - RepeatedIntVector: Row count = 2952, but value count = 2953 > {noformat} > The problem is that EVF incorrectly sets the offset buffer writer index after > a rollover.
[jira] [Commented] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956647#comment-16956647 ] ASF GitHub Bot commented on DRILL-7414: --- paul-rogers commented on pull request #1878: DRILL-7414: EVF incorrectly sets buffer writer index after rollover URL: https://github.com/apache/drill/pull/1878#discussion_r337312721 ## File path: exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/OffsetVectorWriterImpl.java ## @@ -290,7 +290,7 @@ public void preRollover() { // rows. But, this being an offset vector, we add one to account // for the extra 0 value at the start. -setValueCount(vectorIndex.rowStartIndex() + 1); +setValueCount(vectorIndex.rowStartIndex()); Review comment: Updated the comment as it was too subtle. The row start index is already at the proper index, it points past the last valid value (it points to where we'd add the next value, if we had one.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > EVF incorrectly sets buffer writer index after rollover > --- > > Key: DRILL-7414 > URL: https://issues.apache.org/jira/browse/DRILL-7414 > Project: Apache Drill > Issue Type: Bug >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > A full test run, with vector validation enabled and with the "new" scan > enabled, revealed the following in {{TestMockPlugin.testSizeLimit()}}: > {noformat} > comments_s2 - VarCharVector: Row count = 838, but value count = 839 > {noformat} > Adding vector validation to the result set loader overflow tests reveals that > the problem is in overflow. 
In > {{TestResultSetLoaderOverflow.testOverflowWithNullables()}}: > {noformat} > a - RepeatedIntVector: Row count = 2952, but value count = 2953 > b - RepeatedVarCharVector: Row count = 2952, but value count = 2953 > b - RepeatedVarCharVector: Vector has 2953 values, but offset vector labels > 32472 values > c - RepeatedIntVector: Row count = 2952, but value count = 2953 > d - RepeatedIntVector: Row count = 2952, but value count = 2953 > {noformat} > The problem is that EVF incorrectly sets the offset buffer writer index after > a rollover. -- This message was sent by Atlassian Jira (v8.3.4#803005)
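The one-character fix above hinges on the offset-vector invariant: a vector holding N variable-width values stores N + 1 offsets (an extra leading 0), and after rollover the row start index already points one past the last valid entry, so no "+ 1" adjustment is needed. A minimal standalone sketch of that invariant (plain Java, not Drill's actual vector classes; names are illustrative):

```java
import java.util.Arrays;

// Hypothetical stand-in for an offset vector: for N variable-width values
// it stores N + 1 offsets, with entry 0 always 0 and entry N marking the
// end of the last value.
public class OffsetSketch {

    static int[] offsetsFor(String... values) {
        int[] offsets = new int[values.length + 1];
        offsets[0] = 0;  // the extra leading zero
        for (int i = 0; i < values.length; i++) {
            // Each entry is the running end position of the i-th value.
            offsets[i + 1] = offsets[i] + values[i].length();
        }
        return offsets;
    }

    public static void main(String[] args) {
        int[] offsets = offsetsFor("a", "bb", "ccc");
        // 3 values -> 4 offset entries. The next write position after the
        // last value is index values.length + 1 in the offset array, i.e.
        // a "row start" of N already points past the last valid value.
        System.out.println(Arrays.toString(offsets));  // [0, 1, 3, 6]
    }
}
```

The length of value i can be recovered as `offsets[i + 1] - offsets[i]`, which is why the value count must be exactly one less than the offset count — the mismatch the validator reported above.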
[jira] [Commented] (DRILL-7403) Validate batch checks, vector integrity in unit tests
[ https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956609#comment-16956609 ] ASF GitHub Bot commented on DRILL-7403: --- paul-rogers commented on issue #1871: DRILL-7403: Validate batch checks, vector integretity in unit tests URL: https://github.com/apache/drill/pull/1871#issuecomment-544779743 Thanks much for the review! Made requested changes. Rebased on master. Squashed commits. Once this is merged, I'll update the two new PRs to eliminate the commits duplicated with this one. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Validate batch checks, vector integretity in unit tests > --- > > Key: DRILL-7403 > URL: https://issues.apache.org/jira/browse/DRILL-7403 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.16.0, 1.17.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > Drill provides a {{BatchValidator}} that checks vectors. It is disabled by > default. This enhancement adds more checks, including checks for row counts > (of which there are surprisingly many.) > Since most operators will fail if the check is enabled, this enhancement also > adds a table to keep track of which operators pass the checks (and for which > checks should be enabled) and those that still need work. This allows the > checks to exist in the code, and to be enabled incrementally as we fix the > various problems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7403) Validate batch checks, vector integrity in unit tests
[ https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956608#comment-16956608 ] ASF GitHub Bot commented on DRILL-7403: --- paul-rogers commented on issue #1871: DRILL-7403: Validate batch checks, vector integretity in unit tests URL: https://github.com/apache/drill/pull/1871#issuecomment-544779743 Made requested changes. Rebased on master. Squashed commits. Once this is merged, I'll update the two new PRs to eliminate the commits duplicated with this one. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Validate batch checks, vector integretity in unit tests > --- > > Key: DRILL-7403 > URL: https://issues.apache.org/jira/browse/DRILL-7403 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.16.0, 1.17.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > Drill provides a {{BatchValidator}} that checks vectors. It is disabled by > default. This enhancement adds more checks, including checks for row counts > (of which there are surprisingly many.) > Since most operators will fail if the check is enabled, this enhancement also > adds a table to keep track of which operators pass the checks (and for which > checks should be enabled) and those that still need work. This allows the > checks to exist in the code, and to be enabled incrementally as we fix the > various problems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7403) Validate batch checks, vector integrity in unit tests
[ https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956602#comment-16956602 ] ASF GitHub Bot commented on DRILL-7403: --- paul-rogers commented on pull request #1871: DRILL-7403: Validate batch checks, vector integretity in unit tests URL: https://github.com/apache/drill/pull/1871#discussion_r337308769 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/validate/BatchValidator.java ## @@ -150,57 +424,63 @@ private int validateOffsetVector(String name, UInt4Vector offsetVector, int valu error(name, offsetVector, "Offset (0) must be 0 but was " + prevOffset); } -// Note <= comparison: offset vectors have (n+1) entries. - -for (int i = 1; i <= valueCount; i++) { - int offset = accessor.get(i); +for (int i = 1; i < offsetCount; i++) { + final int offset = accessor.get(i); if (offset < prevOffset) { -error(name, offsetVector, "Decreasing offsets at (" + (i-1) + ", " + i + ") = (" + prevOffset + ", " + offset + ")"); +error(name, offsetVector, String.format( +"Offset vector [%d] contained %d, expected >= %d", +i, offset, prevOffset)); } else if (offset > maxOffset) { -error(name, offsetVector, "Invalid offset at index " + i + " = " + offset + " exceeds maximum of " + maxOffset); +error(name, offsetVector, String.format( +"Invalid offset at index %d: %d exceeds maximum of %d", +i, offset, maxOffset)); } prevOffset = offset; } return prevOffset; } private void error(String name, ValueVector vector, String msg) { -if (errorCount == 0) { - logger.error("Found one or more vector errors from " + batch.getClass().getSimpleName()); -} -errorCount++; -if (errorCount >= MAX_ERRORS) { - return; -} -String fullMsg = "Column " + name + " of type " + vector.getClass().getSimpleName( ) + ": " + msg; -logger.error(fullMsg); -if (errorList != null) { - errorList.add(fullMsg); -} +errorReporter.error(name, vector, msg); } - private void validateNullableVector(String name, 
NullableVector vector) { -// Can't validate at this time because the bits vector is in each -// generated subtype. - -// Validate a VarChar vector because it is common. - -if (vector instanceof NullableVarCharVector) { - VarCharVector values = ((NullableVarCharVector) vector).getValuesVector(); - validateVarCharVector(name + "-values", values, rowCount); + private void verifyIsSetVector(ValueVector parent, UInt1Vector bv) { +final String name = String.format("%s (%s)-bits", +parent.getField().getName(), +parent.getClass().getSimpleName()); +final int rowCount = parent.getAccessor().getValueCount(); +final int bitCount = bv.getAccessor().getValueCount(); +if (bitCount != rowCount) { + error(name, bv, String.format( + "Value count = %d, but bit count = %d", + rowCount, bitCount)); +} +final UInt1Vector.Accessor ba = bv.getAccessor(); +for (int i = 0; i < bitCount; i++) { + final int value = ba.get(i); + if (value != 0 && value != 1) { +error(name, bv, String.format( +"%s %s: bit vector[%d] = %d, expected 0 or 1", +i, value)); + } } - } - - private void validateFixedWidthVector(String name, FixedWidthVector vector) { -// TODO Auto-generated method stub - } /** - * Obtain the list of errors. For use in unit-testing this class. - * @return the list of errors found, or null if error capture was - * not enabled + * Print a record batch. Uses code only available in a test build. + * Classes are not visible to the compiler; must load dynamically. + * Does nothing if the class is not available. */ - public List errors() { return errorList; } + public static void print(RecordBatch batch) { +try { + final Class helper = Class.forName("org.apache.drill.test.rowSet.RowSetUtilities"); Review comment: Removed for now, will add back later (and fix any issues) if/when needed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Validate batch checks, vector integretity in unit tests > --- > > Key: DRILL-7403 > URL: https://issues.apache.org/jira/browse/DRILL-7403 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.16.0, 1.17.0 >Reporter: Paul Rogers >Assignee: Pa
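The patch in the review above tightens three offset-vector checks: the first offset must be 0, offsets must never decrease, and no offset may exceed the data buffer's capacity. A hedged, self-contained sketch of those checks (not Drill's actual `BatchValidator`; error wording follows the patch but the class is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetChecks {

    // Validates an offset array in the spirit of validateOffsetVector():
    // offsets start at 0, are monotonically non-decreasing, and stay
    // within maxOffset. Returns human-readable errors.
    static List<String> validate(int[] offsets, int maxOffset) {
        List<String> errors = new ArrayList<>();
        if (offsets.length > 0 && offsets[0] != 0) {
            errors.add("Offset (0) must be 0 but was " + offsets[0]);
        }
        int prev = 0;
        for (int i = 1; i < offsets.length; i++) {
            int offset = offsets[i];
            if (offset < prev) {
                errors.add(String.format(
                    "Offset vector [%d] contained %d, expected >= %d",
                    i, offset, prev));
            } else if (offset > maxOffset) {
                errors.add(String.format(
                    "Invalid offset at index %d: %d exceeds maximum of %d",
                    i, offset, maxOffset));
            }
            prev = offset;
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validate(new int[] {0, 1, 3, 6}, 6));  // valid: no errors
        System.out.println(validate(new int[] {0, 4, 2}, 6));     // decreasing offset
        System.out.println(validate(new int[] {0, 1, 9}, 6));     // exceeds maximum
    }
}
```

Note the loop bound: iterating over the full offset count (N + 1 entries for N values) rather than the value count is exactly the `<=` vs `<` subtlety the original code commented on.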
[jira] [Closed] (DRILL-7417) Test Task
[ https://issues.apache.org/jira/browse/DRILL-7417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sorabh Hamirwasia closed DRILL-7417. Resolution: Invalid > Test Task > - > > Key: DRILL-7417 > URL: https://issues.apache.org/jira/browse/DRILL-7417 > Project: Apache Drill > Issue Type: Task >Reporter: Sorabh Hamirwasia >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7417) Test Task
[ https://issues.apache.org/jira/browse/DRILL-7417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sorabh Hamirwasia updated DRILL-7417: - Attachment: Test.rtf > Test Task > - > > Key: DRILL-7417 > URL: https://issues.apache.org/jira/browse/DRILL-7417 > Project: Apache Drill > Issue Type: Task >Reporter: Sorabh Hamirwasia >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7417) Test Task
Sorabh Hamirwasia created DRILL-7417: Summary: Test Task Key: DRILL-7417 URL: https://issues.apache.org/jira/browse/DRILL-7417 Project: Apache Drill Issue Type: Task Reporter: Sorabh Hamirwasia -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7417) Test Task
[ https://issues.apache.org/jira/browse/DRILL-7417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sorabh Hamirwasia updated DRILL-7417: - Attachment: (was: Test.rtf) > Test Task > - > > Key: DRILL-7417 > URL: https://issues.apache.org/jira/browse/DRILL-7417 > Project: Apache Drill > Issue Type: Task >Reporter: Sorabh Hamirwasia >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (DRILL-7405) Build fails due to inaccessible apache-drill on S3 storage
[ https://issues.apache.org/jira/browse/DRILL-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Girish closed DRILL-7405. -- > Build fails due to inaccessible apache-drill on S3 storage > -- > > Key: DRILL-7405 > URL: https://issues.apache.org/jira/browse/DRILL-7405 > Project: Apache Drill > Issue Type: Task > Components: Tools, Build & Test >Affects Versions: 1.16.0 >Reporter: Boaz Ben-Zvi >Assignee: Abhishek Girish >Priority: Critical > Labels: ready-to-commit > Fix For: 1.17.0 > > > A new clean build (e.g. after deleting the ~/.m2 local repository) would > fail now due to: > Access denied to: > [http://apache-drill.s3.amazonaws.com|https://urldefense.proofpoint.com/v2/url?u=http-3A__apache-2Ddrill.s3.amazonaws.com_files_sf-2D0.01-5Ftpc-2Dh-5Fparquet-5Ftyped.tgz&d=DwMGaQ&c=C5b8zRQO1miGmBeVZ2LFWg&r=KLC1nKJ8dIOnUay2kR6CAw&m=08mf7Xfn1orlbAA60GKLIuj_PTtfaSAijrKDLOucMPU&s=CX97We3sm3ZZ_aVJIrsUdXVJ3CNMYg7p3IsxbJpuXWk&e=] > > (e.g., for the test data sf-0.01_tpc-h_parquet_typed.tgz ) > A new publicly available storage place is needed, plus appropriate changes in > Drill to get to these resources. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956157#comment-16956157 ] ASF GitHub Bot commented on DRILL-5674: --- arina-ielchiieva commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879 1. Added ZipCodec implementation which can read / write single file. 2. Revisited Drill plugin formats to ensure 'openPossiblyCompressedStream' method is used in those which support compression. 3. Added unit tests. 4. General refactoring. Jira - [DRILL-5674](https://issues.apache.org/jira/browse/DRILL-5674). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Drill should support .zip compression > - > > Key: DRILL-5674 > URL: https://issues.apache.org/jira/browse/DRILL-5674 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting > Fix For: 1.17.0 > > > Zip is a very common compression format. Create a compressed CSV file with > column headers: data.csv.zip. > Define a storage plugin config for the file, call it "dfs.myws", set > delimiter = ",", extract header = true, skip header = false. > Run a simple query: > SELECT * FROM dfs.myws.`data.csv.zip` > The result is garbage as the CSV reader is trying to parse Zipped data as if > it were text. > DRILL-5506 asks how to do this; the responder said to add a library to the > path. Better would be to simply support zip out-of-the-box as a default > format. -- This message was sent by Atlassian Jira (v8.3.4#803005)
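The PR above adds a `ZipCodec` that reads and writes a single file per archive. The Drill implementation itself is not shown here, but the core mechanic with the JDK's `java.util.zip` looks roughly like the following sketch (class and method names are illustrative, not Drill's):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipSingleEntry {

    // Compress one named entry into a zip archive held in memory.
    static byte[] zip(String entryName, byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bos)) {
            zout.putNextEntry(new ZipEntry(entryName));
            zout.write(data);
            zout.closeEntry();
        }
        return bos.toByteArray();
    }

    // Read back the archive's single entry, as a codec hook used by
    // "openPossiblyCompressedStream"-style plumbing might.
    static byte[] unzipSingle(byte[] zipBytes) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            if (zin.getNextEntry() == null) {
                throw new IOException("archive contains no entries");
            }
            // ZipInputStream signals end-of-entry as end-of-stream,
            // so readAllBytes() returns exactly this entry's bytes.
            return zin.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] csv = "id,name\n1,a\n".getBytes();
        byte[] archived = zip("data.csv", csv);
        System.out.print(new String(unzipSingle(archived)));
    }
}
```

This also shows why naive text reading of `data.csv.zip` produced garbage in the original report: without the codec, the reader parses the zip container bytes, not the decompressed entry.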
[jira] [Updated] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-5674: Reviewer: Vova Vysotskyi > Drill should support .zip compression > - > > Key: DRILL-5674 > URL: https://issues.apache.org/jira/browse/DRILL-5674 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting > Fix For: 1.17.0 > > > Zip is a very common compression format. Create a compressed CSV file with > column headers: data.csv.zip. > Define a storage plugin config for the file, call it "dfs.myws", set > delimiter = ",", extract header = true, skip header = false. > Run a simple query: > SELECT * FROM dfs.myws.`data.csv.zip` > The result is garbage as the CSV reader is trying to parse Zipped data as if > it were text. > DRILL-5506 asks how to do this; the responder said to add a library to the > path. Better would be to simply support zip out-of-the-box as a default > format. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7416) Updates required to dependencies to resolve potential security vulnerabilities
Bradley Parker created DRILL-7416: - Summary: Updates required to dependencies to resolve potential security vulnerabilities Key: DRILL-7416 URL: https://issues.apache.org/jira/browse/DRILL-7416 Project: Apache Drill Issue Type: Bug Affects Versions: 1.16.0 Reporter: Bradley Parker

After running an OWASP Dependency Check and ruling out false positives, I have found 25 dependencies that should be updated to remove potential vulnerabilities. They are listed alphabetically with their CVE information below. [CVSS scores|https://en.wikipedia.org/wiki/Common_Vulnerability_Scoring_System] represent the severity of a vulnerability on a scale of 1-10, 10 being critical. [CVEs|https://en.wikipedia.org/wiki/Common_Vulnerabilities_and_Exposures] are public identifiers used to reference known vulnerabilities.

Package: avro-1.8.2
Should be: 1.9.0 (*Existing item at DRILL-7302*)
Max CVE (CVSS): CVE-2018-10237 (5.9)
Complete CVE list: CVE-2018-10237

Package: commons-beanutils-1.9.2
Should be: 1.9.4
Max CVE (CVSS): CVE-2019-10086 (7.3)
Complete CVE list: CVE-2019-10086

Package: commons-beanutils-core-1.8.0
Should be: Moved to commons-beanutils
Max CVE (CVSS): CVE-2014-0114 (7.5)
Complete CVE list: CVE-2014-0114
Note: Deprecated, replaced by commons-beanutils

Package: converter-jackson
Should be: 2.5.0
Max CVE (CVSS): CVE-2018-1000850 (7.5)
Complete CVE list: CVE-2018-1000850

Package: derby-10.10.2.0
Should be: 10.14.2.0
Max CVE (CVSS): CVE-2015-1832 (9.1)
Complete CVE list: CVE-2015-1832 CVE-2018-1313

Package: drill-hive-exec-shaded
Should be: New release needed with updated Guava
Max CVE (CVSS): CVE-2018-10237 (7.5)
Complete CVE list: CVE-2018-10237

Package: drill-java-exec
Should be: New release needed with updated jQuery and Bootstrap
Max CVE (CVSS): CVE-2019-11358 (6.1)
Complete CVE list: CVE-2018-14040 CVE-2018-14041 CVE-2018-14042 CVE-2019-8331 CVE-2019-11358

Package: drill-shaded-guava-23
Should be: New release needed with updated Guava
Max CVE (CVSS): CVE-2018-10237 (5.9)
Complete CVE list: CVE-2018-10237

Package: guava-19.0
Should be: 24.1.1
Max CVE (CVSS): CVE-2018-10237 (5.9)
Complete CVE list: CVE-2018-10237

Package: hadoop-yarn-common-2.7.4
Should be: 3.2.1
Max CVE (CVSS): CVE-2019-11358 (6.1)
Complete CVE list: CVE-2012-6708 CVE-2015-9251 CVE-2019-11358 CVE-2010-5312 CVE-2016-7103

Package: hbase-http-2.1.1.jar
Should be: 2.1.4
Max CVE (CVSS): CVE-2019-0212 (7.5)
Complete CVE list: CVE-2019-0212

Package: httpclient-4.2.5.jar
Should be: 4.3.6
Max CVE (CVSS): CVE-2014-3577 (5.8)
Complete CVE list: CVE-2014-3577 CVE-2015-5262

Package: jackson-databind-2.9.5
Should be: 2.10.0
Max CVE (CVSS): CVE-2018-14721 (10)
Complete CVE list: CVE-2019-17267 CVE-2019-16943 CVE-2019-16942 CVE-2019-16335 CVE-2019-14540 CVE-2019-14439 CVE-2019-14379 CVE-2018-11307 CVE-2019-12384 CVE-2019-12814 CVE-2019-12086 CVE-2018-12023 CVE-2018-12022 CVE-2018-19362 CVE-2018-19361 CVE-2018-19360 CVE-2018-14721 CVE-2018-14720 CVE-2018-14719 CVE-2018-14718 CVE-2018-1000873

Package: jetty-server-9.3.25.v20180904.jar (*Existing DRILL-7135, but that's to go to 9.4 and it's blocked; we should go to the latest 9.3 in the meantime*)
Should be: 9.3.27.v20190418
Max CVE (CVSS): CVE-2017-9735 (7.5)
Complete CVE list: CVE-2017-9735 CVE-2019-10241 CVE-2019-10247

Package: Kafka 0.11.0.1
Should be: 2.2.0 (*Existing item DRILL-6739*)
Max CVE (CVSS): CVE-2018-17196 (8.8)
Complete CVE list: CVE-2018-17196 CVE-2018-1288 CVE-2017-12610

Package: kudu-client-1.3.0.jar
Should be: 1.10.0
Max CVE (CVSS): CVE-2015-5237 (8.8)
Complete CVE list: CVE-2018-10237 CVE-2015-5237 CVE-2019-16869
Note: Only a partial fix; no fix for netty CVE-2019-16869 (7.5), kudu still needs to update their netty (this is not unexpected as this CVE is newer)

Package: libfb303-0.9.3.jar
Should be: 0.12.0
Max CVE (CVSS): CVE-2018-1320 (7.5)
Complete CVE list: CVE-2018-1320
Note: Moved to libthrift

Package: okhttp-3.3.0
Should be: 3.12.0
Max CVE (CVSS): CVE-2018-20200 (5.9)
Complete CVE list: CVE-2018-20200

Package: protobuf-java-2.5.0
Should be: 3.4.0
Max CVE (CVSS): CVE-2015-5237 (8.8)
Complete CVE list: CVE-2015-5237

Package: retrofit-2.1.0
Should be: 2.5.0
Max CVE (CVSS): CVE-2018-1000850 (7.5)
Complete CVE list: CVE-2018-1000850

Package: scala-library-2.11.0
Should be: 2.11.12
Max CVE (CVSS): CVE-2017-15288 (7.8)
Complete CVE list: CVE-2017-15288

Package: serializer-2.7.1
Should be: 2.7.2
Max CVE (CVSS): CVE-2014-0107 (7.5)
Complete CVE list: CVE-2014-0107

Package: xalan-2.7.1
Should be: 2.7.2
Max CVE (CVSS): CVE-2014-0107 (7.5)
Complete CVE list: CVE-2014-0107

Package: xercesImpl-2.11.0
Should be: 2.12.0
Max CVE (CVSS): CVE-2012-0881 (7.5)
Complete CVE list: CVE-2012-0881

Package: zookeeper-3.4.12
Should be: 3.4.14
Max CVE (CVSS): CVE-2019-0201 (5.9)
Complete CVE list: CVE-2019-0201

Additional keywords for searching: Vulnerabili
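The report above comes from an OWASP Dependency-Check run. For reproducing such a report in a Maven build, the usual route is the `dependency-check-maven` plugin; the fragment below is an illustrative configuration (the plugin coordinates are the project's standard ones, but the version shown is only an example and should be checked against the plugin's releases):

```xml
<!-- Illustrative pom.xml fragment: runs an OWASP dependency scan
     during the build; version number is an example only. -->
<plugin>
  <groupId>org.owasp</groupId>
  <artifactId>dependency-check-maven</artifactId>
  <version>5.2.2</version>
  <executions>
    <execution>
      <goals>
        <goal>check</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

It can also be run ad hoc, without touching the pom, via `mvn org.owasp:dependency-check-maven:check`; false positives like those mentioned above are then suppressed with the plugin's suppression-file mechanism.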
[jira] [Updated] (DRILL-3850) Execute multiple commands from sqlline -q
[ https://issues.apache.org/jira/browse/DRILL-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-3850: Fix Version/s: 1.17.0 > Execute multiple commands from sqlline -q > - > > Key: DRILL-3850 > URL: https://issues.apache.org/jira/browse/DRILL-3850 > Project: Apache Drill > Issue Type: Bug > Components: Client - CLI >Affects Versions: 1.1.0, 1.2.0 > Environment: Mint 17.1 >Reporter: Philip Deegan >Priority: Major > Fix For: 1.17.0 > > > Be able to perform > {noformat} > ./sqlline -u jdbc:drill:zk=local -q "use dfs.tmp; alter session set > \`store.format\`='csv';" > {noformat} > instead of > {noformat} > ./sqlline -u jdbc:drill:zk=local -q "use dfs.tmp;" > ./sqlline -u jdbc:drill:zk=local -q "alter session set > \`store.format\`='csv';" > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (DRILL-3850) Execute multiple commands from sqlline -q
[ https://issues.apache.org/jira/browse/DRILL-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva resolved DRILL-3850. - Resolution: Fixed Fixed in the scope of DRILL-7401. > Execute multiple commands from sqlline -q > - > > Key: DRILL-3850 > URL: https://issues.apache.org/jira/browse/DRILL-3850 > Project: Apache Drill > Issue Type: Bug > Components: Client - CLI >Affects Versions: 1.1.0, 1.2.0 > Environment: Mint 17.1 >Reporter: Philip Deegan >Priority: Major > Fix For: 1.17.0 > > > Be able to perform > {noformat} > ./sqlline -u jdbc:drill:zk=local -q "use dfs.tmp; alter session set > \`store.format\`='csv';" > {noformat} > instead of > {noformat} > ./sqlline -u jdbc:drill:zk=local -q "use dfs.tmp;" > ./sqlline -u jdbc:drill:zk=local -q "alter session set > \`store.format\`='csv';" > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7415) Information schema query fails for postgres foreign tables
Igor Guzenko created DRILL-7415: --- Summary: Information schema query fails for postgres foreign tables Key: DRILL-7415 URL: https://issues.apache.org/jira/browse/DRILL-7415 Project: Apache Drill Issue Type: Bug Affects Versions: 1.16.0 Reporter: Igor Guzenko Assignee: Igor Guzenko Fix For: Future

1) Set up a JDBC driver in Drill pointing to Postgres.
2) Create public foreign tables like the ones below in Postgres:
public | vessel | foreign table | postgres
public | vessel_movement | foreign table | postgres
public | vessel_movement_hist | foreign table | postgres
3) Execute this query in Drill:
{code:sql}SELECT * FROM `INFORMATION_SCHEMA`.`TABLES`;{code}

*Actual result*
{code}
Caused by: java.lang.IllegalArgumentException: Multiple entries with same key: vessel=JdbcTable {vessel} and vessel=JdbcTable {vessel}
  at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:136) ~[guava-19.0.jar:na]
  at com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:98) ~[guava-19.0.jar:na]
  at com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:84) ~[guava-19.0.jar:na]
  at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:295) ~[guava-19.0.jar:na]
  at org.apache.calcite.adapter.jdbc.JdbcSchema.computeTables(JdbcSchema.java:269) ~[calcite-core-1.18.0-drill-r0.jar:1.18.0-drill-r0]
  at org.apache.calcite.adapter.jdbc.JdbcSchema.getTableMap(JdbcSchema.java:285) ~[calcite-core-1.18.0-drill-r0.jar:1.18.0-drill-r0]
  at org.apache.calcite.adapter.jdbc.JdbcSchema.getTableNames(JdbcSchema.java:410) ~[calcite-core-1.18.0-drill-r0.jar:1.18.0-drill-r0]
  at org.apache.drill.exec.store.jdbc.JdbcStoragePlugin$CapitalizingJdbcSchema.getTableNames(JdbcStoragePlugin.java:282) ~[drill-jdbc-storage-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.AbstractSchema.getTableNamesAndTypes(AbstractSchema.java:299) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaRecordGenerator$Tables.visitTables(InfoSchemaRecordGenerator.java:340) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaRecordGenerator.scanSchema(InfoSchemaRecordGenerator.java:254) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaRecordGenerator.scanSchema(InfoSchemaRecordGenerator.java:247) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaRecordGenerator.scanSchema(InfoSchemaRecordGenerator.java:247) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaRecordGenerator.scanSchema(InfoSchemaRecordGenerator.java:234) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaTableType.getRecordReader(InfoSchemaTableType.java:58) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaBatchCreator.getBatch(InfoSchemaBatchCreator.java:34) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaBatchCreator.getBatch(InfoSchemaBatchCreator.java:30) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:159) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:182) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:137) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:182) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.physical.impl.ImplCreator.getRootExec(ImplCreator.java:110) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:87) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:263) [drill-java-exec-1.16.0.jar:1.16.0]
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
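The root cause in the trace above is Guava's `ImmutableMap.Builder` rejecting two tables that share the name `vessel` when Calcite's `JdbcSchema` computes its table map. The JDK's immutable maps fail the same way, which makes the failure mode easy to reproduce in isolation (the sketch below uses `java.util.Map.of` as a stand-in for the Guava builder):

```java
import java.util.Map;

public class DuplicateKeyDemo {

    public static void main(String[] args) {
        try {
            // Two entries keyed "vessel", as when the JDBC schema sees the
            // same foreign table twice while building its table map.
            Map<String, String> tables = Map.of(
                "vessel", "JdbcTable{vessel}",
                "vessel", "JdbcTable{vessel}");
            System.out.println("built: " + tables);
        } catch (IllegalArgumentException e) {
            // Immutable map construction rejects duplicate keys outright.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A fix therefore has to deduplicate (or disambiguate) table names before the immutable map is built, rather than catching the exception afterwards.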
[jira] [Commented] (DRILL-7401) Sqlline 1.9 upgrade
[ https://issues.apache.org/jira/browse/DRILL-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956046#comment-16956046 ] ASF GitHub Bot commented on DRILL-7401: --- asfgit commented on pull request #1875: DRILL-7401: Upgrade to SqlLine 1.9.0 URL: https://github.com/apache/drill/pull/1875 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Sqlline 1.9 upgrade > --- > > Key: DRILL-7401 > URL: https://issues.apache.org/jira/browse/DRILL-7401 > Project: Apache Drill > Issue Type: Task >Reporter: Arina Ielchiieva >Assignee: Arina Ielchiieva >Priority: Major > Labels: ready-to-commit > Fix For: 1.17.0 > > > Upgrade to SqlLine 1.9 once it is released > (https://github.com/julianhyde/sqlline/issues/350). > *TODO:* > 1. Add SqlLine properties: > {{connectInteractionMode: useNPTogetherOrEmpty}} - supports connection > mechanism used in SqlLine 1.17 and earlier: > a. if user and password are not indicated, connects without them (user and > password are set to empty string): {{./drill-embedded}} > b. if user is indicated, asks for password in interactive mode: > {{./drill-embedded -n "user1"}} > c. if user is indicated as empty string, behaves like in point a (user and > password are set to empty string): {{./drill-embedded -n ""}} > d. if user and password are indicated, connects using provided input > {{./drill-embedded -n "user1" -p "123"}} > {{showLineNumbers: true}} - adds line numbers when query is more than one > line: > {noformat} > apache drill> select > 2..semicolon> * > 3..semicolon> from > 4..semicolon> sys.version; > {noformat} > 2. Remove nohup support code from sqlline.sh since it is not needed any more > (nohup support works without flag): > {code} > To add nohup support for SQLline script > if [[ ( ! 
$(ps -o stat= -p $$) =~ "+" ) && ! ( -p /dev/stdin ) ]]; then >export SQLLINE_JAVA_OPTS="$SQLLINE_JAVA_OPTS > -Djline.terminal=jline.UnsupportedTerminal" > fi > {code} > 3. Add {{-Dorg.jline.terminal.dumb=true}} to avoid JLine terminal warning > when submitting query in sqlline.sh to execute via {{-e}} or {{-f}}: > {noformat} > Oct 11, 2019 2:14:45 PM org.jline.utils.Log logr > WARNING: Unable to create a system terminal, creating a dumb terminal (enable > debug logging for more information) > {noformat} > 4. Remove unneeded echo commands in sqlline.bat during start up: > {noformat} > drill-embedded.bat > DRILL_ARGS - " -u jdbc:drill:zk=local -n user1 -p ppp" > Calculating HADOOP_CLASSPATH ... > HBASE_HOME not detected... > Calculating Drill classpath... > Apache Drill 1.17.0-SNAPSHOT > "Data is the new oil. Ready to Drill some?" > apache drill> > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7405) Build fails due to inaccessible apache-drill on S3 storage
[ https://issues.apache.org/jira/browse/DRILL-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956049#comment-16956049 ] ASF GitHub Bot commented on DRILL-7405: --- asfgit commented on pull request #1874: DRILL-7405: Avoiding download of TPC-H data URL: https://github.com/apache/drill/pull/1874 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Build fails due to inaccessible apache-drill on S3 storage > -- > > Key: DRILL-7405 > URL: https://issues.apache.org/jira/browse/DRILL-7405 > Project: Apache Drill > Issue Type: Task > Components: Tools, Build & Test >Affects Versions: 1.16.0 >Reporter: Boaz Ben-Zvi >Assignee: Abhishek Girish >Priority: Critical > Labels: ready-to-commit > Fix For: 1.17.0 > > > A new clean build (e.g. after deleting the ~/.m2 local repository) would > fail now due to: > Access denied to: > [http://apache-drill.s3.amazonaws.com|https://urldefense.proofpoint.com/v2/url?u=http-3A__apache-2Ddrill.s3.amazonaws.com_files_sf-2D0.01-5Ftpc-2Dh-5Fparquet-5Ftyped.tgz&d=DwMGaQ&c=C5b8zRQO1miGmBeVZ2LFWg&r=KLC1nKJ8dIOnUay2kR6CAw&m=08mf7Xfn1orlbAA60GKLIuj_PTtfaSAijrKDLOucMPU&s=CX97We3sm3ZZ_aVJIrsUdXVJ3CNMYg7p3IsxbJpuXWk&e=] > > (e.g., for the test data sf-0.01_tpc-h_parquet_typed.tgz ) > A new publicly available storage place is needed, plus appropriate changes in > Drill to get to these resources. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7402) Suppress batch dumps for expected failures in tests
[ https://issues.apache.org/jira/browse/DRILL-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956047#comment-16956047 ] ASF GitHub Bot commented on DRILL-7402: --- asfgit commented on pull request #1872: DRILL-7402: Suppress batch dumps for expected failures in tests URL: https://github.com/apache/drill/pull/1872 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Suppress batch dumps for expected failures in tests > --- > > Key: DRILL-7402 > URL: https://issues.apache.org/jira/browse/DRILL-7402 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.16.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > Drill provides a way to dump the last few batches when an error occurs. > However, in tests, we often deliberately cause something to fail. In this > case, the batch dump is unnecessary. > This enhancement adds a config property, disabled in tests, that controls the > dump activity. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7412) Minor unit test improvements
[ https://issues.apache.org/jira/browse/DRILL-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956048#comment-16956048 ] ASF GitHub Bot commented on DRILL-7412: --- asfgit commented on pull request #1876: DRILL-7412: Minor unit test improvements URL: https://github.com/apache/drill/pull/1876 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Minor unit test improvements > > > Key: DRILL-7412 > URL: https://issues.apache.org/jira/browse/DRILL-7412 > Project: Apache Drill > Issue Type: Improvement >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > Many tests intentionally trigger errors. A debug-only log setting sent those > errors to stdout. The resulting stack dumps simply cluttered the test output, > so disabled error output to the console. > Drill can apply bounds checks to vectors. Tests run via Maven enable bounds > checking. Now, bounds checking is also enabled in "debug mode" (when > assertions are enabled, as in an IDE.) > Drill contains two test frameworks. The older BaseTestQuery was marked as > deprecated, but many tests still use it and are unlikely to be changed soon. > So, removed the deprecated marker to reduce the number of spurious warnings. -- This message was sent by Atlassian Jira (v8.3.4#803005)
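The "bounds checking enabled in debug mode (when assertions are enabled, as in an IDE)" behavior described above is typically implemented with the standard assert-with-side-effect idiom. A sketch of that idiom (not Drill's actual code; class and field names are illustrative):

```java
public class AssertDetect {

    static boolean assertionsEnabled() {
        boolean enabled = false;
        // The assignment inside the assert statement only executes when
        // the JVM runs with -ea; otherwise 'enabled' stays false.
        assert enabled = true;
        return enabled;
    }

    // Computed once at class load: true under -ea (Maven surefire, IDEs
    // with assertions on), false in a normal production JVM -- the kind
    // of "debug mode" switch that can gate vector bounds checking.
    static final boolean DEBUG = assertionsEnabled();

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + DEBUG);
    }
}
```

Because `DEBUG` is a compile-time-constant-free static final boolean, the JIT can drop the guarded bounds checks entirely in production, which is why this pattern costs nothing when assertions are off.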
[jira] [Commented] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955998#comment-16955998 ] ASF GitHub Bot commented on DRILL-7414: --- arina-ielchiieva commented on pull request #1878: DRILL-7414: EVF incorrectly sets buffer writer index after rollover URL: https://github.com/apache/drill/pull/1878#discussion_r336972303 ## File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/resultSet/impl/TestResultSetLoaderOverflow.java ## @@ -59,55 +61,64 @@ @Test public void testVectorSizeLimit() { -TupleMetadata schema = new SchemaBuilder() +final TupleMetadata schema = new SchemaBuilder() Review comment: Since Java supports effectively final variables, there is no need for excessive use of the final keyword unless you want to explicitly indicate that a variable is final. I am not going to request the change during code review; I guess this mostly relates to the developer's code-writing style, but since you are doing lots of refactoring adding the final keyword, I just wanted to highlight that it might be unnecessary. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > EVF incorrectly sets buffer writer index after rollover > --- > > Key: DRILL-7414 > URL: https://issues.apache.org/jira/browse/DRILL-7414 > Project: Apache Drill > Issue Type: Bug >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > A full test run, with vector validation enabled and with the "new" scan > enabled, revealed the following in {{TestMockPlugin.testSizeLimit()}}: > {noformat} > comments_s2 - VarCharVector: Row count = 838, but value count = 839 > {noformat} > Adding vector validation to the result set loader overflow tests reveals that > the problem is in overflow.
In > {{TestResultSetLoaderOverflow.testOverflowWithNullables()}}: > {noformat} > a - RepeatedIntVector: Row count = 2952, but value count = 2953 > b - RepeatedVarCharVector: Row count = 2952, but value count = 2953 > b - RepeatedVarCharVector: Vector has 2953 values, but offset vector labels > 32472 values > c - RepeatedIntVector: Row count = 2952, but value count = 2953 > d - RepeatedIntVector: Row count = 2952, but value count = 2953 > {noformat} > The problem is that EVF incorrectly sets the offset buffer writer index after > a rollover. -- This message was sent by Atlassian Jira (v8.3.4#803005)
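The review comment above about effectively final variables can be illustrated in a few lines: since Java 8, a local variable captured by a lambda or inner class does not need an explicit final modifier as long as it is assigned exactly once. This is a standalone example, unrelated to Drill's code.

```java
// Illustration of "effectively final": a once-assigned local can be
// captured by a lambda without the final keyword.
import java.util.function.Supplier;

class EffectivelyFinalDemo {

  static String demo() {
    String schemaName = "mySchema";           // effectively final: never reassigned
    Supplier<String> s = () -> schemaName;    // legal capture without 'final'
    // schemaName = "other";                  // uncommenting breaks the capture
    return s.get();
  }
}
```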
[jira] [Assigned] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva reassigned DRILL-7414: --- Assignee: Paul Rogers (was: Arina Ielchiieva)
[jira] [Updated] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7414: Reviewer: Arina Ielchiieva
[jira] [Updated] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7414: Fix Version/s: 1.17.0
[jira] [Updated] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7414: Labels: ready-to-commit (was: )
[jira] [Assigned] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva reassigned DRILL-7414: --- Assignee: Arina Ielchiieva (was: Paul Rogers)
[jira] [Commented] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955991#comment-16955991 ] ASF GitHub Bot commented on DRILL-7414: --- arina-ielchiieva commented on pull request #1878: DRILL-7414: EVF incorrectly sets buffer writer index after rollover URL: https://github.com/apache/drill/pull/1878#discussion_r336971442 ## File path: exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/OffsetVectorWriterImpl.java ## @@ -290,7 +290,7 @@ public void preRollover() { // rows. But, this being an offset vector, we add one to account // for the extra 0 value at the start. -setValueCount(vectorIndex.rowStartIndex() + 1); +setValueCount(vectorIndex.rowStartIndex()); Review comment: ` But, this being an offset vector, we add one to account for the extra 0 value at the start.` - Should this comment be updated after the change? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
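The preRollover() change discussed above hinges on the offset-vector convention: a variable-width vector holding n values carries n + 1 offsets, where offsets[0] is always 0 and offsets[i + 1] marks the end of value i. Setting the value count one too high after rollover is exactly the off-by-one mismatch shown in the bug report. The following is a standalone sketch of that convention, not Drill's actual code.

```java
// Standalone illustration of the offset-buffer layout used by
// variable-width vectors: n values require n + 1 offsets.
class OffsetVectorSketch {

  // Builds the offset buffer for values of the given byte lengths.
  static int[] offsetsFor(int[] valueLengths) {
    int[] offsets = new int[valueLengths.length + 1];
    offsets[0] = 0; // the extra leading zero the code comment refers to
    for (int i = 0; i < valueLengths.length; i++) {
      offsets[i + 1] = offsets[i] + valueLengths[i];
    }
    return offsets;
  }
}
```

For three values of lengths 3, 1, and 2, the buffer is [0, 3, 4, 6]: four offsets for three values, which is why a "value count" one higher than the row count trips the validator.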
[jira] [Commented] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955993#comment-16955993 ] ASF GitHub Bot commented on DRILL-7414: --- arina-ielchiieva commented on issue #1878: DRILL-7414: EVF incorrectly sets buffer writer index after rollover URL: https://github.com/apache/drill/pull/1878#issuecomment-544478701 @paul-rogers mostly looks good, one minor concern about whether the comment needs to be updated after the code change. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (DRILL-7413) Scan operator does not set the container record count
[ https://issues.apache.org/jira/browse/DRILL-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7413: Reviewer: Arina Ielchiieva > Scan operator does not set the container record count > - > > Key: DRILL-7413 > URL: https://issues.apache.org/jira/browse/DRILL-7413 > Project: Apache Drill > Issue Type: Bug >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > Enable the vector checking provided in DRILL-7403. Enable just for the JSON > reader. You will get the following error: > {noformat} > 12:36:57.399 [22549a3d-a937-df51-2e13-4b032ba143f9:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > ScanBatch > ScanBatch: Container record count not set > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7413) Scan operator does not set the container record count
[ https://issues.apache.org/jira/browse/DRILL-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7413: Fix Version/s: 1.17.0
[jira] [Commented] (DRILL-7413) Scan operator does not set the container record count
[ https://issues.apache.org/jira/browse/DRILL-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955986#comment-16955986 ] ASF GitHub Bot commented on DRILL-7413: --- arina-ielchiieva commented on issue #1877: DRILL-7413: Test and fix scan operator vectors URL: https://github.com/apache/drill/pull/1877#issuecomment-544476885 LGTM, +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (DRILL-7413) Scan operator does not set the container record count
[ https://issues.apache.org/jira/browse/DRILL-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955985#comment-16955985 ] ASF GitHub Bot commented on DRILL-7413: --- arina-ielchiieva commented on issue #1877: DRILL-7413: Test and fix scan operator vectors URL: https://github.com/apache/drill/pull/1877#issuecomment-544476885 +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (DRILL-7413) Scan operator does not set the container record count
[ https://issues.apache.org/jira/browse/DRILL-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7413: Labels: ready-to-commit (was: )
[jira] [Updated] (DRILL-7405) Build fails due to inaccessible apache-drill on S3 storage
[ https://issues.apache.org/jira/browse/DRILL-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7405: Labels: ready-to-commit (was: ) > Build fails due to inaccessible apache-drill on S3 storage > -- > > Key: DRILL-7405 > URL: https://issues.apache.org/jira/browse/DRILL-7405 > Project: Apache Drill > Issue Type: Task > Components: Tools, Build & Test >Affects Versions: 1.16.0 >Reporter: Boaz Ben-Zvi >Assignee: Abhishek Girish >Priority: Critical > Labels: ready-to-commit > Fix For: 1.17.0 > > > A new clean build (e.g. after deleting the ~/.m2 local repository) would > fail now due to: > Access denied to: > http://apache-drill.s3.amazonaws.com/files/sf-0.01_tpc-h_parquet_typed.tgz > (e.g., for the test data sf-0.01_tpc-h_parquet_typed.tgz ) > A new publicly available storage place is needed, plus appropriate changes in > Drill to get to these resources. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7405) Build fails due to inaccessible apache-drill on S3 storage
[ https://issues.apache.org/jira/browse/DRILL-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955984#comment-16955984 ] ASF GitHub Bot commented on DRILL-7405: --- arina-ielchiieva commented on issue #1874: DRILL-7405: Avoiding download of TPC-H data URL: https://github.com/apache/drill/pull/1874#issuecomment-544476019 +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (DRILL-7403) Validate batch checks, vector integretity in unit tests
[ https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955982#comment-16955982 ] ASF GitHub Bot commented on DRILL-7403: --- arina-ielchiieva commented on issue #1871: DRILL-7403: Validate batch checks, vector integretity in unit tests URL: https://github.com/apache/drill/pull/1871#issuecomment-544475715 @paul-rogers sorry, I could not merge this commit, since there are a couple of minor comments. Mostly I am worried about the absent `print` method. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Validate batch checks, vector integretity in unit tests > --- > > Key: DRILL-7403 > URL: https://issues.apache.org/jira/browse/DRILL-7403 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.16.0, 1.17.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > Drill provides a {{BatchValidator}} that checks vectors. It is disabled by > default. This enhancement adds more checks, including checks for row counts > (of which there are surprisingly many.) > Since most operators will fail if the check is enabled, this enhancement also > adds a table to keep track of which operators pass the checks (and for which > checks should be enabled) and those that still need work. This allows the > checks to exist in the code, and to be enabled incrementally as we fix the > various problems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
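The row-count checks described above can be sketched in a few lines: every vector in a batch should report a value count equal to the batch's row count, and mismatches are collected as errors (as in the "Row count = 838, but value count = 839" messages quoted in related messages above). This is a hedged sketch of the idea only; the names are illustrative, not Drill's BatchValidator API.

```java
// Illustrative sketch of a per-batch row-count consistency check:
// collect one error per vector whose value count disagrees with the
// batch's row count. Not Drill's actual BatchValidator code.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class RowCountCheck {

  static List<String> validate(int rowCount, Map<String, Integer> valueCounts) {
    List<String> errors = new ArrayList<>();
    valueCounts.forEach((name, count) -> {
      if (count != rowCount) {
        errors.add(String.format("%s: Row count = %d, but value count = %d",
            name, rowCount, count));
      }
    });
    return errors;
  }
}
```

Collecting errors rather than failing fast matches the incremental-adoption approach in the description: operators that still fail the checks can be tracked while the checks remain in the code.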
[jira] [Updated] (DRILL-7412) Minor unit test improvements
[ https://issues.apache.org/jira/browse/DRILL-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7412: Reviewer: Arina Ielchiieva
[jira] [Updated] (DRILL-7412) Minor unit test improvements
[ https://issues.apache.org/jira/browse/DRILL-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7412: Fix Version/s: 1.17.0
[jira] [Commented] (DRILL-7412) Minor unit test improvements
[ https://issues.apache.org/jira/browse/DRILL-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955979#comment-16955979 ] ASF GitHub Bot commented on DRILL-7412: --- arina-ielchiieva commented on issue #1876: DRILL-7412: Minor unit test improvements URL: https://github.com/apache/drill/pull/1876#issuecomment-544475117 +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (DRILL-7412) Minor unit test improvements
[ https://issues.apache.org/jira/browse/DRILL-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7412: Labels: ready-to-commit (was: )
[jira] [Commented] (DRILL-7403) Validate batch checks, vector integrity in unit tests
[ https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955976#comment-16955976 ] ASF GitHub Bot commented on DRILL-7403: --- arina-ielchiieva commented on pull request #1871: DRILL-7403: Validate batch checks, vector integretity in unit tests URL: https://github.com/apache/drill/pull/1871#discussion_r336965873 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/validate/BatchValidator.java ## @@ -45,103 +49,373 @@ */ public class BatchValidator { - private static final org.slf4j.Logger logger = - org.slf4j.LoggerFactory.getLogger(BatchValidator.class); + private static final Logger logger = LoggerFactory.getLogger(BatchValidator.class); + public static final boolean LOG_TO_STDOUT = true; public static final int MAX_ERRORS = 100; - private final int rowCount; - private final VectorAccessible batch; - private final List errorList; - private int errorCount; + public interface ErrorReporter { +void error(String name, ValueVector vector, String msg); +void warn(String name, ValueVector vector, String msg); +void error(String msg); +int errorCount(); + } + + public abstract static class BaseErrorReporter implements ErrorReporter { + +private final String opName; +private int errorCount; + +public BaseErrorReporter(String opName) { + this.opName = opName; +} + +protected boolean startError() { + if (errorCount == 0) { +warn("Found one or more vector errors from " + opName); + } + errorCount++; + if (errorCount >= MAX_ERRORS) { +return false; + } + return true; +} + +@Override +public void error(String name, ValueVector vector, String msg) { + error(String.format("%s - %s: %s", +name, vector.getClass().getSimpleName(), msg)); +} + +@Override +public void warn(String name, ValueVector vector, String msg) { + warn(String.format("%s - %s: %s", +name, vector.getClass().getSimpleName(), msg)); +} + +public abstract void warn(String msg); + +@Override +public int 
errorCount() { return errorCount; } + } + + private static class StdOutReporter extends BaseErrorReporter { + +public StdOutReporter(String opName) { + super(opName); +} + +@Override +public void error(String msg) { + if (startError()) { +System.out.println(msg); + } +} + +@Override +public void warn(String msg) { + System.out.println(msg); +} + } + + private static class LogReporter extends BaseErrorReporter { + +public LogReporter(String opName) { + super(opName); +} + +@Override +public void error(String msg) { + if (startError()) { +logger.error(msg); + } +} + +@Override +public void warn(String msg) { + logger.error(msg); +} + } + + private enum CheckMode { COUNTS, ALL }; + + private static final Map, CheckMode> checkRules = buildRules(); - public BatchValidator(VectorAccessible batch) { -rowCount = batch.getRecordCount(); -this.batch = batch; -errorList = null; + private final ErrorReporter errorReporter; + + public BatchValidator(ErrorReporter errorReporter) { +this.errorReporter = errorReporter; + } + + /** + * At present, most operators will not pass the checks here. The following + * table identifies those that should be checked, and the degree of check. + * Over time, this table should include all operators, and thus become + * unnecessary. + */ + private static Map, CheckMode> buildRules() { +final Map, CheckMode> rules = new IdentityHashMap<>(); +//rules.put(OperatorRecordBatch.class, CheckMode.ALL); +return rules; } - public BatchValidator(VectorAccessible batch, boolean captureErrors) { -rowCount = batch.getRecordCount(); -this.batch = batch; -if (captureErrors) { - errorList = new ArrayList<>(); + public static boolean validate(RecordBatch batch) { +final CheckMode checkMode = checkRules.get(batch.getClass()); + +// If no rule, don't check this batch. + +if (checkMode == null) { + + // As work proceeds, might want to log those batches not checked. + // For now, there are too many. 
+ + return true; +} + +// All batches that do any checks will at least check counts. + +final ErrorReporter reporter = errorReporter(batch); +final int rowCount = batch.getRecordCount(); +int valueCount = rowCount; +final VectorContainer container = batch.getContainer(); +if (!container.hasRecordCount()) { + reporter.error(String.format( + "%s: Container record count not set", + batch.getClass().getSimpleName())); } else { - errorList = null; + // Row count will = container count for most operators. + // Row c
[jira] [Commented] (DRILL-7403) Validate batch checks, vector integrity in unit tests
[ https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955974#comment-16955974 ] ASF GitHub Bot commented on DRILL-7403: --- arina-ielchiieva commented on pull request #1871: DRILL-7403: Validate batch checks, vector integretity in unit tests URL: https://github.com/apache/drill/pull/1871#discussion_r336966111 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/validate/BatchValidator.java ## @@ -150,57 +424,63 @@ private int validateOffsetVector(String name, UInt4Vector offsetVector, int valu error(name, offsetVector, "Offset (0) must be 0 but was " + prevOffset); } -// Note <= comparison: offset vectors have (n+1) entries. - -for (int i = 1; i <= valueCount; i++) { - int offset = accessor.get(i); +for (int i = 1; i < offsetCount; i++) { + final int offset = accessor.get(i); if (offset < prevOffset) { -error(name, offsetVector, "Decreasing offsets at (" + (i-1) + ", " + i + ") = (" + prevOffset + ", " + offset + ")"); +error(name, offsetVector, String.format( +"Offset vector [%d] contained %d, expected >= %d", +i, offset, prevOffset)); } else if (offset > maxOffset) { -error(name, offsetVector, "Invalid offset at index " + i + " = " + offset + " exceeds maximum of " + maxOffset); +error(name, offsetVector, String.format( +"Invalid offset at index %d: %d exceeds maximum of %d", +i, offset, maxOffset)); } prevOffset = offset; } return prevOffset; } private void error(String name, ValueVector vector, String msg) { -if (errorCount == 0) { - logger.error("Found one or more vector errors from " + batch.getClass().getSimpleName()); -} -errorCount++; -if (errorCount >= MAX_ERRORS) { - return; -} -String fullMsg = "Column " + name + " of type " + vector.getClass().getSimpleName( ) + ": " + msg; -logger.error(fullMsg); -if (errorList != null) { - errorList.add(fullMsg); -} +errorReporter.error(name, vector, msg); } - private void validateNullableVector(String 
name, NullableVector vector) { -// Can't validate at this time because the bits vector is in each -// generated subtype. - -// Validate a VarChar vector because it is common. - -if (vector instanceof NullableVarCharVector) { - VarCharVector values = ((NullableVarCharVector) vector).getValuesVector(); - validateVarCharVector(name + "-values", values, rowCount); + private void verifyIsSetVector(ValueVector parent, UInt1Vector bv) { +final String name = String.format("%s (%s)-bits", +parent.getField().getName(), +parent.getClass().getSimpleName()); +final int rowCount = parent.getAccessor().getValueCount(); +final int bitCount = bv.getAccessor().getValueCount(); +if (bitCount != rowCount) { + error(name, bv, String.format( + "Value count = %d, but bit count = %d", + rowCount, bitCount)); +} +final UInt1Vector.Accessor ba = bv.getAccessor(); +for (int i = 0; i < bitCount; i++) { + final int value = ba.get(i); + if (value != 0 && value != 1) { +error(name, bv, String.format( Review comment: Looks like string format expecting 4 parameters, but only two are passed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Validate batch checks, vector integretity in unit tests > --- > > Key: DRILL-7403 > URL: https://issues.apache.org/jira/browse/DRILL-7403 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.16.0, 1.17.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > Drill provides a {{BatchValidator}} that checks vectors. It is disabled by > default. This enhancement adds more checks, including checks for row counts > (of which there are surprisingly many.) 
> Since most operators will fail if the check is enabled, this enhancement also > adds a table to keep track of which operators pass the checks (and for which > checks should be enabled) and those that still need work. This allows the > checks to exist in the code, and to be enabled incrementally as we fix the > various problems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
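The offset-vector invariant tightened in the diff above can be stated standalone: for n values there are n + 1 offsets, the first offset is 0, offsets are non-decreasing, and no offset may exceed the size of the underlying data buffer. The following is a minimal illustration of that invariant, not Drill's actual `BatchValidator`; the class name and error-collection style are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the offset-vector check: collect one error message per violated
// rule instead of failing fast, mirroring the reporter style in the diff.
class OffsetVectorCheck {

  static List<String> validate(int[] offsets, int valueCount, int maxOffset) {
    List<String> errors = new ArrayList<>();
    // Offset vectors have (n + 1) entries for n values.
    if (offsets.length != valueCount + 1) {
      errors.add(String.format("Expected %d offsets, found %d",
          valueCount + 1, offsets.length));
    }
    if (offsets.length > 0 && offsets[0] != 0) {
      errors.add(String.format("Offset [0] must be 0 but was %d", offsets[0]));
    }
    int prev = offsets.length > 0 ? offsets[0] : 0;
    for (int i = 1; i < offsets.length; i++) {
      int offset = offsets[i];
      if (offset < prev) {
        errors.add(String.format(
            "Offset vector [%d] contained %d, expected >= %d", i, offset, prev));
      } else if (offset > maxOffset) {
        errors.add(String.format(
            "Invalid offset at index %d: %d exceeds maximum of %d", i, offset, maxOffset));
      }
      prev = offset;
    }
    return errors;
  }
}
```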
[jira] [Commented] (DRILL-7403) Validate batch checks, vector integrity in unit tests
[ https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955975#comment-16955975 ] ASF GitHub Bot commented on DRILL-7403: --- arina-ielchiieva commented on pull request #1871: DRILL-7403: Validate batch checks, vector integretity in unit tests URL: https://github.com/apache/drill/pull/1871#discussion_r336966265 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/validate/BatchValidator.java ## @@ -150,57 +424,63 @@ private int validateOffsetVector(String name, UInt4Vector offsetVector, int valu error(name, offsetVector, "Offset (0) must be 0 but was " + prevOffset); } -// Note <= comparison: offset vectors have (n+1) entries. - -for (int i = 1; i <= valueCount; i++) { - int offset = accessor.get(i); +for (int i = 1; i < offsetCount; i++) { + final int offset = accessor.get(i); if (offset < prevOffset) { -error(name, offsetVector, "Decreasing offsets at (" + (i-1) + ", " + i + ") = (" + prevOffset + ", " + offset + ")"); +error(name, offsetVector, String.format( +"Offset vector [%d] contained %d, expected >= %d", +i, offset, prevOffset)); } else if (offset > maxOffset) { -error(name, offsetVector, "Invalid offset at index " + i + " = " + offset + " exceeds maximum of " + maxOffset); +error(name, offsetVector, String.format( +"Invalid offset at index %d: %d exceeds maximum of %d", +i, offset, maxOffset)); } prevOffset = offset; } return prevOffset; } private void error(String name, ValueVector vector, String msg) { -if (errorCount == 0) { - logger.error("Found one or more vector errors from " + batch.getClass().getSimpleName()); -} -errorCount++; -if (errorCount >= MAX_ERRORS) { - return; -} -String fullMsg = "Column " + name + " of type " + vector.getClass().getSimpleName( ) + ": " + msg; -logger.error(fullMsg); -if (errorList != null) { - errorList.add(fullMsg); -} +errorReporter.error(name, vector, msg); } - private void validateNullableVector(String 
name, NullableVector vector) { -// Can't validate at this time because the bits vector is in each -// generated subtype. - -// Validate a VarChar vector because it is common. - -if (vector instanceof NullableVarCharVector) { - VarCharVector values = ((NullableVarCharVector) vector).getValuesVector(); - validateVarCharVector(name + "-values", values, rowCount); + private void verifyIsSetVector(ValueVector parent, UInt1Vector bv) { +final String name = String.format("%s (%s)-bits", +parent.getField().getName(), +parent.getClass().getSimpleName()); +final int rowCount = parent.getAccessor().getValueCount(); +final int bitCount = bv.getAccessor().getValueCount(); +if (bitCount != rowCount) { + error(name, bv, String.format( + "Value count = %d, but bit count = %d", + rowCount, bitCount)); +} +final UInt1Vector.Accessor ba = bv.getAccessor(); +for (int i = 0; i < bitCount; i++) { + final int value = ba.get(i); + if (value != 0 && value != 1) { +error(name, bv, String.format( +"%s %s: bit vector[%d] = %d, expected 0 or 1", +i, value)); + } } - } - - private void validateFixedWidthVector(String name, FixedWidthVector vector) { -// TODO Auto-generated method stub - } /** - * Obtain the list of errors. For use in unit-testing this class. - * @return the list of errors found, or null if error capture was - * not enabled + * Print a record batch. Uses code only available in a test build. + * Classes are not visible to the compiler; must load dynamically. + * Does nothing if the class is not available. */ - public List errors() { return errorList; } + public static void print(RecordBatch batch) { +try { + final Class helper = Class.forName("org.apache.drill.test.rowSet.RowSetUtilities"); Review comment: Checked `org.apache.drill.test.rowSet.RowSetUtilities` it does not have `print` method. Could you please check? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Validate batch checks, vector integretity in unit tests > --- > > Key: DRILL-7403 > URL: https://issues.apache.org/jira/browse/DRILL-7403 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.16.0, 1.17.0 >Rep
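The review comment above flags a `String.format` arity bug in the is-set ("bits") vector check: the template `"%s %s: bit vector[%d] = %d, expected 0 or 1"` declares four placeholders but only `i` and `value` are passed. A corrected standalone sketch of the same check follows; it is an illustration under assumed names (a plain `byte[]` stands in for the `UInt1Vector`), not Drill's actual `verifyIsSetVector`:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the is-set vector check: the bit count must match the row count,
// and every bit must be 0 or 1. The format template now takes exactly the
// three arguments supplied (name, index, value).
class IsSetVectorCheck {

  static List<String> validate(String name, byte[] bits, int rowCount) {
    List<String> errors = new ArrayList<>();
    if (bits.length != rowCount) {
      errors.add(String.format("Value count = %d, but bit count = %d",
          rowCount, bits.length));
    }
    for (int i = 0; i < bits.length; i++) {
      int value = bits[i];
      if (value != 0 && value != 1) {
        errors.add(String.format(
            "%s: bit vector[%d] = %d, expected 0 or 1", name, i, value));
      }
    }
    return errors;
  }
}
```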
[jira] [Commented] (DRILL-6096) Provide mechanisms to specify field delimiters and quoted text for TextRecordWriter
[ https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955965#comment-16955965 ] ASF GitHub Bot commented on DRILL-6096: --- arina-ielchiieva commented on issue #1873: DRILL-6096: Provide mechanism to configure text writer configuration URL: https://github.com/apache/drill/pull/1873#issuecomment-544469217 @paul-rogers thanks for the code review, addressed code review comments, force-pushed since there were minor changes in the code. Regarding design, the aim of this Jira was just to fix text writer to write proper text files: before if column contained field separator, field was not enclosed in the quotes, thus we were writing text files which Drill could not read. Now when user indicates write format using session option (this is common approach for all formats), Drill produces text files, it can read back. Basically, if user has configured format plugin: ``` "formats": { "csvh": { "type": "text", "extensions": [ "csvh" ], "lineDelimiter": "\n", "fieldDelimiter": ",", "extractHeader": true } }, ``` Drill will be able to read and write such text files correctly. Same approach is used for `parquet`, `json`. All user needs to do is to indicate write format using session option: `alter session set `store.format` = 'csvh';` (`parquet`, `json`). I am not saying this is ideal and we might need to reconsider such writing approach but I guess not in the scope of Jira since such re-design would touch all file writers. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Provide mechanisms to specify field delimiters and quoted text for > TextRecordWriter > --- > > Key: DRILL-6096 > URL: https://issues.apache.org/jira/browse/DRILL-6096 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.12.0 >Reporter: Kunal Khatua >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.17.0 > > > Currently, there is no way for a user to specify the field delimiter for the > writing records as a text output. Further more, if the fields contain the > delimiter, we have no mechanism of specifying quotes. > By default, quotes should be used to enclose non-numeric fields being written. > *Description of the implemented changes:* > 2 options are added to control text writer output: > {{store.text.writer.add_header}} - indicates if header should be added in > created text file. Default is true. > {{store.text.writer.force_quotes}} - indicates if all value should be quoted. > Default is false. It means only values that contain special characters (line > / field separators) will be quoted. > Line / field separators, quote / escape characters can be configured using > text format configuration using Web UI. User can create special format only > for writing data and then use it when creating files. Though such format can > be always used to read back written data. > {noformat} > "formats": { > "write_text": { > "type": "text", > "extensions": [ > "txt" > ], > "lineDelimiter": "\n", > "fieldDelimiter": "!", > "quote": "^", > "escape": "^", > } >}, > ... > {noformat} > Next set specified format and create text file: > {noformat} > alter session set `store.format` = 'write_text'; > create table dfs.tmp.t as select 1 as id from (values(1)); > {noformat} > Notes: > 1. 
To write data, univocity-parsers are used; they limit the line separator > length to at most 2 characters. Drill allows setting a line separator longer > than 2 characters, since Drill can read data split by a line separator of any > length, but in that case an exception will be thrown during data write. > 2. {{extractHeader}} in the text format configuration does not affect whether a header > will be written to the text file; only {{store.text.writer.add_header}} controls > this. {{extractHeader}} is used only when reading the data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
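The default quoting rule described in this issue (with {{store.text.writer.force_quotes}} = false, only values containing special characters are quoted) can be sketched without the univocity dependency. The delimiters below mirror the sample `write_text` format; the method and class names are illustrative, not Drill's `TextRecordWriter`:

```java
// Sketch of conditional quoting: a value is enclosed in the quote character
// only when forced, or when it contains the field delimiter, the line
// delimiter, or the quote character itself (which is escaped first).
class QuotingSketch {

  static String writeField(String value, char fieldDelim, String lineDelim,
                           char quote, char escape, boolean forceQuotes) {
    boolean needsQuotes = forceQuotes
        || value.indexOf(fieldDelim) >= 0
        || value.contains(lineDelim)
        || value.indexOf(quote) >= 0;
    if (!needsQuotes) {
      return value;
    }
    // Escape embedded quote characters, then enclose the whole value.
    String escaped = value.replace(String.valueOf(quote), "" + escape + quote);
    return quote + escaped + quote;
  }
}
```

With the `!` field delimiter and `^` quote/escape from the sample format, `a!b` becomes `^a!b^`, while `abc` passes through unquoted unless `force_quotes` is set.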
[jira] [Commented] (DRILL-6096) Provide mechanisms to specify field delimiters and quoted text for TextRecordWriter
[ https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955957#comment-16955957 ] ASF GitHub Bot commented on DRILL-6096: --- arina-ielchiieva commented on pull request #1873: DRILL-6096: Provide mechanism to configure text writer configuration URL: https://github.com/apache/drill/pull/1873#discussion_r336918266 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/writer/TextRecordWriter.java ## @@ -165,10 +182,12 @@ public void writeField() throws IOException { @Override public void cleanup() throws IOException { super.cleanup(); -if (stream != null) { - stream.close(); - stream = null; - logger.debug("closing file"); + +fRecordStarted = false; +if (writer != null) { + writer.close(); Review comment: Good point, will update the code. Caught `IllegalStateException`, since `writer.close()` can throw only this exception, and wrapped it into `IOException`, since the `WriterRecordBatch#closeWriter` method handles this type of exception properly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Provide mechanisms to specify field delimiters and quoted text for > TextRecordWriter > --- > > Key: DRILL-6096 > URL: https://issues.apache.org/jira/browse/DRILL-6096 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.12.0 >Reporter: Kunal Khatua >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.17.0 > > > Currently, there is no way for a user to specify the field delimiter for the > writing records as a text output. Further more, if the fields contain the > delimiter, we have no mechanism of specifying quotes.
> By default, quotes should be used to enclose non-numeric fields being written. > *Description of the implemented changes:* > 2 options are added to control text writer output: > {{store.text.writer.add_header}} - indicates if header should be added in > created text file. Default is true. > {{store.text.writer.force_quotes}} - indicates if all value should be quoted. > Default is false. It means only values that contain special characters (line > / field separators) will be quoted. > Line / field separators, quote / escape characters can be configured using > text format configuration using Web UI. User can create special format only > for writing data and then use it when creating files. Though such format can > be always used to read back written data. > {noformat} > "formats": { > "write_text": { > "type": "text", > "extensions": [ > "txt" > ], > "lineDelimiter": "\n", > "fieldDelimiter": "!", > "quote": "^", > "escape": "^", > } >}, > ... > {noformat} > Next set specified format and create text file: > {noformat} > alter session set `store.format` = 'write_text'; > create table dfs.tmp.t as select 1 as id from (values(1)); > {noformat} > Notes: > 1. To write data univocity-parsers are used, they limit line separator length > to not more than 2 characters, though Drill allows setting more 2 chars as > line separator since Drill can read data splitting by line separator of any > length, during data write exception will be thrown. > 2. {{extractHeader}} in text format configuration does not affect if header > will be written to text file, only {{store.text.writer.add_header}} controls > this action. {{extractHeader}} is used only when reading the data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-6096) Provide mechanisms to specify field delimiters and quoted text for TextRecordWriter
[ https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955901#comment-16955901 ] ASF GitHub Bot commented on DRILL-6096: --- arina-ielchiieva commented on pull request #1873: DRILL-6096: Provide mechanism to configure text writer configuration URL: https://github.com/apache/drill/pull/1873#discussion_r336918266 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/writer/TextRecordWriter.java ## @@ -165,10 +182,12 @@ public void writeField() throws IOException { @Override public void cleanup() throws IOException { super.cleanup(); -if (stream != null) { - stream.close(); - stream = null; - logger.debug("closing file"); + +fRecordStarted = false; +if (writer != null) { + writer.close(); Review comment: Good point, will update the code. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Provide mechanisms to specify field delimiters and quoted text for > TextRecordWriter > --- > > Key: DRILL-6096 > URL: https://issues.apache.org/jira/browse/DRILL-6096 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.12.0 >Reporter: Kunal Khatua >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.17.0 > > > Currently, there is no way for a user to specify the field delimiter for the > writing records as a text output. Further more, if the fields contain the > delimiter, we have no mechanism of specifying quotes. > By default, quotes should be used to enclose non-numeric fields being written. 
> *Description of the implemented changes:* > 2 options are added to control text writer output: > {{store.text.writer.add_header}} - indicates if header should be added in > created text file. Default is true. > {{store.text.writer.force_quotes}} - indicates if all value should be quoted. > Default is false. It means only values that contain special characters (line > / field separators) will be quoted. > Line / field separators, quote / escape characters can be configured using > text format configuration using Web UI. User can create special format only > for writing data and then use it when creating files. Though such format can > be always used to read back written data. > {noformat} > "formats": { > "write_text": { > "type": "text", > "extensions": [ > "txt" > ], > "lineDelimiter": "\n", > "fieldDelimiter": "!", > "quote": "^", > "escape": "^", > } >}, > ... > {noformat} > Next set specified format and create text file: > {noformat} > alter session set `store.format` = 'write_text'; > create table dfs.tmp.t as select 1 as id from (values(1)); > {noformat} > Notes: > 1. To write data univocity-parsers are used, they limit line separator length > to not more than 2 characters, though Drill allows setting more 2 chars as > line separator since Drill can read data splitting by line separator of any > length, during data write exception will be thrown. > 2. {{extractHeader}} in text format configuration does not affect if header > will be written to text file, only {{store.text.writer.add_header}} controls > this action. {{extractHeader}} is used only when reading the data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-6096) Provide mechanisms to specify field delimiters and quoted text for TextRecordWriter
[ https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955900#comment-16955900 ] ASF GitHub Bot commented on DRILL-6096: --- arina-ielchiieva commented on pull request #1873: DRILL-6096: Provide mechanism to configure text writer configuration URL: https://github.com/apache/drill/pull/1873#discussion_r336916683 ## File path: exec/java-exec/src/test/java/org/apache/drill/test/ClusterFixture.java ## @@ -57,10 +40,27 @@ import org.apache.drill.exec.store.mock.MockStorageEngineConfig; import org.apache.drill.exec.store.sys.store.provider.ZookeeperPersistentStoreProvider; import org.apache.drill.exec.util.StoragePluginTestUtils; - import org.apache.drill.shaded.guava.com.google.common.base.Charsets; import org.apache.drill.shaded.guava.com.google.common.base.Preconditions; +import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableMap; import org.apache.drill.shaded.guava.com.google.common.io.Resources; +import org.apache.drill.test.DrillTestWrapper.TestServices; + +import java.io.File; +import java.io.IOException; +import java.net.URI; +import java.net.URL; +import java.nio.file.Paths; +import java.sql.Connection; +import java.sql.DriverManager; +import java.sql.SQLException; +import java.util.ArrayList; +import java.util.Collection; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Optional; +import java.util.Properties; Review comment: I use standard IntelliJ IDEA imports order: https://user-images.githubusercontent.com/15086720/67193155-811dcc80-f3fd-11e9-8434-53b9a5f598cf.png";> Regarding updating the check styles, there is a Jira (https://issues.apache.org/jira/browse/DRILL-7352) where we can post comments. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Provide mechanisms to specify field delimiters and quoted text for > TextRecordWriter > --- > > Key: DRILL-6096 > URL: https://issues.apache.org/jira/browse/DRILL-6096 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.12.0 >Reporter: Kunal Khatua >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.17.0 > > > Currently, there is no way for a user to specify the field delimiter for the > writing records as a text output. Further more, if the fields contain the > delimiter, we have no mechanism of specifying quotes. > By default, quotes should be used to enclose non-numeric fields being written. > *Description of the implemented changes:* > 2 options are added to control text writer output: > {{store.text.writer.add_header}} - indicates if header should be added in > created text file. Default is true. > {{store.text.writer.force_quotes}} - indicates if all value should be quoted. > Default is false. It means only values that contain special characters (line > / field separators) will be quoted. > Line / field separators, quote / escape characters can be configured using > text format configuration using Web UI. User can create special format only > for writing data and then use it when creating files. Though such format can > be always used to read back written data. > {noformat} > "formats": { > "write_text": { > "type": "text", > "extensions": [ > "txt" > ], > "lineDelimiter": "\n", > "fieldDelimiter": "!", > "quote": "^", > "escape": "^", > } >}, > ... > {noformat} > Next set specified format and create text file: > {noformat} > alter session set `store.format` = 'write_text'; > create table dfs.tmp.t as select 1 as id from (values(1)); > {noformat} > Notes: > 1. 
To write data, univocity-parsers are used; they limit the line separator > length to at most 2 characters. Drill allows setting a line separator longer > than 2 characters, since Drill can read data split by a line separator of any > length, but in that case an exception will be thrown during data write. > 2. {{extractHeader}} in the text format configuration does not affect whether a header > will be written to the text file; only {{store.text.writer.add_header}} controls > this. {{extractHeader}} is used only when reading the data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (DRILL-5183) Drill doesn't seem to handle array values correctly in Parquet files
[ https://issues.apache.org/jira/browse/DRILL-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Guzenko resolved DRILL-5183. - Resolution: Fixed Fixed in DRILL-7268. > Drill doesn't seem to handle array values correctly in Parquet files > > > Key: DRILL-5183 > URL: https://issues.apache.org/jira/browse/DRILL-5183 > Project: Apache Drill > Issue Type: Bug >Reporter: Dave Kincaid >Assignee: Igor Guzenko >Priority: Major > Attachments: books.parquet > > > It looks to me that Drill is not properly converting array values in Parquet > records. I have created a simple example and will attach a simple Parquet > file to this issue. If I write Parquet records using the Avro schema > {code:title=Book.avsc} > { "type": "record", > "name": "Book", > "fields": [ > { "name": "title", "type": "string" }, > { "name": "pages", "type": "int" }, > { "name": "authors", "type": {"type": "array", "items": "string"} } > ] > } > {code} > I write two records using this schema into the attached Parquet file and then > simply run {{SELECT * FROM dfs.`books.parquet`}} I get the following result: > ||title||pages||authors|| > |Physics of Waves|477|{"array":["William C. Elmore","Mark A. Heald"]}| > |Foundations of Mathematical Analysis|428|{"array":["Richard > Johnsonbaugh","W.E. Pfaffenberger"]}| > You can see that the authors column seems to be a nested record with the name > "array" instead of being a repeated value. If I change the SQL query to > {{SELECT title,pages,t.authors.`array` FROM > dfs.`/home/davek/src/drill-parquet-example/resources/books.parquet` t;}} then > I get: > ||title||pages||EXPR$2|| > |Physics of Waves|477|["William C. Elmore","Mark A. Heald"]| > |Foundations of Mathematical Analysis|428|["Richard Johnsonbaugh","W.E. > Pfaffenberger"]| > and now that column behaves in Drill as a repeated values column. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-5183) Drill doesn't seem to handle array values correctly in Parquet files
[ https://issues.apache.org/jira/browse/DRILL-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Guzenko updated DRILL-5183: Fix Version/s: 1.17.0 > Drill doesn't seem to handle array values correctly in Parquet files > > > Key: DRILL-5183 > URL: https://issues.apache.org/jira/browse/DRILL-5183 > Project: Apache Drill > Issue Type: Bug >Reporter: Dave Kincaid >Assignee: Igor Guzenko >Priority: Major > Fix For: 1.17.0 > > Attachments: books.parquet > > > It looks to me that Drill is not properly converting array values in Parquet > records. I have created a simple example and will attach a simple Parquet > file to this issue. If I write Parquet records using the Avro schema > {code:title=Book.avsc} > { "type": "record", > "name": "Book", > "fields": [ > { "name": "title", "type": "string" }, > { "name": "pages", "type": "int" }, > { "name": "authors", "type": {"type": "array", "items": "string"} } > ] > } > {code} > I write two records using this schema into the attached Parquet file and then > simply run {{SELECT * FROM dfs.`books.parquet`}} I get the following result: > ||title||pages||authors|| > |Physics of Waves|477|{"array":["William C. Elmore","Mark A. Heald"]}| > |Foundations of Mathematical Analysis|428|{"array":["Richard > Johnsonbaugh","W.E. Pfaffenberger"]}| > You can see that the authors column seems to be a nested record with the name > "array" instead of being a repeated value. If I change the SQL query to > {{SELECT title,pages,t.authors.`array` FROM > dfs.`/home/davek/src/drill-parquet-example/resources/books.parquet` t;}} then > I get: > ||title||pages||EXPR$2|| > |Physics of Waves|477|["William C. Elmore","Mark A. Heald"]| > |Foundations of Mathematical Analysis|428|["Richard Johnsonbaugh","W.E. > Pfaffenberger"]| > and now that column behaves in Drill as a repeated values column. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (DRILL-5183) Drill doesn't seem to handle array values correctly in Parquet files
[ https://issues.apache.org/jira/browse/DRILL-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Guzenko reassigned DRILL-5183: --- Assignee: Igor Guzenko > Drill doesn't seem to handle array values correctly in Parquet files > > > Key: DRILL-5183 > URL: https://issues.apache.org/jira/browse/DRILL-5183 > Project: Apache Drill > Issue Type: Bug > Reporter: Dave Kincaid > Assignee: Igor Guzenko > Priority: Major > Attachments: books.parquet

[jira] [Updated] (DRILL-1999) Drill should expose the Parquet logical schema rather than the physical schema
[ https://issues.apache.org/jira/browse/DRILL-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Guzenko updated DRILL-1999: Fix Version/s: (was: Future) 1.17.0 > Drill should expose the Parquet logical schema rather than the physical schema > -- > > Key: DRILL-1999 > URL: https://issues.apache.org/jira/browse/DRILL-1999 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Reporter: Ramana Inukonda Nagaraj >Assignee: Igor Guzenko >Priority: Major > Fix For: 1.17.0 > > Attachments: hive_alltypes.parquet > > > Created a parquet file in hive having the following DDL > hive> desc alltypesparquet; > OK > c1 int > c2 boolean > c3 double > c4 string > c5 array > c6 map > c7 map > c8 struct > c9 tinyint > c10 smallint > c11 float > c12 bigint > c13 array> > c15 struct> > c16 array,n:int>> > Time taken: 0.076 seconds, Fetched: 15 row(s) > column5 which is an array of integers shows up as a bag when querying through > drill > 0: jdbc:drill:> select c5 from `/user/hive/warehouse/alltypesparquet`; > ++ > | c5 | > ++ > | {"bag":[]} | > | {"bag":[]} | > | {"bag":[{"array_element":1},{"array_element":2}]} | > ++ > 3 rows selected (0.085 seconds) > While from hive > hive> select c5 from alltypesparquet; > OK > NULL > NULL > [1,2] -- This message was sent by Atlassian Jira (v8.3.4#803005)
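The Hive output above shows the logical values Drill should expose. A minimal sketch of the mapping, assuming an empty bag corresponds to Hive's NULL (an inference from the side-by-side output; the function name is hypothetical, not Drill code):

```python
# Sketch of the mapping DRILL-1999 asks for: Hive-written Parquet
# lists surface in Drill as {"bag": [{"array_element": v}, ...]},
# while the logical value is [v, ...].
# bag_to_list is a hypothetical helper for illustration only.

def bag_to_list(value):
    if isinstance(value, dict) and set(value) == {"bag"}:
        elems = [e["array_element"] for e in value["bag"]]
        # Assumption: empty bag maps to NULL, matching the Hive output shown.
        return elems if elems else None
    return value

physical = {"bag": [{"array_element": 1}, {"array_element": 2}]}
# bag_to_list(physical) -> [1, 2], the value Hive reports for c5
```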
[jira] [Resolved] (DRILL-1999) Drill should expose the Parquet logical schema rather than the physical schema
[ https://issues.apache.org/jira/browse/DRILL-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Guzenko resolved DRILL-1999. - Resolution: Fixed Fixed in scope of DRILL-7268. > Drill should expose the Parquet logical schema rather than the physical schema > -- > > Key: DRILL-1999 > URL: https://issues.apache.org/jira/browse/DRILL-1999 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet > Reporter: Ramana Inukonda Nagaraj > Assignee: Igor Guzenko > Priority: Major > Fix For: Future > > Attachments: hive_alltypes.parquet
[jira] [Assigned] (DRILL-1999) Drill should expose the Parquet logical schema rather than the physical schema
[ https://issues.apache.org/jira/browse/DRILL-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Guzenko reassigned DRILL-1999: --- Assignee: Igor Guzenko > Drill should expose the Parquet logical schema rather than the physical schema > -- > > Key: DRILL-1999 > URL: https://issues.apache.org/jira/browse/DRILL-1999 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet > Reporter: Ramana Inukonda Nagaraj > Assignee: Igor Guzenko > Priority: Major > Fix For: Future > > Attachments: hive_alltypes.parquet
[jira] [Commented] (DRILL-7405) Build fails due to inaccessible apache-drill on S3 storage
[ https://issues.apache.org/jira/browse/DRILL-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955890#comment-16955890 ] ASF GitHub Bot commented on DRILL-7405: --- vvysotskyi commented on issue #1874: DRILL-7405: Avoiding download of TPC-H data URL: https://github.com/apache/drill/pull/1874#issuecomment-544422550 @paul-rogers, yes, these files are used in unit tests, mostly in the `java-exec` module. Currently, `contrib/data/tpch-sample-data` is built before `exec/Java Execution Engine`, so there shouldn't be any problems. The main reason for my proposal to use JitPack was to preserve the existing behavior and, as side effects, to avoid growing the project sources and complicating life for the version control system. If the consensus is that these files are small enough, let's continue with the current approach. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Build fails due to inaccessible apache-drill on S3 storage > -- > > Key: DRILL-7405 > URL: https://issues.apache.org/jira/browse/DRILL-7405 > Project: Apache Drill > Issue Type: Task > Components: Tools, Build & Test > Affects Versions: 1.16.0 > Reporter: Boaz Ben-Zvi > Assignee: Abhishek Girish > Priority: Critical > Fix For: 1.17.0 > > > A new clean build (e.g. after deleting the ~/.m2 local repository) would fail now due to: > Access denied to: > http://apache-drill.s3.amazonaws.com/files/sf-0.01_tpc-h_parquet_typed.tgz > (e.g., for the test data sf-0.01_tpc-h_parquet_typed.tgz ) > A new publicly available storage place is needed, plus appropriate changes in Drill to get to these resources.
[jira] [Commented] (DRILL-7177) Format Plugin for Excel Files
[ https://issues.apache.org/jira/browse/DRILL-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955880#comment-16955880 ] ASF GitHub Bot commented on DRILL-7177: --- arina-ielchiieva commented on pull request #1749: DRILL-7177: Format Plugin for Excel Files URL: https://github.com/apache/drill/pull/1749#discussion_r336901086 ## File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java ## @@ -0,0 +1,444 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.drill.exec.store.excel; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.common.types.TypeProtos; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework; +import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader; +import org.apache.drill.exec.physical.resultSet.ResultSetLoader; +import org.apache.drill.exec.physical.resultSet.RowSetLoader; +import org.apache.drill.exec.record.metadata.ColumnMetadata; +import org.apache.drill.exec.record.metadata.MetadataUtils; +import org.apache.drill.exec.record.metadata.SchemaBuilder; +import org.apache.drill.exec.record.metadata.TupleMetadata; +import org.apache.drill.exec.vector.accessor.ScalarWriter; +import org.apache.drill.exec.vector.accessor.TupleWriter; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.mapred.FileSplit; +import org.apache.poi.ss.usermodel.Cell; +import org.apache.poi.ss.usermodel.CellValue; +import org.apache.poi.ss.usermodel.DateUtil; +import org.apache.poi.ss.usermodel.FormulaEvaluator; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.xssf.usermodel.XSSFSheet; +import org.apache.poi.xssf.usermodel.XSSFWorkbook; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator; +import org.joda.time.Instant; +import java.util.Iterator; +import java.io.IOException; +import java.util.ArrayList; + +public class ExcelBatchReader implements ManagedReader { + private ExcelReaderConfig readerConfig; + + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class); + + private static final String SAFE_WILDCARD = "_$"; + + private static final String SAFE_SEPARATOR = "_"; + + private static final String PARSER_WILDCARD = ".*"; + + private static final String HEADER_NEW_LINE_REPLACEMENT = "__"; + + private static final String MISSING_FIELD_NAME_HEADER = 
"field_"; + + private XSSFSheet sheet; + + private XSSFWorkbook workbook; + + private FSDataInputStream fsStream; Review comment: ```suggestion private InputStream fsStream; ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Format Plugin for Excel Files > - > > Key: DRILL-7177 > URL: https://issues.apache.org/jira/browse/DRILL-7177 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.17.0 >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Major > Labels: doc-impacting > Fix For: 1.17.0 > > > This pull request adds the functionality which enables Drill to query > Microsoft Excel files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7177) Format Plugin for Excel Files
[ https://issues.apache.org/jira/browse/DRILL-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955879#comment-16955879 ] ASF GitHub Bot commented on DRILL-7177: --- arina-ielchiieva commented on pull request #1749: DRILL-7177: Format Plugin for Excel Files URL: https://github.com/apache/drill/pull/1749#discussion_r336900967 ## File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java ## @@ -0,0 +1,444 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.drill.exec.store.excel; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.common.types.TypeProtos; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework; +import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader; +import org.apache.drill.exec.physical.resultSet.ResultSetLoader; +import org.apache.drill.exec.physical.resultSet.RowSetLoader; +import org.apache.drill.exec.record.metadata.ColumnMetadata; +import org.apache.drill.exec.record.metadata.MetadataUtils; +import org.apache.drill.exec.record.metadata.SchemaBuilder; +import org.apache.drill.exec.record.metadata.TupleMetadata; +import org.apache.drill.exec.vector.accessor.ScalarWriter; +import org.apache.drill.exec.vector.accessor.TupleWriter; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.mapred.FileSplit; +import org.apache.poi.ss.usermodel.Cell; +import org.apache.poi.ss.usermodel.CellValue; +import org.apache.poi.ss.usermodel.DateUtil; +import org.apache.poi.ss.usermodel.FormulaEvaluator; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.xssf.usermodel.XSSFSheet; +import org.apache.poi.xssf.usermodel.XSSFWorkbook; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator; +import org.joda.time.Instant; +import java.util.Iterator; +import java.io.IOException; +import java.util.ArrayList; + +public class ExcelBatchReader implements ManagedReader { + private ExcelReaderConfig readerConfig; + + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class); + + private static final String SAFE_WILDCARD = "_$"; + + private static final String SAFE_SEPARATOR = "_"; + + private static final String PARSER_WILDCARD = ".*"; + + private static final String HEADER_NEW_LINE_REPLACEMENT = "__"; + + private static final String MISSING_FIELD_NAME_HEADER = 
"field_"; + + private XSSFSheet sheet; + + private XSSFWorkbook workbook; + + private FSDataInputStream fsStream; + + private FormulaEvaluator evaluator; + + private ArrayList excelFieldNames; + + private ArrayList columnWriters; + + private Iterator rowIterator; + + private RowSetLoader rowWriter; + + private int totalColumnCount; + + private int lineCount; + + private boolean firstLine; + + private FileSplit split; + + private ResultSetLoader loader; + + private int recordCount; + + public static class ExcelReaderConfig { +protected final ExcelFormatPlugin plugin; + +protected final int headerRow; + +protected final int lastRow; + +protected final int firstColumn; + +protected final int lastColumn; + +protected final boolean readAllFieldsAsVarChar; + +protected String sheetName; + +public ExcelReaderConfig(ExcelFormatPlugin plugin) { + this.plugin = plugin; + headerRow = plugin.getConfig().getHeaderRow(); + lastRow = plugin.getConfig().getLastRow(); + firstColumn = plugin.getConfig().getFirstColumn(); + lastColumn = plugin.getConfig().getLastColumn(); + readAllFieldsAsVarChar = plugin.getConfig().getReadAllFieldsAsVarChar(); + sheetName = plugin.getConfig().getSheetName(); +} + } + + public ExcelBatchReader(ExcelReaderConfig readerConfig) { +this.readerConfig = readerConfig; +firstLine = true; + } + + @Override + public boolean open(FileSchemaNegotiator negotiator) { +verifyConfigOptions(); +split = negotiator.split(); +loader = negotiator.build(); +rowWriter = loa
[jira] [Commented] (DRILL-4303) ESRI Shapefile (shp) format plugin
[ https://issues.apache.org/jira/browse/DRILL-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955877#comment-16955877 ] ASF GitHub Bot commented on DRILL-4303: --- arina-ielchiieva commented on pull request #1858: DRILL-4303: ESRI Shapefile (shp) Format Plugin URL: https://github.com/apache/drill/pull/1858#discussion_r336899443 ## File path: contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java ## @@ -0,0 +1,334 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.store.esri; + +import com.esri.core.geometry.Geometry; +import com.esri.core.geometry.GeometryCursor; +import com.esri.core.geometry.ShapefileReader; +import com.esri.core.geometry.SpatialReference; +import com.esri.core.geometry.ogc.OGCGeometry; +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.common.types.TypeProtos; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator; +import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader; +import org.apache.drill.exec.physical.resultSet.ResultSetLoader; +import org.apache.drill.exec.physical.resultSet.RowSetLoader; +import org.apache.drill.exec.record.metadata.ColumnMetadata; +import org.apache.drill.exec.record.metadata.MetadataUtils; +import org.apache.drill.exec.record.metadata.SchemaBuilder; +import org.apache.drill.exec.vector.accessor.ScalarWriter; +import org.apache.drill.exec.vector.accessor.TupleWriter; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.mapred.FileSplit; +import org.jamel.dbf.DbfReader; +import org.jamel.dbf.structure.DbfField; +import org.joda.time.Instant; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.BufferedReader; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.charset.Charset; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +public class ShpBatchReader implements ManagedReader { + + private FileSplit split; + private BufferedReader reader; + private ResultSetLoader loader; + private ShpReaderConfig readerConfig; + private Path hadoopShp; + private Path hadoopDbf; + private Path hadoopPrj; + private FSDataInputStream fileReaderShp = null; + private FSDataInputStream fileReaderDbf = null; + private FSDataInputStream fileReaderPrj = null; + private GeometryCursor geomCursor = null; + private DbfReader dbfReader = null; + private 
ScalarWriter gidWriter; + private ScalarWriter sridWriter; + private ScalarWriter shapeTypeWriter; + private ScalarWriter geomWriter; + private RowSetLoader rowWriter; + + + private int srid; + private SpatialReference spatialReference; + private static final Logger logger = LoggerFactory.getLogger(ShpBatchReader.class); + + public static class ShpReaderConfig { +protected final ShpFormatPlugin plugin; + +public ShpReaderConfig(ShpFormatPlugin plugin) { + this.plugin = plugin; +} + } + + public ShpBatchReader(ShpReaderConfig readerConfig) { +this.readerConfig = readerConfig; + } + + @Override + public boolean open(FileSchemaNegotiator negotiator) { +this.split = negotiator.split(); +this.hadoopShp = split.getPath(); +this.hadoopDbf = new Path(split.getPath().toString().replace("shp", "dbf")); +this.hadoopPrj = new Path(split.getPath().toString().replace("shp", "prj")); + +openFile(negotiator); +SchemaBuilder builder = new SchemaBuilder(); +builder.addNullable("gid", TypeProtos.MinorType.INT); +builder.addNullable("srid", TypeProtos.MinorType.INT); +builder.addNullable("shapeType", TypeProtos.MinorType.VARCHAR); +builder.addNullable("geom", TypeProtos.MinorType.VARBINARY); + +negotiator.setTableSchema(builder.buildSchema(), false); +loader = negotiator.build(); + +rowWriter = loader.writer(); +gidWriter = rowWriter.scalar("gid"); +sridWriter = rowWriter.scalar("srid"); +shapeTypeWriter = rowWriter.scalar("shapeType"); +geomWriter = rowWriter.scalar("geom"); + +return true; + }
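One detail worth noting in open() above: the sibling .dbf and .prj paths are derived with String.replace("shp", ...), which rewrites every occurrence of "shp" in the path, not just the extension. A sketch of extension-only replacement (the helper name is hypothetical, not code from the PR):

```python
# Sketch: derive a sibling file path by swapping only the final
# extension, avoiding the pitfall of a blanket substring replace.

def sibling(path: str, new_ext: str) -> str:
    base, dot, _ = path.rpartition(".")
    return f"{base}.{new_ext}" if dot else path

print(sibling("/data/shp/roads.shp", "dbf"))            # /data/shp/roads.dbf
print("/data/shp/roads.shp".replace("shp", "dbf"))      # /data/dbf/roads.dbf (wrong directory)
```

The second line shows how the blanket replace also rewrites a directory component named "shp", pointing the reader at a nonexistent file.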