[jira] [Commented] (DRILL-6096) Provide mechanisms to specify field delimiters and quoted text for TextRecordWriter
[ https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956671#comment-16956671 ] ASF GitHub Bot commented on DRILL-6096: --- paul-rogers commented on issue #1873: DRILL-6096: Provide mechanism to configure text writer configuration URL: https://github.com/apache/drill/pull/1873#issuecomment-544797392 To answer my other comment about the lack of ease-of-use in the current session options related to file formats: the answer is the schema mechanism you defined a while back. Even if no schema is provided, we should allow the user to override formatting properties using the table options encoded in that schema. Then, I can write a file using, say, pipe delimiters, have that be recorded, and read the file back automatically using those delimiters. Probably work to be done, but it would be a nice solution that does not require HMS. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Provide mechanisms to specify field delimiters and quoted text for > TextRecordWriter > --- > > Key: DRILL-6096 > URL: https://issues.apache.org/jira/browse/DRILL-6096 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.12.0 >Reporter: Kunal Khatua >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.17.0 > > > Currently, there is no way for a user to specify the field delimiter when > writing records as text output. Furthermore, if the fields contain the > delimiter, we have no mechanism for specifying quotes. > By default, quotes should be used to enclose non-numeric fields being written. 
> *Description of the implemented changes:* > 2 options are added to control text writer output: > {{store.text.writer.add_header}} - indicates whether a header should be added to the > created text file. Default is true. > {{store.text.writer.force_quotes}} - indicates whether all values should be quoted. > Default is false, meaning only values that contain special characters (line > / field separators) will be quoted. > Line / field separators and quote / escape characters can be configured in the text > format configuration via the Web UI. A user can create a special format just for > writing data and then use it when creating files; such a format can always be used > to read back the written data. > {noformat} > "formats": { > "write_text": { > "type": "text", > "extensions": [ > "txt" > ], > "lineDelimiter": "\n", > "fieldDelimiter": "!", > "quote": "^", > "escape": "^" > } > }, > ... > {noformat} > Next, set the specified format and create a text file: > {noformat} > alter session set `store.format` = 'write_text'; > create table dfs.tmp.t as select 1 as id from (values(1)); > {noformat} > Notes: > 1. univocity-parsers are used to write the data; they limit the line separator > length to at most 2 characters. Drill allows setting a line separator longer than > 2 characters, since Drill can read data split by a line separator of any length, > but an exception will be thrown during data write. > 2. {{extractHeader}} in the text format configuration does not affect whether a > header is written to the text file; only {{store.text.writer.add_header}} controls > this. {{extractHeader}} is used only when reading data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7352) Introduce new checkstyle rules to make code style more consistent
[ https://issues.apache.org/jira/browse/DRILL-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956669#comment-16956669 ] Paul Rogers commented on DRILL-7352: Start with the [existing set of rules|http://drill.apache.org/docs/apache-drill-contribution-guidelines/]. * Import order. Typical order: `java`, `javax`, `org`, `com`. Static imports at the top. * Use `final` aggressively on fields; do not use it on local variables or parameters. Once decisions are finalized, update the format files for Eclipse and IntelliJ. > Introduce new checkstyle rules to make code style more consistent > - > > Key: DRILL-7352 > URL: https://issues.apache.org/jira/browse/DRILL-7352 > Project: Apache Drill > Issue Type: Task >Reporter: Vova Vysotskyi >Priority: Major > Fix For: 1.17.0 > > > Source - https://checkstyle.sourceforge.io/checks.html > List of rules to be enabled: > * [LeftCurly|https://checkstyle.sourceforge.io/config_blocks.html#LeftCurly] > - force placement of a left curly brace at the end of the line. 
> * > [RightCurly|https://checkstyle.sourceforge.io/config_blocks.html#RightCurly] > - force placement of a right curly brace > * > [NewlineAtEndOfFile|https://checkstyle.sourceforge.io/config_misc.html#NewlineAtEndOfFile] > * > [UnnecessaryParentheses|https://checkstyle.sourceforge.io/config_coding.html#UnnecessaryParentheses] > * > [MethodParamPad|https://checkstyle.sourceforge.io/config_whitespace.html#MethodParamPad] > * [InnerTypeLast|https://checkstyle.sourceforge.io/config_design.html#InnerTypeLast] > * > [MissingOverride|https://checkstyle.sourceforge.io/config_annotation.html#MissingOverride] > * > [InvalidJavadocPosition|https://checkstyle.sourceforge.io/config_javadoc.html#InvalidJavadocPosition] > * > [ArrayTypeStyle|https://checkstyle.sourceforge.io/config_misc.html#ArrayTypeStyle] > * [UpperEll|https://checkstyle.sourceforge.io/config_misc.html#UpperEll] > and others
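Several of the rules under discussion can be seen together in one short, purely illustrative class (hypothetical names, not from the Drill codebase). It follows the proposed `java`/`javax`/`org`/`com` import order, uses `final` on fields but not on locals or parameters, and satisfies LeftCurly, MissingOverride, ArrayTypeStyle, and UpperEll:

```java
// Illustrative sketch only. Import order: java, javax, org, com;
// static imports at the top (none needed here).
import java.util.ArrayList;
import java.util.List;

public class StyleExample {

  // `final` used aggressively on fields, per the proposal.
  private final List<String> names = new ArrayList<>();

  // LeftCurly: brace at the end of the line.
  // MissingOverride: the annotation is required on overriding methods.
  @Override
  public String toString() {
    return String.join(",", names);
  }

  // ArrayTypeStyle: `String[] items`, never `String items[]`.
  public long count(String[] items) {
    // UpperEll: long literals use `L`, not lowercase `l`.
    long total = 0L;
    // No `final` on locals or parameters, per the proposal.
    for (String item : items) {
      names.add(item);
      total++;
    }
    return total;
  }
}
```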
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956659#comment-16956659 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337321627 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FormatSelection.java ## @@ -63,6 +60,6 @@ public FileSelection getSelection(){ @JsonIgnore public boolean supportDirPruning() { Review comment: As above, `support` --> `supports`. It is safe because this value is not serialized. > Drill should support .zip compression > - > > Key: DRILL-5674 > URL: https://issues.apache.org/jira/browse/DRILL-5674 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting > Fix For: 1.17.0 > > > Zip is a very common compression format. Create a compressed CSV file with > column headers: data.csv.zip. > Define a storage plugin config for the file, call it "dfs.myws", set > delimiter = ",", extract header = true, skip header = false. > Run a simple query: > SELECT * FROM dfs.myws.`data.csv.zip` > The result is garbage as the CSV reader is trying to parse Zipped data as if > it were text. > DRILL-5506 asks how to do this; the responder said to add a library to the > path. Better would be to simply support zip out-of-the-box as a default > format.
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956661#comment-16956661 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337322883 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcapng/PcapngFormatPlugin.java ## @@ -47,7 +47,7 @@ public PcapngFormatPlugin(String name, DrillbitContext context, Configuration fs public PcapngFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, PcapngFormatConfig formatPluginConfig) { super(name, context, fsConf, config, formatPluginConfig, true, -false, true, false, +false, true, true, Review comment: Isn't the middle `true` wrong? It is for `blockSplittable`, which means we'll start reading at an arbitrary block boundary. Since this is a binary format, it is not clear that we can scan forward to the beginning of the next record as can be done in Sequence File and (restricted) CSV. This also creates an issue: the block-splittable attribute is currently a constant, but if a file is zip-encoded it is never block splittable, since Zip files cannot be read at an arbitrary offset. Any way to handle this fact? And any way to test this behaviour?
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956657#comment-16956657 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337321291 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java ## @@ -386,17 +387,16 @@ public static void checkBackPaths(String parent, String combinedPath, String sub Preconditions.checkArgument(!combinedPath.isEmpty(), "Empty path (" + combinedPath + "( in file selection path."); if (!combinedPath.startsWith(parent)) { - StringBuilder msg = new StringBuilder(); - msg.append("Invalid path : ").append(subpath).append(" takes you outside the workspace."); - throw new IllegalArgumentException(msg.toString()); + throw new IllegalArgumentException( +String.format("Invalid path [%s] takes you outside the workspace.", subpath)); } } public List getFileStatuses() { return statuses; } - public boolean supportDirPrunig() { + public boolean supportDirPruning() { Review comment: Good catch. How about `supportsDirPruning` (with an s)? The `support` form is imperative; it tells this object to support dir pruning. The `supports` form asks whether this object does or does not support dir pruning.
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956662#comment-16956662 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337322300 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java ## @@ -59,33 +51,44 @@ import org.apache.drill.exec.store.dfs.FormatSelection; import org.apache.drill.exec.store.dfs.MagicString; import org.apache.drill.exec.store.dfs.MetadataContext; -import org.apache.drill.exec.store.mock.MockStorageEngine; import org.apache.drill.exec.store.parquet.metadata.Metadata; import org.apache.drill.exec.store.parquet.metadata.ParquetTableMetadataDirs; import org.apache.drill.exec.util.DrillFileSystemUtil; import org.apache.drill.shaded.guava.com.google.common.base.Stopwatch; import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableSet; -import org.apache.drill.shaded.guava.com.google.common.collect.Lists; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FSDataInputStream; import org.apache.hadoop.fs.FileStatus; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.parquet.format.converter.ParquetMetadataConverter; import org.apache.parquet.hadoop.ParquetFileWriter; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.io.InputStream; +import java.util.Arrays; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.TimeUnit; +import java.util.regex.Pattern; Review comment: Maybe change your IDE import order to put java above org? That way, there won't be constant import shuffling each time your IDE touches a file. 
(Yes, we should decide on a preferred order and document it somewhere...)
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956660#comment-16956660 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337321551 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSystemPlugin.java ## @@ -57,7 +61,9 @@ */ public class FileSystemPlugin extends AbstractStoragePlugin { - private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(FileSystemPlugin.class); + private static final Logger logger = LoggerFactory.getLogger(FileSystemPlugin.class); + + private static final List BUILT_IN_CODECS = Collections.singletonList(ZipCodec.class.getCanonicalName()); Review comment: Are no other codecs provided "out of the box"? For the others, do I need to provide a jar and set a config option? Or should we move the other built-in ones here, out of the config file?
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956664#comment-16956664 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337321942 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/ZipCodec.java ## @@ -0,0 +1,141 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.store.dfs; + +import org.apache.hadoop.io.compress.CompressionInputStream; +import org.apache.hadoop.io.compress.CompressionOutputStream; +import org.apache.hadoop.io.compress.DefaultCodec; + +import java.io.IOException; +import java.io.InputStream; +import java.io.OutputStream; +import java.util.zip.ZipEntry; +import java.util.zip.ZipInputStream; +import java.util.zip.ZipOutputStream; + +/** + * ZIP codec implementation which can read or create a single entry. + * + * Note: Do not rename this class. Class naming must be 'ZipCodec' so it can be mapped by + * {@link org.apache.hadoop.io.compress.CompressionCodecFactory} to the 'zip' extension. 
+ */ +public class ZipCodec extends DefaultCodec { + + private static final String EXTENSION = ".zip"; + + @Override + public CompressionOutputStream createOutputStream(OutputStream out) throws IOException { +return new ZipCompressionOutputStream(new ResetableZipOutputStream(out)); + } + + @Override + public CompressionInputStream createInputStream(InputStream in) throws IOException { +return new ZipCompressionInputStream(new ZipInputStream(in)); + } + + @Override + public String getDefaultExtension() { +return EXTENSION; + } + + /** + * Reads only first entry from {@link ZipInputStream}, + * other entries if present will be ignored. + */ + private static class ZipCompressionInputStream extends CompressionInputStream { + +ZipCompressionInputStream(ZipInputStream in) throws IOException { + super(in); + // positions stream at the beginning of the first entry data + in.getNextEntry(); +} + +@Override +public int read() throws IOException { + return in.read(); +} + +@Override +public int read(byte[] b, int off, int len) throws IOException { + return in.read(b, off, len); +} + +@Override +public void resetState() throws IOException { + in.reset(); +} + +@Override +public void close() throws IOException { + try { +((ZipInputStream) in).closeEntry(); + } finally { +super.close(); + } +} + } + + /** + * Extends {@link ZipOutputStream} to allow resetting compressor stream, + * required by {@link CompressionOutputStream} implementation. + */ + private static class ResetableZipOutputStream extends ZipOutputStream { + +ResetableZipOutputStream(OutputStream out) { + super(out); +} + +void resetState() { + def.reset(); +} + } + + /** + * Writes given data into ZIP archive by placing all data in one entry with default naming. 
+ */ + private static class ZipCompressionOutputStream extends CompressionOutputStream { + +private static final String DEFAULT_ENTRY_NAME = "entry.out"; Review comment: Should the entry name be the same as the file name so it is sensible if someone unzips the file?
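The single-entry behavior under review can be sketched with plain `java.util.zip`, without Hadoop's codec interfaces (class, method, and entry names here are illustrative, not Drill's): all data is written as one entry, and on read only the first entry is consumed, mirroring what `ZipCodec`'s streams do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipRoundTrip {

  // Write all data as a single entry, as ZipCodec's output stream does.
  public static byte[] zip(String entryName, byte[] data) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ZipOutputStream zos = new ZipOutputStream(bos)) {
      zos.putNextEntry(new ZipEntry(entryName));
      zos.write(data);
      zos.closeEntry();
    }
    return bos.toByteArray();
  }

  // Read only the first entry; any further entries are ignored,
  // as in ZipCodec's input stream.
  public static byte[] unzipFirstEntry(byte[] zipped) throws IOException {
    try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipped))) {
      zis.getNextEntry(); // position the stream at the first entry's data
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      int n;
      while ((n = zis.read(buf)) != -1) {
        bos.write(buf, 0, n);
      }
      return bos.toByteArray();
    }
  }
}
```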
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956663#comment-16956663 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337322969 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/pcapng/package-info.java ## @@ -16,7 +16,7 @@ * limitations under the License. */ /** - * For comments on realization of this format plugin look at : + * For comments on implementation of this format plugin look at: Review comment: "look at" --> "see"
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956658#comment-16956658 ] ASF GitHub Bot commented on DRILL-5674: --- paul-rogers commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879#discussion_r337321743 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/ZipCodec.java ## @@ -0,0 +1,141 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.store.dfs; + +import org.apache.hadoop.io.compress.CompressionInputStream; +import org.apache.hadoop.io.compress.CompressionOutputStream; +import org.apache.hadoop.io.compress.DefaultCodec; + +import java.io.IOException; +import java.io.InputStream; +import java.io.OutputStream; +import java.util.zip.ZipEntry; +import java.util.zip.ZipInputStream; +import java.util.zip.ZipOutputStream; + +/** + * ZIP codec implementation which can read or create a single entry. + * + * Note: Do not rename this class. Class naming must be 'ZipCodec' so it can be mapped by + * {@link org.apache.hadoop.io.compress.CompressionCodecFactory} to the 'zip' extension. + */ +public class ZipCodec extends DefaultCodec { + + private static final String EXTENSION = ".zip"; Review comment: Any need to support gzip (`.gz`) or tar/gzip (`.tar.gz`)?
[jira] [Commented] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956646#comment-16956646 ] ASF GitHub Bot commented on DRILL-7414: --- paul-rogers commented on pull request #1878: DRILL-7414: EVF incorrectly sets buffer writer index after rollover URL: https://github.com/apache/drill/pull/1878#discussion_r337311938 ## File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/resultSet/impl/TestResultSetLoaderOverflow.java ## @@ -59,55 +61,64 @@ @Test public void testVectorSizeLimit() { -TupleMetadata schema = new SchemaBuilder() +final TupleMetadata schema = new SchemaBuilder() Review comment: Sorry, too much functional programming with Scala in my "real job"; I have gotten used to marking variables `val` rather than `var`, and `final` is the Java equivalent. But since Drill does not normally use this convention, I removed the unneeded `final` keywords. You are right; if there is a performance benefit, the compiler will figure out that the variable is never modified. > EVF incorrectly sets buffer writer index after rollover > --- > > Key: DRILL-7414 > URL: https://issues.apache.org/jira/browse/DRILL-7414 > Project: Apache Drill > Issue Type: Bug >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > A full test run, with vector validation enabled and with the "new" scan > enabled, revealed the following in {{TestMockPlugin.testSizeLimit()}}: > {noformat} > comments_s2 - VarCharVector: Row count = 838, but value count = 839 > {noformat} > Adding vector validation to the result set loader overflow tests reveals that > the problem is in overflow. 
In > {{TestResultSetLoaderOverflow.testOverflowWithNullables()}}: > {noformat} > a - RepeatedIntVector: Row count = 2952, but value count = 2953 > b - RepeatedVarCharVector: Row count = 2952, but value count = 2953 > b - RepeatedVarCharVector: Vector has 2953 values, but offset vector labels > 32472 values > c - RepeatedIntVector: Row count = 2952, but value count = 2953 > d - RepeatedIntVector: Row count = 2952, but value count = 2953 > {noformat} > The problem is that EVF incorrectly sets the offset buffer writer index after > a rollover.
[jira] [Commented] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956647#comment-16956647 ] ASF GitHub Bot commented on DRILL-7414: --- paul-rogers commented on pull request #1878: DRILL-7414: EVF incorrectly sets buffer writer index after rollover URL: https://github.com/apache/drill/pull/1878#discussion_r337312721 ## File path: exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/OffsetVectorWriterImpl.java ## @@ -290,7 +290,7 @@ public void preRollover() { // rows. But, this being an offset vector, we add one to account // for the extra 0 value at the start. -setValueCount(vectorIndex.rowStartIndex() + 1); +setValueCount(vectorIndex.rowStartIndex()); Review comment: Updated the comment as it was too subtle. The row start index is already at the proper index, it points past the last valid value (it points to where we'd add the next value, if we had one.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > EVF incorrectly sets buffer writer index after rollover > --- > > Key: DRILL-7414 > URL: https://issues.apache.org/jira/browse/DRILL-7414 > Project: Apache Drill > Issue Type: Bug >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > A full test run, with vector validation enabled and with the "new" scan > enabled, revealed the following in {{TestMockPlugin.testSizeLimit()}}: > {noformat} > comments_s2 - VarCharVector: Row count = 838, but value count = 839 > {noformat} > Adding vector validation to the result set loader overflow tests reveals that > the problem is in overflow. 
In > {{TestResultSetLoaderOverflow.testOverflowWithNullables()}}: > {noformat} > a - RepeatedIntVector: Row count = 2952, but value count = 2953 > b - RepeatedVarCharVector: Row count = 2952, but value count = 2953 > b - RepeatedVarCharVector: Vector has 2953 values, but offset vector labels > 32472 values > c - RepeatedIntVector: Row count = 2952, but value count = 2953 > d - RepeatedIntVector: Row count = 2952, but value count = 2953 > {noformat} > The problem is that EVF incorrectly sets the offset buffer writer index after > a rollover. -- This message was sent by Atlassian Jira (v8.3.4#803005)
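The one-character fix above hinges on the offset-vector invariant: a vector holding N variable-width values stores N + 1 offsets (an extra leading 0), and after rollover the row start index already points one past the last valid entry, so no "+ 1" adjustment is needed. A minimal standalone sketch of that invariant (plain Java, not Drill's actual vector classes; names are illustrative):

```java
import java.util.Arrays;

// Hypothetical stand-in for an offset vector: for N variable-width values
// it stores N + 1 offsets, with entry 0 always 0 and entry N marking the
// end of the last value.
public class OffsetSketch {

    static int[] offsetsFor(String... values) {
        int[] offsets = new int[values.length + 1];
        offsets[0] = 0;  // the extra leading zero
        for (int i = 0; i < values.length; i++) {
            // Each entry is the running end position of the i-th value.
            offsets[i + 1] = offsets[i] + values[i].length();
        }
        return offsets;
    }

    public static void main(String[] args) {
        int[] offsets = offsetsFor("a", "bb", "ccc");
        // 3 values -> 4 offset entries. The next write position after the
        // last value is index values.length + 1 in the offset array, i.e.
        // a "row start" of N already points past the last valid value.
        System.out.println(Arrays.toString(offsets));  // [0, 1, 3, 6]
    }
}
```

The length of value i can be recovered as `offsets[i + 1] - offsets[i]`, which is why the value count must be exactly one less than the offset count — the mismatch the validator reported above.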
[jira] [Commented] (DRILL-7403) Validate batch checks, vector integrity in unit tests
[ https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956609#comment-16956609 ] ASF GitHub Bot commented on DRILL-7403: --- paul-rogers commented on issue #1871: DRILL-7403: Validate batch checks, vector integretity in unit tests URL: https://github.com/apache/drill/pull/1871#issuecomment-544779743 Thanks much for the review! Made requested changes. Rebased on master. Squashed commits. Once this is merged, I'll update the two new PRs to eliminate the commits duplicated with this one. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Validate batch checks, vector integretity in unit tests > --- > > Key: DRILL-7403 > URL: https://issues.apache.org/jira/browse/DRILL-7403 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.16.0, 1.17.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > Drill provides a {{BatchValidator}} that checks vectors. It is disabled by > default. This enhancement adds more checks, including checks for row counts > (of which there are surprisingly many.) > Since most operators will fail if the check is enabled, this enhancement also > adds a table to keep track of which operators pass the checks (and for which > checks should be enabled) and those that still need work. This allows the > checks to exist in the code, and to be enabled incrementally as we fix the > various problems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7403) Validate batch checks, vector integrity in unit tests
[ https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956608#comment-16956608 ] ASF GitHub Bot commented on DRILL-7403: --- paul-rogers commented on issue #1871: DRILL-7403: Validate batch checks, vector integretity in unit tests URL: https://github.com/apache/drill/pull/1871#issuecomment-544779743 Made requested changes. Rebased on master. Squashed commits. Once this is merged, I'll update the two new PRs to eliminate the commits duplicated with this one. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Validate batch checks, vector integretity in unit tests > --- > > Key: DRILL-7403 > URL: https://issues.apache.org/jira/browse/DRILL-7403 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.16.0, 1.17.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > Drill provides a {{BatchValidator}} that checks vectors. It is disabled by > default. This enhancement adds more checks, including checks for row counts > (of which there are surprisingly many.) > Since most operators will fail if the check is enabled, this enhancement also > adds a table to keep track of which operators pass the checks (and for which > checks should be enabled) and those that still need work. This allows the > checks to exist in the code, and to be enabled incrementally as we fix the > various problems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7403) Validate batch checks, vector integrity in unit tests
[ https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956602#comment-16956602 ] ASF GitHub Bot commented on DRILL-7403: --- paul-rogers commented on pull request #1871: DRILL-7403: Validate batch checks, vector integretity in unit tests URL: https://github.com/apache/drill/pull/1871#discussion_r337308769 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/validate/BatchValidator.java ## @@ -150,57 +424,63 @@ private int validateOffsetVector(String name, UInt4Vector offsetVector, int valu error(name, offsetVector, "Offset (0) must be 0 but was " + prevOffset); } -// Note <= comparison: offset vectors have (n+1) entries. - -for (int i = 1; i <= valueCount; i++) { - int offset = accessor.get(i); +for (int i = 1; i < offsetCount; i++) { + final int offset = accessor.get(i); if (offset < prevOffset) { -error(name, offsetVector, "Decreasing offsets at (" + (i-1) + ", " + i + ") = (" + prevOffset + ", " + offset + ")"); +error(name, offsetVector, String.format( +"Offset vector [%d] contained %d, expected >= %d", +i, offset, prevOffset)); } else if (offset > maxOffset) { -error(name, offsetVector, "Invalid offset at index " + i + " = " + offset + " exceeds maximum of " + maxOffset); +error(name, offsetVector, String.format( +"Invalid offset at index %d: %d exceeds maximum of %d", +i, offset, maxOffset)); } prevOffset = offset; } return prevOffset; } private void error(String name, ValueVector vector, String msg) { -if (errorCount == 0) { - logger.error("Found one or more vector errors from " + batch.getClass().getSimpleName()); -} -errorCount++; -if (errorCount >= MAX_ERRORS) { - return; -} -String fullMsg = "Column " + name + " of type " + vector.getClass().getSimpleName( ) + ": " + msg; -logger.error(fullMsg); -if (errorList != null) { - errorList.add(fullMsg); -} +errorReporter.error(name, vector, msg); } - private void validateNullableVector(String name, 
NullableVector vector) { -// Can't validate at this time because the bits vector is in each -// generated subtype. - -// Validate a VarChar vector because it is common. - -if (vector instanceof NullableVarCharVector) { - VarCharVector values = ((NullableVarCharVector) vector).getValuesVector(); - validateVarCharVector(name + "-values", values, rowCount); + private void verifyIsSetVector(ValueVector parent, UInt1Vector bv) { +final String name = String.format("%s (%s)-bits", +parent.getField().getName(), +parent.getClass().getSimpleName()); +final int rowCount = parent.getAccessor().getValueCount(); +final int bitCount = bv.getAccessor().getValueCount(); +if (bitCount != rowCount) { + error(name, bv, String.format( + "Value count = %d, but bit count = %d", + rowCount, bitCount)); +} +final UInt1Vector.Accessor ba = bv.getAccessor(); +for (int i = 0; i < bitCount; i++) { + final int value = ba.get(i); + if (value != 0 && value != 1) { +error(name, bv, String.format( +"%s %s: bit vector[%d] = %d, expected 0 or 1", +i, value)); + } } - } - - private void validateFixedWidthVector(String name, FixedWidthVector vector) { -// TODO Auto-generated method stub - } /** - * Obtain the list of errors. For use in unit-testing this class. - * @return the list of errors found, or null if error capture was - * not enabled + * Print a record batch. Uses code only available in a test build. + * Classes are not visible to the compiler; must load dynamically. + * Does nothing if the class is not available. */ - public List errors() { return errorList; } + public static void print(RecordBatch batch) { +try { + final Class helper = Class.forName("org.apache.drill.test.rowSet.RowSetUtilities"); Review comment: Removed for now, will add back later (and fix any issues) if/when needed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Validate batch checks, vector integretity in unit tests > --- > > Key: DRILL-7403 > URL: https://issues.apache.org/jira/browse/DRILL-7403 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.16.0, 1.17.0 >Reporter: Paul Rogers >Assignee: Pa
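The patch in the review above tightens three offset-vector checks: the first offset must be 0, offsets must never decrease, and no offset may exceed the data buffer's capacity. A hedged, self-contained sketch of those checks (not Drill's actual `BatchValidator`; error wording follows the patch but the class is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetChecks {

    // Validates an offset array in the spirit of validateOffsetVector():
    // offsets start at 0, are monotonically non-decreasing, and stay
    // within maxOffset. Returns human-readable errors.
    static List<String> validate(int[] offsets, int maxOffset) {
        List<String> errors = new ArrayList<>();
        if (offsets.length > 0 && offsets[0] != 0) {
            errors.add("Offset (0) must be 0 but was " + offsets[0]);
        }
        int prev = 0;
        for (int i = 1; i < offsets.length; i++) {
            int offset = offsets[i];
            if (offset < prev) {
                errors.add(String.format(
                    "Offset vector [%d] contained %d, expected >= %d",
                    i, offset, prev));
            } else if (offset > maxOffset) {
                errors.add(String.format(
                    "Invalid offset at index %d: %d exceeds maximum of %d",
                    i, offset, maxOffset));
            }
            prev = offset;
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validate(new int[] {0, 1, 3, 6}, 6));  // valid: no errors
        System.out.println(validate(new int[] {0, 4, 2}, 6));     // decreasing offset
        System.out.println(validate(new int[] {0, 1, 9}, 6));     // exceeds maximum
    }
}
```

Note the loop bound: iterating over the full offset count (N + 1 entries for N values) rather than the value count is exactly the `<=` vs `<` subtlety the original code commented on.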
[jira] [Closed] (DRILL-7417) Test Task
[ https://issues.apache.org/jira/browse/DRILL-7417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sorabh Hamirwasia closed DRILL-7417. Resolution: Invalid > Test Task > - > > Key: DRILL-7417 > URL: https://issues.apache.org/jira/browse/DRILL-7417 > Project: Apache Drill > Issue Type: Task >Reporter: Sorabh Hamirwasia >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7417) Test Task
[ https://issues.apache.org/jira/browse/DRILL-7417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sorabh Hamirwasia updated DRILL-7417: - Attachment: Test.rtf > Test Task > - > > Key: DRILL-7417 > URL: https://issues.apache.org/jira/browse/DRILL-7417 > Project: Apache Drill > Issue Type: Task >Reporter: Sorabh Hamirwasia >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7417) Test Task
Sorabh Hamirwasia created DRILL-7417: Summary: Test Task Key: DRILL-7417 URL: https://issues.apache.org/jira/browse/DRILL-7417 Project: Apache Drill Issue Type: Task Reporter: Sorabh Hamirwasia -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7417) Test Task
[ https://issues.apache.org/jira/browse/DRILL-7417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sorabh Hamirwasia updated DRILL-7417: - Attachment: (was: Test.rtf) > Test Task > - > > Key: DRILL-7417 > URL: https://issues.apache.org/jira/browse/DRILL-7417 > Project: Apache Drill > Issue Type: Task >Reporter: Sorabh Hamirwasia >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (DRILL-7405) Build fails due to inaccessible apache-drill on S3 storage
[ https://issues.apache.org/jira/browse/DRILL-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Girish closed DRILL-7405. -- > Build fails due to inaccessible apache-drill on S3 storage > -- > > Key: DRILL-7405 > URL: https://issues.apache.org/jira/browse/DRILL-7405 > Project: Apache Drill > Issue Type: Task > Components: Tools, Build & Test >Affects Versions: 1.16.0 >Reporter: Boaz Ben-Zvi >Assignee: Abhishek Girish >Priority: Critical > Labels: ready-to-commit > Fix For: 1.17.0 > > > A new clean build (e.g. after deleting the ~/.m2 local repository) would > fail now due to: > Access denied to: > [http://apache-drill.s3.amazonaws.com|https://urldefense.proofpoint.com/v2/url?u=http-3A__apache-2Ddrill.s3.amazonaws.com_files_sf-2D0.01-5Ftpc-2Dh-5Fparquet-5Ftyped.tgz&d=DwMGaQ&c=C5b8zRQO1miGmBeVZ2LFWg&r=KLC1nKJ8dIOnUay2kR6CAw&m=08mf7Xfn1orlbAA60GKLIuj_PTtfaSAijrKDLOucMPU&s=CX97We3sm3ZZ_aVJIrsUdXVJ3CNMYg7p3IsxbJpuXWk&e=] > > (e.g., for the test data sf-0.01_tpc-h_parquet_typed.tgz ) > A new publicly available storage place is needed, plus appropriate changes in > Drill to get to these resources. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956157#comment-16956157 ] ASF GitHub Bot commented on DRILL-5674: --- arina-ielchiieva commented on pull request #1879: DRILL-5674: Support ZIP compression URL: https://github.com/apache/drill/pull/1879 1. Added ZipCodec implementation which can read / write single file. 2. Revisited Drill plugin formats to ensure 'openPossiblyCompressedStream' method is used in those which support compression. 3. Added unit tests. 4. General refactoring. Jira - [DRILL-5674](https://issues.apache.org/jira/browse/DRILL-5674). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Drill should support .zip compression > - > > Key: DRILL-5674 > URL: https://issues.apache.org/jira/browse/DRILL-5674 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting > Fix For: 1.17.0 > > > Zip is a very common compression format. Create a compressed CSV file with > column headers: data.csv.zip. > Define a storage plugin config for the file, call it "dfs.myws", set > delimiter = ",", extract header = true, skip header = false. > Run a simple query: > SELECT * FROM dfs.myws.`data.csv.zip` > The result is garbage as the CSV reader is trying to parse Zipped data as if > it were text. > DRILL-5506 asks how to do this; the responder said to add a library to the > path. Better would be to simply support zip out-of-the-box as a default > format. -- This message was sent by Atlassian Jira (v8.3.4#803005)
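The PR above adds a `ZipCodec` that reads and writes a single file per archive. The Drill implementation itself is not shown here, but the core mechanic with the JDK's `java.util.zip` looks roughly like the following sketch (class and method names are illustrative, not Drill's):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipSingleEntry {

    // Compress one named entry into a zip archive held in memory.
    static byte[] zip(String entryName, byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bos)) {
            zout.putNextEntry(new ZipEntry(entryName));
            zout.write(data);
            zout.closeEntry();
        }
        return bos.toByteArray();
    }

    // Read back the archive's single entry, as a codec hook used by
    // "openPossiblyCompressedStream"-style plumbing might.
    static byte[] unzipSingle(byte[] zipBytes) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            if (zin.getNextEntry() == null) {
                throw new IOException("archive contains no entries");
            }
            // ZipInputStream signals end-of-entry as end-of-stream,
            // so readAllBytes() returns exactly this entry's bytes.
            return zin.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] csv = "id,name\n1,a\n".getBytes();
        byte[] archived = zip("data.csv", csv);
        System.out.print(new String(unzipSingle(archived)));
    }
}
```

This also shows why naive text reading of `data.csv.zip` produced garbage in the original report: without the codec, the reader parses the zip container bytes, not the decompressed entry.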
[jira] [Updated] (DRILL-5674) Drill should support .zip compression
[ https://issues.apache.org/jira/browse/DRILL-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-5674: Reviewer: Vova Vysotskyi > Drill should support .zip compression > - > > Key: DRILL-5674 > URL: https://issues.apache.org/jira/browse/DRILL-5674 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.10.0 >Reporter: Paul Rogers >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting > Fix For: 1.17.0 > > > Zip is a very common compression format. Create a compressed CSV file with > column headers: data.csv.zip. > Define a storage plugin config for the file, call it "dfs.myws", set > delimiter = ",", extract header = true, skip header = false. > Run a simple query: > SELECT * FROM dfs.myws.`data.csv.zip` > The result is garbage as the CSV reader is trying to parse Zipped data as if > it were text. > DRILL-5506 asks how to do this; the responder said to add a library to the > path. Better would be to simply support zip out-of-the-box as a default > format. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7416) Updates required to dependencies to resolve potential security vulnerabilities
Bradley Parker created DRILL-7416: - Summary: Updates required to dependencies to resolve potential security vulnerabilities Key: DRILL-7416 URL: https://issues.apache.org/jira/browse/DRILL-7416 Project: Apache Drill Issue Type: Bug Affects Versions: 1.16.0 Reporter: Bradley Parker

After running an OWASP Dependency Check and ruling out false positives, I have found 25 dependencies that should be updated to remove potential vulnerabilities. They are listed alphabetically with their CVE information below. [CVSS scores|https://en.wikipedia.org/wiki/Common_Vulnerability_Scoring_System] represent the severity of a vulnerability on a scale of 1-10, 10 being critical. [CVEs|https://en.wikipedia.org/wiki/Common_Vulnerabilities_and_Exposures] are public identifiers used to reference known vulnerabilities.

Package: avro-1.8.2
Should be: 1.9.0 (*Existing item at DRILL-7302*)
Max CVE (CVSS): CVE-2018-10237 (5.9)
Complete CVE list: CVE-2018-10237

Package: commons-beanutils-1.9.2
Should be: 1.9.4
Max CVE (CVSS): CVE-2019-10086 (7.3)
Complete CVE list: CVE-2019-10086

Package: commons-beanutils-core-1.8.0
Should be: Moved to commons-beanutils
Max CVE (CVSS): CVE-2014-0114 (7.5)
Complete CVE list: CVE-2014-0114
Note: Deprecated, replaced by commons-beanutils

Package: converter-jackson
Should be: 2.5.0
Max CVE (CVSS): CVE-2018-1000850 (7.5)
Complete CVE list: CVE-2018-1000850

Package: derby-10.10.2.0
Should be: 10.14.2.0
Max CVE (CVSS): CVE-2015-1832 (9.1)
Complete CVE list: CVE-2015-1832 CVE-2018-1313

Package: drill-hive-exec-shaded
Should be: New release needed with updated Guava
Max CVE (CVSS): CVE-2018-10237 (7.5)
Complete CVE list: CVE-2018-10237

Package: drill-java-exec
Should be: New release needed with updated jQuery and Bootstrap
Max CVE (CVSS): CVE-2019-11358 (6.1)
Complete CVE list: CVE-2018-14040 CVE-2018-14041 CVE-2018-14042 CVE-2019-8331 CVE-2019-11358

Package: drill-shaded-guava-23
Should be: New release needed with updated Guava
Max CVE (CVSS): CVE-2018-10237 (5.9)
Complete CVE list: CVE-2018-10237

Package: guava-19.0
Should be: 24.1.1
Max CVE (CVSS): CVE-2018-10237 (5.9)
Complete CVE list: CVE-2018-10237

Package: hadoop-yarn-common-2.7.4
Should be: 3.2.1
Max CVE (CVSS): CVE-2019-11358 (6.1)
Complete CVE list: CVE-2012-6708 CVE-2015-9251 CVE-2019-11358 CVE-2010-5312 CVE-2016-7103

Package: hbase-http-2.1.1.jar
Should be: 2.1.4
Max CVE (CVSS): CVE-2019-0212 (7.5)
Complete CVE list: CVE-2019-0212

Package: httpclient-4.2.5.jar
Should be: 4.3.6
Max CVE (CVSS): CVE-2014-3577 (5.8)
Complete CVE list: CVE-2014-3577 CVE-2015-5262

Package: jackson-databind-2.9.5
Should be: 2.10.0
Max CVE (CVSS): CVE-2018-14721 (10)
Complete CVE list: CVE-2019-17267 CVE-2019-16943 CVE-2019-16942 CVE-2019-16335 CVE-2019-14540 CVE-2019-14439 CVE-2019-14379 CVE-2018-11307 CVE-2019-12384 CVE-2019-12814 CVE-2019-12086 CVE-2018-12023 CVE-2018-12022 CVE-2018-19362 CVE-2018-19361 CVE-2018-19360 CVE-2018-14721 CVE-2018-14720 CVE-2018-14719 CVE-2018-14718 CVE-2018-1000873

Package: jetty-server-9.3.25.v20180904.jar (*Existing DRILL-7135, but that's to go to 9.4 and it's blocked; we should go to the latest 9.3 in the meantime*)
Should be: 9.3.27.v20190418
Max CVE (CVSS): CVE-2017-9735 (7.5)
Complete CVE list: CVE-2017-9735 CVE-2019-10241 CVE-2019-10247

Package: Kafka 0.11.0.1
Should be: 2.2.0 (*Existing item DRILL-6739*)
Max CVE (CVSS): CVE-2018-17196 (8.8)
Complete CVE list: CVE-2018-17196 CVE-2018-1288 CVE-2017-12610

Package: kudu-client-1.3.0.jar
Should be: 1.10.0
Max CVE (CVSS): CVE-2015-5237 (8.8)
Complete CVE list: CVE-2018-10237 CVE-2015-5237 CVE-2019-16869
Note: Only a partial fix; no fix for netty CVE-2019-16869 (7.5), kudu still needs to update their netty (this is not unexpected as this CVE is newer)

Package: libfb303-0.9.3.jar
Should be: 0.12.0
Max CVE (CVSS): CVE-2018-1320 (7.5)
Complete CVE list: CVE-2018-1320
Note: Moved to libthrift

Package: okhttp-3.3.0
Should be: 3.12.0
Max CVE (CVSS): CVE-2018-20200 (5.9)
Complete CVE list: CVE-2018-20200

Package: protobuf-java-2.5.0
Should be: 3.4.0
Max CVE (CVSS): CVE-2015-5237 (8.8)
Complete CVE list: CVE-2015-5237

Package: retrofit-2.1.0
Should be: 2.5.0
Max CVE (CVSS): CVE-2018-1000850 (7.5)
Complete CVE list: CVE-2018-1000850

Package: scala-library-2.11.0
Should be: 2.11.12
Max CVE (CVSS): CVE-2017-15288 (7.8)
Complete CVE list: CVE-2017-15288

Package: serializer-2.7.1
Should be: 2.7.2
Max CVE (CVSS): CVE-2014-0107 (7.5)
Complete CVE list: CVE-2014-0107

Package: xalan-2.7.1
Should be: 2.7.2
Max CVE (CVSS): CVE-2014-0107 (7.5)
Complete CVE list: CVE-2014-0107

Package: xercesImpl-2.11.0
Should be: 2.12.0
Max CVE (CVSS): CVE-2012-0881 (7.5)
Complete CVE list: CVE-2012-0881

Package: zookeeper-3.4.12
Should be: 3.4.14
Max CVE (CVSS): CVE-2019-0201 (5.9)
Complete CVE list: CVE-2019-0201

Additional keywords for searching: Vulnerabili
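The report above comes from an OWASP Dependency-Check run. For reproducing such a report in a Maven build, the usual route is the `dependency-check-maven` plugin; the fragment below is an illustrative configuration (the plugin coordinates are the project's standard ones, but the version shown is only an example and should be checked against the plugin's releases):

```xml
<!-- Illustrative pom.xml fragment: runs an OWASP dependency scan
     during the build; version number is an example only. -->
<plugin>
  <groupId>org.owasp</groupId>
  <artifactId>dependency-check-maven</artifactId>
  <version>5.2.2</version>
  <executions>
    <execution>
      <goals>
        <goal>check</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

It can also be run ad hoc, without touching the pom, via `mvn org.owasp:dependency-check-maven:check`; false positives like those mentioned above are then suppressed with the plugin's suppression-file mechanism.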
[jira] [Updated] (DRILL-3850) Execute multiple commands from sqlline -q
[ https://issues.apache.org/jira/browse/DRILL-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-3850: Fix Version/s: 1.17.0 > Execute multiple commands from sqlline -q > - > > Key: DRILL-3850 > URL: https://issues.apache.org/jira/browse/DRILL-3850 > Project: Apache Drill > Issue Type: Bug > Components: Client - CLI >Affects Versions: 1.1.0, 1.2.0 > Environment: Mint 17.1 >Reporter: Philip Deegan >Priority: Major > Fix For: 1.17.0 > > > Be able to perform > {noformat} > ./sqlline -u jdbc:drill:zk=local -q "use dfs.tmp; alter session set > \`store.format\`='csv';" > {noformat} > instead of > {noformat} > ./sqlline -u jdbc:drill:zk=local -q "use dfs.tmp;" > ./sqlline -u jdbc:drill:zk=local -q "alter session set > \`store.format\`='csv';" > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (DRILL-3850) Execute multiple commands from sqlline -q
[ https://issues.apache.org/jira/browse/DRILL-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva resolved DRILL-3850. - Resolution: Fixed Fixed in the scope of DRILL-7401. > Execute multiple commands from sqlline -q > - > > Key: DRILL-3850 > URL: https://issues.apache.org/jira/browse/DRILL-3850 > Project: Apache Drill > Issue Type: Bug > Components: Client - CLI >Affects Versions: 1.1.0, 1.2.0 > Environment: Mint 17.1 >Reporter: Philip Deegan >Priority: Major > Fix For: 1.17.0 > > > Be able to perform > {noformat} > ./sqlline -u jdbc:drill:zk=local -q "use dfs.tmp; alter session set > \`store.format\`='csv';" > {noformat} > instead of > {noformat} > ./sqlline -u jdbc:drill:zk=local -q "use dfs.tmp;" > ./sqlline -u jdbc:drill:zk=local -q "alter session set > \`store.format\`='csv';" > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7415) Information schema query fails for postgres foreign tables
Igor Guzenko created DRILL-7415: --- Summary: Information schema query fails for postgres foreign tables Key: DRILL-7415 URL: https://issues.apache.org/jira/browse/DRILL-7415 Project: Apache Drill Issue Type: Bug Affects Versions: 1.16.0 Reporter: Igor Guzenko Assignee: Igor Guzenko Fix For: Future

1) Set up a JDBC driver in Drill pointing to Postgres.
2) Create public foreign tables like the ones below in Postgres:
public | vessel | foreign table | postgres
public | vessel_movement | foreign table | postgres
public | vessel_movement_hist | foreign table | postgres
3) Execute this query in Drill:
{code:sql}SELECT * FROM `INFORMATION_SCHEMA`.`TABLES`;{code}

*Actual result*
{code}
Caused by: java.lang.IllegalArgumentException: Multiple entries with same key: vessel=JdbcTable {vessel} and vessel=JdbcTable {vessel}
  at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:136) ~[guava-19.0.jar:na]
  at com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:98) ~[guava-19.0.jar:na]
  at com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:84) ~[guava-19.0.jar:na]
  at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:295) ~[guava-19.0.jar:na]
  at org.apache.calcite.adapter.jdbc.JdbcSchema.computeTables(JdbcSchema.java:269) ~[calcite-core-1.18.0-drill-r0.jar:1.18.0-drill-r0]
  at org.apache.calcite.adapter.jdbc.JdbcSchema.getTableMap(JdbcSchema.java:285) ~[calcite-core-1.18.0-drill-r0.jar:1.18.0-drill-r0]
  at org.apache.calcite.adapter.jdbc.JdbcSchema.getTableNames(JdbcSchema.java:410) ~[calcite-core-1.18.0-drill-r0.jar:1.18.0-drill-r0]
  at org.apache.drill.exec.store.jdbc.JdbcStoragePlugin$CapitalizingJdbcSchema.getTableNames(JdbcStoragePlugin.java:282) ~[drill-jdbc-storage-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.AbstractSchema.getTableNamesAndTypes(AbstractSchema.java:299) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaRecordGenerator$Tables.visitTables(InfoSchemaRecordGenerator.java:340) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaRecordGenerator.scanSchema(InfoSchemaRecordGenerator.java:254) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaRecordGenerator.scanSchema(InfoSchemaRecordGenerator.java:247) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaRecordGenerator.scanSchema(InfoSchemaRecordGenerator.java:247) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaRecordGenerator.scanSchema(InfoSchemaRecordGenerator.java:234) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaTableType.getRecordReader(InfoSchemaTableType.java:58) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaBatchCreator.getBatch(InfoSchemaBatchCreator.java:34) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.store.ischema.InfoSchemaBatchCreator.getBatch(InfoSchemaBatchCreator.java:30) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:159) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:182) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:137) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:182) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.physical.impl.ImplCreator.getRootExec(ImplCreator.java:110) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:87) ~[drill-java-exec-1.16.0.jar:1.16.0]
  at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:263) [drill-java-exec-1.16.0.jar:1.16.0]
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
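The root cause in the trace above is Guava's `ImmutableMap.Builder` rejecting two tables that share the name `vessel` when Calcite's `JdbcSchema` computes its table map. The JDK's immutable maps fail the same way, which makes the failure mode easy to reproduce in isolation (the sketch below uses `java.util.Map.of` as a stand-in for the Guava builder):

```java
import java.util.Map;

public class DuplicateKeyDemo {

    public static void main(String[] args) {
        try {
            // Two entries keyed "vessel", as when the JDBC schema sees the
            // same foreign table twice while building its table map.
            Map<String, String> tables = Map.of(
                "vessel", "JdbcTable{vessel}",
                "vessel", "JdbcTable{vessel}");
            System.out.println("built: " + tables);
        } catch (IllegalArgumentException e) {
            // Immutable map construction rejects duplicate keys outright.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A fix therefore has to deduplicate (or disambiguate) table names before the immutable map is built, rather than catching the exception afterwards.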
[jira] [Commented] (DRILL-7401) Sqlline 1.9 upgrade
[ https://issues.apache.org/jira/browse/DRILL-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956046#comment-16956046 ] ASF GitHub Bot commented on DRILL-7401: --- asfgit commented on pull request #1875: DRILL-7401: Upgrade to SqlLine 1.9.0 URL: https://github.com/apache/drill/pull/1875 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Sqlline 1.9 upgrade > --- > > Key: DRILL-7401 > URL: https://issues.apache.org/jira/browse/DRILL-7401 > Project: Apache Drill > Issue Type: Task >Reporter: Arina Ielchiieva >Assignee: Arina Ielchiieva >Priority: Major > Labels: ready-to-commit > Fix For: 1.17.0 > > > Upgrade to SqlLine 1.9 once it is released > (https://github.com/julianhyde/sqlline/issues/350). > *TODO:* > 1. Add SqlLine properties: > {{connectInteractionMode: useNPTogetherOrEmpty}} - supports connection > mechanism used in SqlLine 1.17 and earlier: > a. if user and password are not indicated, connects without them (user and > password are set to empty string): {{./drill-embedded}} > b. if user is indicated, asks for password in interactive mode: > {{./drill-embedded -n "user1"}} > c. if user is indicated as empty string, behaves like in point a (user and > password are set to empty string): {{./drill-embedded -n ""}} > d. if user and password are indicated, connects using provided input > {{./drill-embedded -n "user1" -p "123"}} > {{showLineNumbers: true}} - adds line numbers when query is more than one > line: > {noformat} > apache drill> select > 2..semicolon> * > 3..semicolon> from > 4..semicolon> sys.version; > {noformat} > 2. Remove nohup support code from sqlline.sh since it is not needed any more > (nohup support works without flag): > {code} > To add nohup support for SQLline script > if [[ ( ! 
$(ps -o stat= -p $$) =~ "+" ) && ! ( -p /dev/stdin ) ]]; then >export SQLLINE_JAVA_OPTS="$SQLLINE_JAVA_OPTS > -Djline.terminal=jline.UnsupportedTerminal" > fi > {code} > 3. Add {{-Dorg.jline.terminal.dumb=true}} to avoid JLine terminal warning > when submitting query in sqlline.sh to execute via {{-e}} or {{-f}}: > {noformat} > Oct 11, 2019 2:14:45 PM org.jline.utils.Log logr > WARNING: Unable to create a system terminal, creating a dumb terminal (enable > debug logging for more information) > {noformat} > 4. Remove unneeded echo commands in sqlline.bat during start up: > {noformat} > drill-embedded.bat > DRILL_ARGS - " -u jdbc:drill:zk=local -n user1 -p ppp" > Calculating HADOOP_CLASSPATH ... > HBASE_HOME not detected... > Calculating Drill classpath... > Apache Drill 1.17.0-SNAPSHOT > "Data is the new oil. Ready to Drill some?" > apache drill> > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7405) Build fails due to inaccessible apache-drill on S3 storage
[ https://issues.apache.org/jira/browse/DRILL-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956049#comment-16956049 ] ASF GitHub Bot commented on DRILL-7405: --- asfgit commented on pull request #1874: DRILL-7405: Avoiding download of TPC-H data URL: https://github.com/apache/drill/pull/1874 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Build fails due to inaccessible apache-drill on S3 storage > -- > > Key: DRILL-7405 > URL: https://issues.apache.org/jira/browse/DRILL-7405 > Project: Apache Drill > Issue Type: Task > Components: Tools, Build & Test >Affects Versions: 1.16.0 >Reporter: Boaz Ben-Zvi >Assignee: Abhishek Girish >Priority: Critical > Labels: ready-to-commit > Fix For: 1.17.0 > > > A new clean build (e.g. after deleting the ~/.m2 local repository) would > fail now due to: > Access denied to: > [http://apache-drill.s3.amazonaws.com|https://urldefense.proofpoint.com/v2/url?u=http-3A__apache-2Ddrill.s3.amazonaws.com_files_sf-2D0.01-5Ftpc-2Dh-5Fparquet-5Ftyped.tgz&d=DwMGaQ&c=C5b8zRQO1miGmBeVZ2LFWg&r=KLC1nKJ8dIOnUay2kR6CAw&m=08mf7Xfn1orlbAA60GKLIuj_PTtfaSAijrKDLOucMPU&s=CX97We3sm3ZZ_aVJIrsUdXVJ3CNMYg7p3IsxbJpuXWk&e=] > > (e.g., for the test data sf-0.01_tpc-h_parquet_typed.tgz ) > A new publicly available storage place is needed, plus appropriate changes in > Drill to get to these resources. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7402) Suppress batch dumps for expected failures in tests
[ https://issues.apache.org/jira/browse/DRILL-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956047#comment-16956047 ] ASF GitHub Bot commented on DRILL-7402: --- asfgit commented on pull request #1872: DRILL-7402: Suppress batch dumps for expected failures in tests URL: https://github.com/apache/drill/pull/1872 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Suppress batch dumps for expected failures in tests > --- > > Key: DRILL-7402 > URL: https://issues.apache.org/jira/browse/DRILL-7402 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.16.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > Drill provides a way to dump the last few batches when an error occurs. > However, in tests, we often deliberately cause something to fail. In this > case, the batch dump is unnecessary. > This enhancement adds a config property, disabled in tests, that controls the > dump activity. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7412) Minor unit test improvements
[ https://issues.apache.org/jira/browse/DRILL-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956048#comment-16956048 ] ASF GitHub Bot commented on DRILL-7412: --- asfgit commented on pull request #1876: DRILL-7412: Minor unit test improvements URL: https://github.com/apache/drill/pull/1876 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Minor unit test improvements > > > Key: DRILL-7412 > URL: https://issues.apache.org/jira/browse/DRILL-7412 > Project: Apache Drill > Issue Type: Improvement >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > Many tests intentionally trigger errors. A debug-only log setting sent those > errors to stdout. The resulting stack dumps simply cluttered the test output, > so disabled error output to the console. > Drill can apply bounds checks to vectors. Tests run via Maven enable bounds > checking. Now, bounds checking is also enabled in "debug mode" (when > assertions are enabled, as in an IDE.) > Drill contains two test frameworks. The older BaseTestQuery was marked as > deprecated, but many tests still use it and are unlikely to be changed soon. > So, removed the deprecated marker to reduce the number of spurious warnings. -- This message was sent by Atlassian Jira (v8.3.4#803005)
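The "bounds checking enabled in debug mode (when assertions are enabled, as in an IDE)" behavior described above is typically implemented with the standard assert-with-side-effect idiom. A sketch of that idiom (not Drill's actual code; class and field names are illustrative):

```java
public class AssertDetect {

    static boolean assertionsEnabled() {
        boolean enabled = false;
        // The assignment inside the assert statement only executes when
        // the JVM runs with -ea; otherwise 'enabled' stays false.
        assert enabled = true;
        return enabled;
    }

    // Computed once at class load: true under -ea (Maven surefire, IDEs
    // with assertions on), false in a normal production JVM -- the kind
    // of "debug mode" switch that can gate vector bounds checking.
    static final boolean DEBUG = assertionsEnabled();

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + DEBUG);
    }
}
```

Because `DEBUG` is a compile-time-constant-free static final boolean, the JIT can drop the guarded bounds checks entirely in production, which is why this pattern costs nothing when assertions are off.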
[jira] [Commented] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955998#comment-16955998 ] ASF GitHub Bot commented on DRILL-7414: --- arina-ielchiieva commented on pull request #1878: DRILL-7414: EVF incorrectly sets buffer writer index after rollover URL: https://github.com/apache/drill/pull/1878#discussion_r336972303 ## File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/resultSet/impl/TestResultSetLoaderOverflow.java ## @@ -59,55 +61,64 @@ @Test public void testVectorSizeLimit() { -TupleMetadata schema = new SchemaBuilder() +final TupleMetadata schema = new SchemaBuilder() Review comment: Since Java supports effectively final variables, there is no need for excessive use of the final keyword unless you want to explicitly indicate that a variable is final. I am not going to request the change during code review; I guess this mostly relates to the developer's code-writing style, but since you are doing lots of refactoring adding the final keyword, I just wanted to highlight that it might be unnecessary. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > EVF incorrectly sets buffer writer index after rollover > --- > > Key: DRILL-7414 > URL: https://issues.apache.org/jira/browse/DRILL-7414 > Project: Apache Drill > Issue Type: Bug >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > A full test run, with vector validation enabled and with the "new" scan > enabled, revealed the following in {{TestMockPlugin.testSizeLimit()}}: > {noformat} > comments_s2 - VarCharVector: Row count = 838, but value count = 839 > {noformat} > Adding vector validation to the result set loader overflow tests reveals that > the problem is in overflow.
In > {{TestResultSetLoaderOverflow.testOverflowWithNullables()}}: > {noformat} > a - RepeatedIntVector: Row count = 2952, but value count = 2953 > b - RepeatedVarCharVector: Row count = 2952, but value count = 2953 > b - RepeatedVarCharVector: Vector has 2953 values, but offset vector labels > 32472 values > c - RepeatedIntVector: Row count = 2952, but value count = 2953 > d - RepeatedIntVector: Row count = 2952, but value count = 2953 > {noformat} > The problem is that EVF incorrectly sets the offset buffer writer index after > a rollover. -- This message was sent by Atlassian Jira (v8.3.4#803005)
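The review comment above about effectively final variables can be illustrated in a few lines: since Java 8, a local variable captured by a lambda or inner class does not need an explicit final modifier as long as it is assigned exactly once. This is a standalone example, unrelated to Drill's code.

```java
// Illustration of "effectively final": a once-assigned local can be
// captured by a lambda without the final keyword.
import java.util.function.Supplier;

class EffectivelyFinalDemo {

  static String demo() {
    String schemaName = "mySchema";           // effectively final: never reassigned
    Supplier<String> s = () -> schemaName;    // legal capture without 'final'
    // schemaName = "other";                  // uncommenting breaks the capture
    return s.get();
  }
}
```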
[jira] [Assigned] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva reassigned DRILL-7414: --- Assignee: Paul Rogers (was: Arina Ielchiieva)
[jira] [Updated] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7414: Reviewer: Arina Ielchiieva
[jira] [Updated] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7414: Fix Version/s: 1.17.0
[jira] [Updated] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7414: Labels: ready-to-commit (was: )
[jira] [Assigned] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva reassigned DRILL-7414: --- Assignee: Arina Ielchiieva (was: Paul Rogers)
[jira] [Commented] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955991#comment-16955991 ] ASF GitHub Bot commented on DRILL-7414: --- arina-ielchiieva commented on pull request #1878: DRILL-7414: EVF incorrectly sets buffer writer index after rollover URL: https://github.com/apache/drill/pull/1878#discussion_r336971442 ## File path: exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/OffsetVectorWriterImpl.java ## @@ -290,7 +290,7 @@ public void preRollover() { // rows. But, this being an offset vector, we add one to account // for the extra 0 value at the start. -setValueCount(vectorIndex.rowStartIndex() + 1); +setValueCount(vectorIndex.rowStartIndex()); Review comment: ` But, this being an offset vector, we add one to account for the extra 0 value at the start.` - Should this comment be updated after the change? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
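The preRollover() change discussed above hinges on the offset-vector convention: a variable-width vector holding n values carries n + 1 offsets, where offsets[0] is always 0 and offsets[i + 1] marks the end of value i. Setting the value count one too high after rollover is exactly the off-by-one mismatch shown in the bug report. The following is a standalone sketch of that convention, not Drill's actual code.

```java
// Standalone illustration of the offset-buffer layout used by
// variable-width vectors: n values require n + 1 offsets.
class OffsetVectorSketch {

  // Builds the offset buffer for values of the given byte lengths.
  static int[] offsetsFor(int[] valueLengths) {
    int[] offsets = new int[valueLengths.length + 1];
    offsets[0] = 0; // the extra leading zero the code comment refers to
    for (int i = 0; i < valueLengths.length; i++) {
      offsets[i + 1] = offsets[i] + valueLengths[i];
    }
    return offsets;
  }
}
```

For three values of lengths 3, 1, and 2, the buffer is [0, 3, 4, 6]: four offsets for three values, which is why a "value count" one higher than the row count trips the validator.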
[jira] [Commented] (DRILL-7414) EVF incorrectly sets buffer writer index after rollover
[ https://issues.apache.org/jira/browse/DRILL-7414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955993#comment-16955993 ] ASF GitHub Bot commented on DRILL-7414: --- arina-ielchiieva commented on issue #1878: DRILL-7414: EVF incorrectly sets buffer writer index after rollover URL: https://github.com/apache/drill/pull/1878#issuecomment-544478701 @paul-rogers mostly looks good, one minor concern about whether the comment needs to be updated after the code change. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (DRILL-7413) Scan operator does not set the container record count
[ https://issues.apache.org/jira/browse/DRILL-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7413: Reviewer: Arina Ielchiieva > Scan operator does not set the container record count > - > > Key: DRILL-7413 > URL: https://issues.apache.org/jira/browse/DRILL-7413 > Project: Apache Drill > Issue Type: Bug >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > Enable the vector checking provided in DRILL-7403. Enable just for the JSON > reader. You will get the following error: > {noformat} > 12:36:57.399 [22549a3d-a937-df51-2e13-4b032ba143f9:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > ScanBatch > ScanBatch: Container record count not set > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7413) Scan operator does not set the container record count
[ https://issues.apache.org/jira/browse/DRILL-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7413: Fix Version/s: 1.17.0
[jira] [Commented] (DRILL-7413) Scan operator does not set the container record count
[ https://issues.apache.org/jira/browse/DRILL-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955986#comment-16955986 ] ASF GitHub Bot commented on DRILL-7413: --- arina-ielchiieva commented on issue #1877: DRILL-7413: Test and fix scan operator vectors URL: https://github.com/apache/drill/pull/1877#issuecomment-544476885 LGTM, +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (DRILL-7413) Scan operator does not set the container record count
[ https://issues.apache.org/jira/browse/DRILL-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955985#comment-16955985 ] ASF GitHub Bot commented on DRILL-7413: --- arina-ielchiieva commented on issue #1877: DRILL-7413: Test and fix scan operator vectors URL: https://github.com/apache/drill/pull/1877#issuecomment-544476885 +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (DRILL-7413) Scan operator does not set the container record count
[ https://issues.apache.org/jira/browse/DRILL-7413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7413: Labels: ready-to-commit (was: )
[jira] [Updated] (DRILL-7405) Build fails due to inaccessible apache-drill on S3 storage
[ https://issues.apache.org/jira/browse/DRILL-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7405: Labels: ready-to-commit (was: ) > Build fails due to inaccessible apache-drill on S3 storage > -- > > Key: DRILL-7405 > URL: https://issues.apache.org/jira/browse/DRILL-7405 > Project: Apache Drill > Issue Type: Task > Components: Tools, Build & Test >Affects Versions: 1.16.0 >Reporter: Boaz Ben-Zvi >Assignee: Abhishek Girish >Priority: Critical > Labels: ready-to-commit > Fix For: 1.17.0 > > > A new clean build (e.g. after deleting the ~/.m2 local repository) would > fail now due to: > Access denied to: > http://apache-drill.s3.amazonaws.com/files/sf-0.01_tpc-h_parquet_typed.tgz > (e.g., for the test data sf-0.01_tpc-h_parquet_typed.tgz ) > A new publicly available storage place is needed, plus appropriate changes in > Drill to get to these resources. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7405) Build fails due to inaccessible apache-drill on S3 storage
[ https://issues.apache.org/jira/browse/DRILL-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955984#comment-16955984 ] ASF GitHub Bot commented on DRILL-7405: --- arina-ielchiieva commented on issue #1874: DRILL-7405: Avoiding download of TPC-H data URL: https://github.com/apache/drill/pull/1874#issuecomment-544476019 +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (DRILL-7403) Validate batch checks, vector integretity in unit tests
[ https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955982#comment-16955982 ] ASF GitHub Bot commented on DRILL-7403: --- arina-ielchiieva commented on issue #1871: DRILL-7403: Validate batch checks, vector integretity in unit tests URL: https://github.com/apache/drill/pull/1871#issuecomment-544475715 @paul-rogers sorry, I could not merge this commit, since there are a couple of minor comments. Mostly I am worried about the absent `print` method. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Validate batch checks, vector integretity in unit tests > --- > > Key: DRILL-7403 > URL: https://issues.apache.org/jira/browse/DRILL-7403 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.16.0, 1.17.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > Drill provides a {{BatchValidator}} that checks vectors. It is disabled by > default. This enhancement adds more checks, including checks for row counts > (of which there are surprisingly many.) > Since most operators will fail if the check is enabled, this enhancement also > adds a table to keep track of which operators pass the checks (and for which > checks should be enabled) and those that still need work. This allows the > checks to exist in the code, and to be enabled incrementally as we fix the > various problems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
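The row-count checks described above can be sketched in a few lines: every vector in a batch should report a value count equal to the batch's row count, and mismatches are collected as errors (as in the "Row count = 838, but value count = 839" messages quoted in related messages above). This is a hedged sketch of the idea only; the names are illustrative, not Drill's BatchValidator API.

```java
// Illustrative sketch of a per-batch row-count consistency check:
// collect one error per vector whose value count disagrees with the
// batch's row count. Not Drill's actual BatchValidator code.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class RowCountCheck {

  static List<String> validate(int rowCount, Map<String, Integer> valueCounts) {
    List<String> errors = new ArrayList<>();
    valueCounts.forEach((name, count) -> {
      if (count != rowCount) {
        errors.add(String.format("%s: Row count = %d, but value count = %d",
            name, rowCount, count));
      }
    });
    return errors;
  }
}
```

Collecting errors rather than failing fast matches the incremental-adoption approach in the description: operators that still fail the checks can be tracked while the checks remain in the code.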
[jira] [Updated] (DRILL-7412) Minor unit test improvements
[ https://issues.apache.org/jira/browse/DRILL-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7412: Reviewer: Arina Ielchiieva
[jira] [Updated] (DRILL-7412) Minor unit test improvements
[ https://issues.apache.org/jira/browse/DRILL-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7412: Fix Version/s: 1.17.0
[jira] [Commented] (DRILL-7412) Minor unit test improvements
[ https://issues.apache.org/jira/browse/DRILL-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955979#comment-16955979 ] ASF GitHub Bot commented on DRILL-7412: --- arina-ielchiieva commented on issue #1876: DRILL-7412: Minor unit test improvements URL: https://github.com/apache/drill/pull/1876#issuecomment-544475117 +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (DRILL-7412) Minor unit test improvements
[ https://issues.apache.org/jira/browse/DRILL-7412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-7412: Labels: ready-to-commit (was: )
[jira] [Commented] (DRILL-7403) Validate batch checks, vector integrity in unit tests
[ https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955976#comment-16955976 ] ASF GitHub Bot commented on DRILL-7403: --- arina-ielchiieva commented on pull request #1871: DRILL-7403: Validate batch checks, vector integretity in unit tests URL: https://github.com/apache/drill/pull/1871#discussion_r336965873 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/validate/BatchValidator.java ## @@ -45,103 +49,373 @@ */ public class BatchValidator { - private static final org.slf4j.Logger logger = - org.slf4j.LoggerFactory.getLogger(BatchValidator.class); + private static final Logger logger = LoggerFactory.getLogger(BatchValidator.class); + public static final boolean LOG_TO_STDOUT = true; public static final int MAX_ERRORS = 100; - private final int rowCount; - private final VectorAccessible batch; - private final List errorList; - private int errorCount; + public interface ErrorReporter { +void error(String name, ValueVector vector, String msg); +void warn(String name, ValueVector vector, String msg); +void error(String msg); +int errorCount(); + } + + public abstract static class BaseErrorReporter implements ErrorReporter { + +private final String opName; +private int errorCount; + +public BaseErrorReporter(String opName) { + this.opName = opName; +} + +protected boolean startError() { + if (errorCount == 0) { +warn("Found one or more vector errors from " + opName); + } + errorCount++; + if (errorCount >= MAX_ERRORS) { +return false; + } + return true; +} + +@Override +public void error(String name, ValueVector vector, String msg) { + error(String.format("%s - %s: %s", +name, vector.getClass().getSimpleName(), msg)); +} + +@Override +public void warn(String name, ValueVector vector, String msg) { + warn(String.format("%s - %s: %s", +name, vector.getClass().getSimpleName(), msg)); +} + +public abstract void warn(String msg); + +@Override +public int 
errorCount() { return errorCount; } + } + + private static class StdOutReporter extends BaseErrorReporter { + +public StdOutReporter(String opName) { + super(opName); +} + +@Override +public void error(String msg) { + if (startError()) { +System.out.println(msg); + } +} + +@Override +public void warn(String msg) { + System.out.println(msg); +} + } + + private static class LogReporter extends BaseErrorReporter { + +public LogReporter(String opName) { + super(opName); +} + +@Override +public void error(String msg) { + if (startError()) { +logger.error(msg); + } +} + +@Override +public void warn(String msg) { + logger.error(msg); +} + } + + private enum CheckMode { COUNTS, ALL }; + + private static final Map, CheckMode> checkRules = buildRules(); - public BatchValidator(VectorAccessible batch) { -rowCount = batch.getRecordCount(); -this.batch = batch; -errorList = null; + private final ErrorReporter errorReporter; + + public BatchValidator(ErrorReporter errorReporter) { +this.errorReporter = errorReporter; + } + + /** + * At present, most operators will not pass the checks here. The following + * table identifies those that should be checked, and the degree of check. + * Over time, this table should include all operators, and thus become + * unnecessary. + */ + private static Map, CheckMode> buildRules() { +final Map, CheckMode> rules = new IdentityHashMap<>(); +//rules.put(OperatorRecordBatch.class, CheckMode.ALL); +return rules; } - public BatchValidator(VectorAccessible batch, boolean captureErrors) { -rowCount = batch.getRecordCount(); -this.batch = batch; -if (captureErrors) { - errorList = new ArrayList<>(); + public static boolean validate(RecordBatch batch) { +final CheckMode checkMode = checkRules.get(batch.getClass()); + +// If no rule, don't check this batch. + +if (checkMode == null) { + + // As work proceeds, might want to log those batches not checked. + // For now, there are too many. 
+ + return true; +} + +// All batches that do any checks will at least check counts. + +final ErrorReporter reporter = errorReporter(batch); +final int rowCount = batch.getRecordCount(); +int valueCount = rowCount; +final VectorContainer container = batch.getContainer(); +if (!container.hasRecordCount()) { + reporter.error(String.format( + "%s: Container record count not set", + batch.getClass().getSimpleName())); } else { - errorList = null; + // Row count will = container count for most operators. + // Row c
[jira] [Commented] (DRILL-7403) Validate batch checks, vector integrity in unit tests
[ https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955974#comment-16955974 ] ASF GitHub Bot commented on DRILL-7403: --- arina-ielchiieva commented on pull request #1871: DRILL-7403: Validate batch checks, vector integretity in unit tests URL: https://github.com/apache/drill/pull/1871#discussion_r336966111 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/validate/BatchValidator.java ## @@ -150,57 +424,63 @@ private int validateOffsetVector(String name, UInt4Vector offsetVector, int valu error(name, offsetVector, "Offset (0) must be 0 but was " + prevOffset); } -// Note <= comparison: offset vectors have (n+1) entries. - -for (int i = 1; i <= valueCount; i++) { - int offset = accessor.get(i); +for (int i = 1; i < offsetCount; i++) { + final int offset = accessor.get(i); if (offset < prevOffset) { -error(name, offsetVector, "Decreasing offsets at (" + (i-1) + ", " + i + ") = (" + prevOffset + ", " + offset + ")"); +error(name, offsetVector, String.format( +"Offset vector [%d] contained %d, expected >= %d", +i, offset, prevOffset)); } else if (offset > maxOffset) { -error(name, offsetVector, "Invalid offset at index " + i + " = " + offset + " exceeds maximum of " + maxOffset); +error(name, offsetVector, String.format( +"Invalid offset at index %d: %d exceeds maximum of %d", +i, offset, maxOffset)); } prevOffset = offset; } return prevOffset; } private void error(String name, ValueVector vector, String msg) { -if (errorCount == 0) { - logger.error("Found one or more vector errors from " + batch.getClass().getSimpleName()); -} -errorCount++; -if (errorCount >= MAX_ERRORS) { - return; -} -String fullMsg = "Column " + name + " of type " + vector.getClass().getSimpleName( ) + ": " + msg; -logger.error(fullMsg); -if (errorList != null) { - errorList.add(fullMsg); -} +errorReporter.error(name, vector, msg); } - private void validateNullableVector(String 
name, NullableVector vector) { -// Can't validate at this time because the bits vector is in each -// generated subtype. - -// Validate a VarChar vector because it is common. - -if (vector instanceof NullableVarCharVector) { - VarCharVector values = ((NullableVarCharVector) vector).getValuesVector(); - validateVarCharVector(name + "-values", values, rowCount); + private void verifyIsSetVector(ValueVector parent, UInt1Vector bv) { +final String name = String.format("%s (%s)-bits", +parent.getField().getName(), +parent.getClass().getSimpleName()); +final int rowCount = parent.getAccessor().getValueCount(); +final int bitCount = bv.getAccessor().getValueCount(); +if (bitCount != rowCount) { + error(name, bv, String.format( + "Value count = %d, but bit count = %d", + rowCount, bitCount)); +} +final UInt1Vector.Accessor ba = bv.getAccessor(); +for (int i = 0; i < bitCount; i++) { + final int value = ba.get(i); + if (value != 0 && value != 1) { +error(name, bv, String.format( Review comment: Looks like string format expecting 4 parameters, but only two are passed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Validate batch checks, vector integretity in unit tests > --- > > Key: DRILL-7403 > URL: https://issues.apache.org/jira/browse/DRILL-7403 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.16.0, 1.17.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Labels: ready-to-commit > Fix For: 1.17.0 > > > Drill provides a {{BatchValidator}} that checks vectors. It is disabled by > default. This enhancement adds more checks, including checks for row counts > (of which there are surprisingly many.) 
> Since most operators will fail if the check is enabled, this enhancement also > adds a table to keep track of which operators pass the checks (and for which > checks should be enabled) and those that still need work. This allows the > checks to exist in the code, and to be enabled incrementally as we fix the > various problems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
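The offset-vector invariant tightened in the diff above can be stated standalone: for n values there are n + 1 offsets, the first offset is 0, offsets are non-decreasing, and no offset may exceed the size of the underlying data buffer. The following is a minimal illustration of that invariant, not Drill's actual `BatchValidator`; the class name and error-collection style are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the offset-vector check: collect one error message per violated
// rule instead of failing fast, mirroring the reporter style in the diff.
class OffsetVectorCheck {

  static List<String> validate(int[] offsets, int valueCount, int maxOffset) {
    List<String> errors = new ArrayList<>();
    // Offset vectors have (n + 1) entries for n values.
    if (offsets.length != valueCount + 1) {
      errors.add(String.format("Expected %d offsets, found %d",
          valueCount + 1, offsets.length));
    }
    if (offsets.length > 0 && offsets[0] != 0) {
      errors.add(String.format("Offset [0] must be 0 but was %d", offsets[0]));
    }
    int prev = offsets.length > 0 ? offsets[0] : 0;
    for (int i = 1; i < offsets.length; i++) {
      int offset = offsets[i];
      if (offset < prev) {
        errors.add(String.format(
            "Offset vector [%d] contained %d, expected >= %d", i, offset, prev));
      } else if (offset > maxOffset) {
        errors.add(String.format(
            "Invalid offset at index %d: %d exceeds maximum of %d", i, offset, maxOffset));
      }
      prev = offset;
    }
    return errors;
  }
}
```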
[jira] [Commented] (DRILL-7403) Validate batch checks, vector integrity in unit tests
[ https://issues.apache.org/jira/browse/DRILL-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955975#comment-16955975 ] ASF GitHub Bot commented on DRILL-7403: --- arina-ielchiieva commented on pull request #1871: DRILL-7403: Validate batch checks, vector integretity in unit tests URL: https://github.com/apache/drill/pull/1871#discussion_r336966265 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/validate/BatchValidator.java ## @@ -150,57 +424,63 @@ private int validateOffsetVector(String name, UInt4Vector offsetVector, int valu error(name, offsetVector, "Offset (0) must be 0 but was " + prevOffset); } -// Note <= comparison: offset vectors have (n+1) entries. - -for (int i = 1; i <= valueCount; i++) { - int offset = accessor.get(i); +for (int i = 1; i < offsetCount; i++) { + final int offset = accessor.get(i); if (offset < prevOffset) { -error(name, offsetVector, "Decreasing offsets at (" + (i-1) + ", " + i + ") = (" + prevOffset + ", " + offset + ")"); +error(name, offsetVector, String.format( +"Offset vector [%d] contained %d, expected >= %d", +i, offset, prevOffset)); } else if (offset > maxOffset) { -error(name, offsetVector, "Invalid offset at index " + i + " = " + offset + " exceeds maximum of " + maxOffset); +error(name, offsetVector, String.format( +"Invalid offset at index %d: %d exceeds maximum of %d", +i, offset, maxOffset)); } prevOffset = offset; } return prevOffset; } private void error(String name, ValueVector vector, String msg) { -if (errorCount == 0) { - logger.error("Found one or more vector errors from " + batch.getClass().getSimpleName()); -} -errorCount++; -if (errorCount >= MAX_ERRORS) { - return; -} -String fullMsg = "Column " + name + " of type " + vector.getClass().getSimpleName( ) + ": " + msg; -logger.error(fullMsg); -if (errorList != null) { - errorList.add(fullMsg); -} +errorReporter.error(name, vector, msg); } - private void validateNullableVector(String 
name, NullableVector vector) { -// Can't validate at this time because the bits vector is in each -// generated subtype. - -// Validate a VarChar vector because it is common. - -if (vector instanceof NullableVarCharVector) { - VarCharVector values = ((NullableVarCharVector) vector).getValuesVector(); - validateVarCharVector(name + "-values", values, rowCount); + private void verifyIsSetVector(ValueVector parent, UInt1Vector bv) { +final String name = String.format("%s (%s)-bits", +parent.getField().getName(), +parent.getClass().getSimpleName()); +final int rowCount = parent.getAccessor().getValueCount(); +final int bitCount = bv.getAccessor().getValueCount(); +if (bitCount != rowCount) { + error(name, bv, String.format( + "Value count = %d, but bit count = %d", + rowCount, bitCount)); +} +final UInt1Vector.Accessor ba = bv.getAccessor(); +for (int i = 0; i < bitCount; i++) { + final int value = ba.get(i); + if (value != 0 && value != 1) { +error(name, bv, String.format( +"%s %s: bit vector[%d] = %d, expected 0 or 1", +i, value)); + } } - } - - private void validateFixedWidthVector(String name, FixedWidthVector vector) { -// TODO Auto-generated method stub - } /** - * Obtain the list of errors. For use in unit-testing this class. - * @return the list of errors found, or null if error capture was - * not enabled + * Print a record batch. Uses code only available in a test build. + * Classes are not visible to the compiler; must load dynamically. + * Does nothing if the class is not available. */ - public List errors() { return errorList; } + public static void print(RecordBatch batch) { +try { + final Class helper = Class.forName("org.apache.drill.test.rowSet.RowSetUtilities"); Review comment: Checked `org.apache.drill.test.rowSet.RowSetUtilities` it does not have `print` method. Could you please check? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Validate batch checks, vector integretity in unit tests > --- > > Key: DRILL-7403 > URL: https://issues.apache.org/jira/browse/DRILL-7403 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.16.0, 1.17.0 >Rep
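The review comment above flags a `String.format` arity bug in the is-set ("bits") vector check: the template `"%s %s: bit vector[%d] = %d, expected 0 or 1"` declares four placeholders but only `i` and `value` are passed. A corrected standalone sketch of the same check follows; it is an illustration under assumed names (a plain `byte[]` stands in for the `UInt1Vector`), not Drill's actual `verifyIsSetVector`:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the is-set vector check: the bit count must match the row count,
// and every bit must be 0 or 1. The format template now takes exactly the
// three arguments supplied (name, index, value).
class IsSetVectorCheck {

  static List<String> validate(String name, byte[] bits, int rowCount) {
    List<String> errors = new ArrayList<>();
    if (bits.length != rowCount) {
      errors.add(String.format("Value count = %d, but bit count = %d",
          rowCount, bits.length));
    }
    for (int i = 0; i < bits.length; i++) {
      int value = bits[i];
      if (value != 0 && value != 1) {
        errors.add(String.format(
            "%s: bit vector[%d] = %d, expected 0 or 1", name, i, value));
      }
    }
    return errors;
  }
}
```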
[jira] [Commented] (DRILL-6096) Provide mechanisms to specify field delimiters and quoted text for TextRecordWriter
[ https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955965#comment-16955965 ] ASF GitHub Bot commented on DRILL-6096: --- arina-ielchiieva commented on issue #1873: DRILL-6096: Provide mechanism to configure text writer configuration URL: https://github.com/apache/drill/pull/1873#issuecomment-544469217 @paul-rogers thanks for the code review, addressed code review comments, force-pushed since there were minor changes in the code. Regarding design, the aim of this Jira was just to fix text writer to write proper text files: before if column contained field separator, field was not enclosed in the quotes, thus we were writing text files which Drill could not read. Now when user indicates write format using session option (this is common approach for all formats), Drill produces text files, it can read back. Basically, if user has configured format plugin: ``` "formats": { "csvh": { "type": "text", "extensions": [ "csvh" ], "lineDelimiter": "\n", "fieldDelimiter": ",", "extractHeader": true } }, ``` Drill will be able to read and write such text files correctly. Same approach is used for `parquet`, `json`. All user needs to do is to indicate write format using session option: `alter session set `store.format` = 'csvh';` (`parquet`, `json`). I am not saying this is ideal and we might need to reconsider such writing approach but I guess not in the scope of Jira since such re-design would touch all file writers. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Provide mechanisms to specify field delimiters and quoted text for > TextRecordWriter > --- > > Key: DRILL-6096 > URL: https://issues.apache.org/jira/browse/DRILL-6096 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.12.0 >Reporter: Kunal Khatua >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.17.0 > > > Currently, there is no way for a user to specify the field delimiter for the > writing records as a text output. Further more, if the fields contain the > delimiter, we have no mechanism of specifying quotes. > By default, quotes should be used to enclose non-numeric fields being written. > *Description of the implemented changes:* > 2 options are added to control text writer output: > {{store.text.writer.add_header}} - indicates if header should be added in > created text file. Default is true. > {{store.text.writer.force_quotes}} - indicates if all value should be quoted. > Default is false. It means only values that contain special characters (line > / field separators) will be quoted. > Line / field separators, quote / escape characters can be configured using > text format configuration using Web UI. User can create special format only > for writing data and then use it when creating files. Though such format can > be always used to read back written data. > {noformat} > "formats": { > "write_text": { > "type": "text", > "extensions": [ > "txt" > ], > "lineDelimiter": "\n", > "fieldDelimiter": "!", > "quote": "^", > "escape": "^", > } >}, > ... > {noformat} > Next set specified format and create text file: > {noformat} > alter session set `store.format` = 'write_text'; > create table dfs.tmp.t as select 1 as id from (values(1)); > {noformat} > Notes: > 1. 
To write data, univocity-parsers are used; they limit the line separator > length to at most 2 characters. Drill allows setting a line separator longer > than 2 characters, since Drill can read data split by a line separator of any > length, but in that case an exception will be thrown during data write. > 2. {{extractHeader}} in the text format configuration does not affect whether a header > will be written to the text file; only {{store.text.writer.add_header}} controls > this. {{extractHeader}} is used only when reading the data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
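The default quoting rule described in this issue (with {{store.text.writer.force_quotes}} = false, only values containing special characters are quoted) can be sketched without the univocity dependency. The delimiters below mirror the sample `write_text` format; the method and class names are illustrative, not Drill's `TextRecordWriter`:

```java
// Sketch of conditional quoting: a value is enclosed in the quote character
// only when forced, or when it contains the field delimiter, the line
// delimiter, or the quote character itself (which is escaped first).
class QuotingSketch {

  static String writeField(String value, char fieldDelim, String lineDelim,
                           char quote, char escape, boolean forceQuotes) {
    boolean needsQuotes = forceQuotes
        || value.indexOf(fieldDelim) >= 0
        || value.contains(lineDelim)
        || value.indexOf(quote) >= 0;
    if (!needsQuotes) {
      return value;
    }
    // Escape embedded quote characters, then enclose the whole value.
    String escaped = value.replace(String.valueOf(quote), "" + escape + quote);
    return quote + escaped + quote;
  }
}
```

With the `!` field delimiter and `^` quote/escape from the sample format, `a!b` becomes `^a!b^`, while `abc` passes through unquoted unless `force_quotes` is set.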
[jira] [Commented] (DRILL-6096) Provide mechanisms to specify field delimiters and quoted text for TextRecordWriter
[ https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955957#comment-16955957 ] ASF GitHub Bot commented on DRILL-6096: --- arina-ielchiieva commented on pull request #1873: DRILL-6096: Provide mechanism to configure text writer configuration URL: https://github.com/apache/drill/pull/1873#discussion_r336918266 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/writer/TextRecordWriter.java ## @@ -165,10 +182,12 @@ public void writeField() throws IOException { @Override public void cleanup() throws IOException { super.cleanup(); -if (stream != null) { - stream.close(); - stream = null; - logger.debug("closing file"); + +fRecordStarted = false; +if (writer != null) { + writer.close(); Review comment: Good point, will update the code. Caught `IllegalStateException`, since `writer.close()` can throw only this exception, and wrapped it into `IOException`, since the `WriterRecordBatch#closeWriter` method handles this type of exception properly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Provide mechanisms to specify field delimiters and quoted text for > TextRecordWriter > --- > > Key: DRILL-6096 > URL: https://issues.apache.org/jira/browse/DRILL-6096 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.12.0 >Reporter: Kunal Khatua >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.17.0 > > > Currently, there is no way for a user to specify the field delimiter for the > writing records as a text output. Further more, if the fields contain the > delimiter, we have no mechanism of specifying quotes.
> By default, quotes should be used to enclose non-numeric fields being written. > *Description of the implemented changes:* > 2 options are added to control text writer output: > {{store.text.writer.add_header}} - indicates if header should be added in > created text file. Default is true. > {{store.text.writer.force_quotes}} - indicates if all value should be quoted. > Default is false. It means only values that contain special characters (line > / field separators) will be quoted. > Line / field separators, quote / escape characters can be configured using > text format configuration using Web UI. User can create special format only > for writing data and then use it when creating files. Though such format can > be always used to read back written data. > {noformat} > "formats": { > "write_text": { > "type": "text", > "extensions": [ > "txt" > ], > "lineDelimiter": "\n", > "fieldDelimiter": "!", > "quote": "^", > "escape": "^", > } >}, > ... > {noformat} > Next set specified format and create text file: > {noformat} > alter session set `store.format` = 'write_text'; > create table dfs.tmp.t as select 1 as id from (values(1)); > {noformat} > Notes: > 1. To write data univocity-parsers are used, they limit line separator length > to not more than 2 characters, though Drill allows setting more 2 chars as > line separator since Drill can read data splitting by line separator of any > length, during data write exception will be thrown. > 2. {{extractHeader}} in text format configuration does not affect if header > will be written to text file, only {{store.text.writer.add_header}} controls > this action. {{extractHeader}} is used only when reading the data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-6096) Provide mechanisms to specify field delimiters and quoted text for TextRecordWriter
[ https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955901#comment-16955901 ] ASF GitHub Bot commented on DRILL-6096: --- arina-ielchiieva commented on pull request #1873: DRILL-6096: Provide mechanism to configure text writer configuration URL: https://github.com/apache/drill/pull/1873#discussion_r336918266 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/writer/TextRecordWriter.java ## @@ -165,10 +182,12 @@ public void writeField() throws IOException { @Override public void cleanup() throws IOException { super.cleanup(); -if (stream != null) { - stream.close(); - stream = null; - logger.debug("closing file"); + +fRecordStarted = false; +if (writer != null) { + writer.close(); Review comment: Good point, will update the code. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Provide mechanisms to specify field delimiters and quoted text for > TextRecordWriter > --- > > Key: DRILL-6096 > URL: https://issues.apache.org/jira/browse/DRILL-6096 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.12.0 >Reporter: Kunal Khatua >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.17.0 > > > Currently, there is no way for a user to specify the field delimiter for the > writing records as a text output. Further more, if the fields contain the > delimiter, we have no mechanism of specifying quotes. > By default, quotes should be used to enclose non-numeric fields being written. 
> *Description of the implemented changes:* > 2 options are added to control text writer output: > {{store.text.writer.add_header}} - indicates if header should be added in > created text file. Default is true. > {{store.text.writer.force_quotes}} - indicates if all value should be quoted. > Default is false. It means only values that contain special characters (line > / field separators) will be quoted. > Line / field separators, quote / escape characters can be configured using > text format configuration using Web UI. User can create special format only > for writing data and then use it when creating files. Though such format can > be always used to read back written data. > {noformat} > "formats": { > "write_text": { > "type": "text", > "extensions": [ > "txt" > ], > "lineDelimiter": "\n", > "fieldDelimiter": "!", > "quote": "^", > "escape": "^", > } >}, > ... > {noformat} > Next set specified format and create text file: > {noformat} > alter session set `store.format` = 'write_text'; > create table dfs.tmp.t as select 1 as id from (values(1)); > {noformat} > Notes: > 1. To write data univocity-parsers are used, they limit line separator length > to not more than 2 characters, though Drill allows setting more 2 chars as > line separator since Drill can read data splitting by line separator of any > length, during data write exception will be thrown. > 2. {{extractHeader}} in text format configuration does not affect if header > will be written to text file, only {{store.text.writer.add_header}} controls > this action. {{extractHeader}} is used only when reading the data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-6096) Provide mechanisms to specify field delimiters and quoted text for TextRecordWriter
[ https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955900#comment-16955900 ] ASF GitHub Bot commented on DRILL-6096: --- arina-ielchiieva commented on pull request #1873: DRILL-6096: Provide mechanism to configure text writer configuration URL: https://github.com/apache/drill/pull/1873#discussion_r336916683 ## File path: exec/java-exec/src/test/java/org/apache/drill/test/ClusterFixture.java ## @@ -57,10 +40,27 @@ import org.apache.drill.exec.store.mock.MockStorageEngineConfig; import org.apache.drill.exec.store.sys.store.provider.ZookeeperPersistentStoreProvider; import org.apache.drill.exec.util.StoragePluginTestUtils; - import org.apache.drill.shaded.guava.com.google.common.base.Charsets; import org.apache.drill.shaded.guava.com.google.common.base.Preconditions; +import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableMap; import org.apache.drill.shaded.guava.com.google.common.io.Resources; +import org.apache.drill.test.DrillTestWrapper.TestServices; + +import java.io.File; +import java.io.IOException; +import java.net.URI; +import java.net.URL; +import java.nio.file.Paths; +import java.sql.Connection; +import java.sql.DriverManager; +import java.sql.SQLException; +import java.util.ArrayList; +import java.util.Collection; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Optional; +import java.util.Properties; Review comment: I use standard IntelliJ IDEA imports order: https://user-images.githubusercontent.com/15086720/67193155-811dcc80-f3fd-11e9-8434-53b9a5f598cf.png";> Regarding updating the check styles, there is a Jira (https://issues.apache.org/jira/browse/DRILL-7352) where we can post comments. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Provide mechanisms to specify field delimiters and quoted text for > TextRecordWriter > --- > > Key: DRILL-6096 > URL: https://issues.apache.org/jira/browse/DRILL-6096 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV >Affects Versions: 1.12.0 >Reporter: Kunal Khatua >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.17.0 > > > Currently, there is no way for a user to specify the field delimiter for the > writing records as a text output. Further more, if the fields contain the > delimiter, we have no mechanism of specifying quotes. > By default, quotes should be used to enclose non-numeric fields being written. > *Description of the implemented changes:* > 2 options are added to control text writer output: > {{store.text.writer.add_header}} - indicates if header should be added in > created text file. Default is true. > {{store.text.writer.force_quotes}} - indicates if all value should be quoted. > Default is false. It means only values that contain special characters (line > / field separators) will be quoted. > Line / field separators, quote / escape characters can be configured using > text format configuration using Web UI. User can create special format only > for writing data and then use it when creating files. Though such format can > be always used to read back written data. > {noformat} > "formats": { > "write_text": { > "type": "text", > "extensions": [ > "txt" > ], > "lineDelimiter": "\n", > "fieldDelimiter": "!", > "quote": "^", > "escape": "^", > } >}, > ... > {noformat} > Next set specified format and create text file: > {noformat} > alter session set `store.format` = 'write_text'; > create table dfs.tmp.t as select 1 as id from (values(1)); > {noformat} > Notes: > 1. 
To write data, univocity-parsers are used; they limit the line separator > length to at most 2 characters. Drill allows setting a line separator longer > than 2 characters, since Drill can read data split by a line separator of any > length, but in that case an exception will be thrown during data write. > 2. {{extractHeader}} in the text format configuration does not affect whether a header > will be written to the text file; only {{store.text.writer.add_header}} controls > this. {{extractHeader}} is used only when reading the data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (DRILL-5183) Drill doesn't seem to handle array values correctly in Parquet files
[ https://issues.apache.org/jira/browse/DRILL-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Guzenko resolved DRILL-5183. - Resolution: Fixed Fixed in DRILL-7268. > Drill doesn't seem to handle array values correctly in Parquet files > > > Key: DRILL-5183 > URL: https://issues.apache.org/jira/browse/DRILL-5183 > Project: Apache Drill > Issue Type: Bug >Reporter: Dave Kincaid >Assignee: Igor Guzenko >Priority: Major > Attachments: books.parquet > > > It looks to me that Drill is not properly converting array values in Parquet > records. I have created a simple example and will attach a simple Parquet > file to this issue. If I write Parquet records using the Avro schema > {code:title=Book.avsc} > { "type": "record", > "name": "Book", > "fields": [ > { "name": "title", "type": "string" }, > { "name": "pages", "type": "int" }, > { "name": "authors", "type": {"type": "array", "items": "string"} } > ] > } > {code} > I write two records using this schema into the attached Parquet file and then > simply run {{SELECT * FROM dfs.`books.parquet`}} I get the following result: > ||title||pages||authors|| > |Physics of Waves|477|{"array":["William C. Elmore","Mark A. Heald"]}| > |Foundations of Mathematical Analysis|428|{"array":["Richard > Johnsonbaugh","W.E. Pfaffenberger"]}| > You can see that the authors column seems to be a nested record with the name > "array" instead of being a repeated value. If I change the SQL query to > {{SELECT title,pages,t.authors.`array` FROM > dfs.`/home/davek/src/drill-parquet-example/resources/books.parquet` t;}} then > I get: > ||title||pages||EXPR$2|| > |Physics of Waves|477|["William C. Elmore","Mark A. Heald"]| > |Foundations of Mathematical Analysis|428|["Richard Johnsonbaugh","W.E. > Pfaffenberger"]| > and now that column behaves in Drill as a repeated values column. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-5183) Drill doesn't seem to handle array values correctly in Parquet files
[ https://issues.apache.org/jira/browse/DRILL-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Guzenko updated DRILL-5183: Fix Version/s: 1.17.0 > Drill doesn't seem to handle array values correctly in Parquet files > > > Key: DRILL-5183 > URL: https://issues.apache.org/jira/browse/DRILL-5183 > Project: Apache Drill > Issue Type: Bug >Reporter: Dave Kincaid >Assignee: Igor Guzenko >Priority: Major > Fix For: 1.17.0 > > Attachments: books.parquet > > > It looks to me that Drill is not properly converting array values in Parquet > records. I have created a simple example and will attach a simple Parquet > file to this issue. If I write Parquet records using the Avro schema > {code:title=Book.avsc} > { "type": "record", > "name": "Book", > "fields": [ > { "name": "title", "type": "string" }, > { "name": "pages", "type": "int" }, > { "name": "authors", "type": {"type": "array", "items": "string"} } > ] > } > {code} > I write two records using this schema into the attached Parquet file and then > simply run {{SELECT * FROM dfs.`books.parquet`}} I get the following result: > ||title||pages||authors|| > |Physics of Waves|477|{"array":["William C. Elmore","Mark A. Heald"]}| > |Foundations of Mathematical Analysis|428|{"array":["Richard > Johnsonbaugh","W.E. Pfaffenberger"]}| > You can see that the authors column seems to be a nested record with the name > "array" instead of being a repeated value. If I change the SQL query to > {{SELECT title,pages,t.authors.`array` FROM > dfs.`/home/davek/src/drill-parquet-example/resources/books.parquet` t;}} then > I get: > ||title||pages||EXPR$2|| > |Physics of Waves|477|["William C. Elmore","Mark A. Heald"]| > |Foundations of Mathematical Analysis|428|["Richard Johnsonbaugh","W.E. > Pfaffenberger"]| > and now that column behaves in Drill as a repeated values column. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (DRILL-5183) Drill doesn't seem to handle array values correctly in Parquet files
[ https://issues.apache.org/jira/browse/DRILL-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Guzenko reassigned DRILL-5183: --- Assignee: Igor Guzenko > Drill doesn't seem to handle array values correctly in Parquet files > > > Key: DRILL-5183 > URL: https://issues.apache.org/jira/browse/DRILL-5183 > Project: Apache Drill > Issue Type: Bug > Reporter: Dave Kincaid > Assignee: Igor Guzenko > Priority: Major > Attachments: books.parquet

[jira] [Updated] (DRILL-1999) Drill should expose the Parquet logical schema rather than the physical schema
[ https://issues.apache.org/jira/browse/DRILL-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Guzenko updated DRILL-1999: Fix Version/s: (was: Future) 1.17.0 > Drill should expose the Parquet logical schema rather than the physical schema > -- > > Key: DRILL-1999 > URL: https://issues.apache.org/jira/browse/DRILL-1999 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Reporter: Ramana Inukonda Nagaraj >Assignee: Igor Guzenko >Priority: Major > Fix For: 1.17.0 > > Attachments: hive_alltypes.parquet > > > Created a parquet file in hive having the following DDL > hive> desc alltypesparquet; > OK > c1 int > c2 boolean > c3 double > c4 string > c5 array > c6 map > c7 map > c8 struct > c9 tinyint > c10 smallint > c11 float > c12 bigint > c13 array> > c15 struct> > c16 array,n:int>> > Time taken: 0.076 seconds, Fetched: 15 row(s) > column5 which is an array of integers shows up as a bag when querying through > drill > 0: jdbc:drill:> select c5 from `/user/hive/warehouse/alltypesparquet`; > ++ > | c5 | > ++ > | {"bag":[]} | > | {"bag":[]} | > | {"bag":[{"array_element":1},{"array_element":2}]} | > ++ > 3 rows selected (0.085 seconds) > While from hive > hive> select c5 from alltypesparquet; > OK > NULL > NULL > [1,2] -- This message was sent by Atlassian Jira (v8.3.4#803005)
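The Hive output above shows the logical values Drill should expose. A minimal sketch of the mapping, assuming an empty bag corresponds to Hive's NULL (an inference from the side-by-side output; the function name is hypothetical, not Drill code):

```python
# Sketch of the mapping DRILL-1999 asks for: Hive-written Parquet
# lists surface in Drill as {"bag": [{"array_element": v}, ...]},
# while the logical value is [v, ...].
# bag_to_list is a hypothetical helper for illustration only.

def bag_to_list(value):
    if isinstance(value, dict) and set(value) == {"bag"}:
        elems = [e["array_element"] for e in value["bag"]]
        # Assumption: empty bag maps to NULL, matching the Hive output shown.
        return elems if elems else None
    return value

physical = {"bag": [{"array_element": 1}, {"array_element": 2}]}
# bag_to_list(physical) -> [1, 2], the value Hive reports for c5
```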
[jira] [Resolved] (DRILL-1999) Drill should expose the Parquet logical schema rather than the physical schema
[ https://issues.apache.org/jira/browse/DRILL-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Guzenko resolved DRILL-1999. - Resolution: Fixed Fixed in scope of DRILL-7268. > Drill should expose the Parquet logical schema rather than the physical schema > -- > > Key: DRILL-1999 > URL: https://issues.apache.org/jira/browse/DRILL-1999 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet > Reporter: Ramana Inukonda Nagaraj > Assignee: Igor Guzenko > Priority: Major > Fix For: Future > > Attachments: hive_alltypes.parquet
[jira] [Assigned] (DRILL-1999) Drill should expose the Parquet logical schema rather than the physical schema
[ https://issues.apache.org/jira/browse/DRILL-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Guzenko reassigned DRILL-1999: --- Assignee: Igor Guzenko > Drill should expose the Parquet logical schema rather than the physical schema > -- > > Key: DRILL-1999 > URL: https://issues.apache.org/jira/browse/DRILL-1999 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet > Reporter: Ramana Inukonda Nagaraj > Assignee: Igor Guzenko > Priority: Major > Fix For: Future > > Attachments: hive_alltypes.parquet
[jira] [Commented] (DRILL-7405) Build fails due to inaccessible apache-drill on S3 storage
[ https://issues.apache.org/jira/browse/DRILL-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955890#comment-16955890 ] ASF GitHub Bot commented on DRILL-7405: --- vvysotskyi commented on issue #1874: DRILL-7405: Avoiding download of TPC-H data URL: https://github.com/apache/drill/pull/1874#issuecomment-544422550 @paul-rogers, yes, these files are used in unit tests, mostly in the `java-exec` module. Currently, `contrib/data/tpch-sample-data` is built before `exec/Java Execution Engine`, so there shouldn't be any problems. The main reason for my proposal to use JitPack was to preserve the existing behavior and, as side effects, to avoid growing the project sources and complicating life for the version control system. If the consensus is that these files are small enough, let's continue with the current approach. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Build fails due to inaccessible apache-drill on S3 storage > -- > > Key: DRILL-7405 > URL: https://issues.apache.org/jira/browse/DRILL-7405 > Project: Apache Drill > Issue Type: Task > Components: Tools, Build & Test > Affects Versions: 1.16.0 > Reporter: Boaz Ben-Zvi > Assignee: Abhishek Girish > Priority: Critical > Fix For: 1.17.0 > > > A new clean build (e.g. after deleting the ~/.m2 local repository) would fail now due to: > Access denied to: > http://apache-drill.s3.amazonaws.com/files/sf-0.01_tpc-h_parquet_typed.tgz > (e.g., for the test data sf-0.01_tpc-h_parquet_typed.tgz ) > A new publicly available storage place is needed, plus appropriate changes in Drill to get to these resources.
[jira] [Commented] (DRILL-7177) Format Plugin for Excel Files
[ https://issues.apache.org/jira/browse/DRILL-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955880#comment-16955880 ] ASF GitHub Bot commented on DRILL-7177: --- arina-ielchiieva commented on pull request #1749: DRILL-7177: Format Plugin for Excel Files URL: https://github.com/apache/drill/pull/1749#discussion_r336901086 ## File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java ## @@ -0,0 +1,444 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.drill.exec.store.excel; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.common.types.TypeProtos; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework; +import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader; +import org.apache.drill.exec.physical.resultSet.ResultSetLoader; +import org.apache.drill.exec.physical.resultSet.RowSetLoader; +import org.apache.drill.exec.record.metadata.ColumnMetadata; +import org.apache.drill.exec.record.metadata.MetadataUtils; +import org.apache.drill.exec.record.metadata.SchemaBuilder; +import org.apache.drill.exec.record.metadata.TupleMetadata; +import org.apache.drill.exec.vector.accessor.ScalarWriter; +import org.apache.drill.exec.vector.accessor.TupleWriter; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.mapred.FileSplit; +import org.apache.poi.ss.usermodel.Cell; +import org.apache.poi.ss.usermodel.CellValue; +import org.apache.poi.ss.usermodel.DateUtil; +import org.apache.poi.ss.usermodel.FormulaEvaluator; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.xssf.usermodel.XSSFSheet; +import org.apache.poi.xssf.usermodel.XSSFWorkbook; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator; +import org.joda.time.Instant; +import java.util.Iterator; +import java.io.IOException; +import java.util.ArrayList; + +public class ExcelBatchReader implements ManagedReader { + private ExcelReaderConfig readerConfig; + + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class); + + private static final String SAFE_WILDCARD = "_$"; + + private static final String SAFE_SEPARATOR = "_"; + + private static final String PARSER_WILDCARD = ".*"; + + private static final String HEADER_NEW_LINE_REPLACEMENT = "__"; + + private static final String MISSING_FIELD_NAME_HEADER = 
"field_"; + + private XSSFSheet sheet; + + private XSSFWorkbook workbook; + + private FSDataInputStream fsStream; Review comment: ```suggestion private InputStream fsStream; ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Format Plugin for Excel Files > - > > Key: DRILL-7177 > URL: https://issues.apache.org/jira/browse/DRILL-7177 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.17.0 >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Major > Labels: doc-impacting > Fix For: 1.17.0 > > > This pull request adds the functionality which enables Drill to query > Microsoft Excel files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7177) Format Plugin for Excel Files
[ https://issues.apache.org/jira/browse/DRILL-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955879#comment-16955879 ] ASF GitHub Bot commented on DRILL-7177: --- arina-ielchiieva commented on pull request #1749: DRILL-7177: Format Plugin for Excel Files URL: https://github.com/apache/drill/pull/1749#discussion_r336900967 ## File path: contrib/format-excel/src/main/java/org/apache/drill/exec/store/excel/ExcelBatchReader.java ## @@ -0,0 +1,444 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.drill.exec.store.excel; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.common.types.TypeProtos; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework; +import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader; +import org.apache.drill.exec.physical.resultSet.ResultSetLoader; +import org.apache.drill.exec.physical.resultSet.RowSetLoader; +import org.apache.drill.exec.record.metadata.ColumnMetadata; +import org.apache.drill.exec.record.metadata.MetadataUtils; +import org.apache.drill.exec.record.metadata.SchemaBuilder; +import org.apache.drill.exec.record.metadata.TupleMetadata; +import org.apache.drill.exec.vector.accessor.ScalarWriter; +import org.apache.drill.exec.vector.accessor.TupleWriter; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.mapred.FileSplit; +import org.apache.poi.ss.usermodel.Cell; +import org.apache.poi.ss.usermodel.CellValue; +import org.apache.poi.ss.usermodel.DateUtil; +import org.apache.poi.ss.usermodel.FormulaEvaluator; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.xssf.usermodel.XSSFSheet; +import org.apache.poi.xssf.usermodel.XSSFWorkbook; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator; +import org.joda.time.Instant; +import java.util.Iterator; +import java.io.IOException; +import java.util.ArrayList; + +public class ExcelBatchReader implements ManagedReader { + private ExcelReaderConfig readerConfig; + + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ExcelBatchReader.class); + + private static final String SAFE_WILDCARD = "_$"; + + private static final String SAFE_SEPARATOR = "_"; + + private static final String PARSER_WILDCARD = ".*"; + + private static final String HEADER_NEW_LINE_REPLACEMENT = "__"; + + private static final String MISSING_FIELD_NAME_HEADER = 
"field_"; + + private XSSFSheet sheet; + + private XSSFWorkbook workbook; + + private FSDataInputStream fsStream; + + private FormulaEvaluator evaluator; + + private ArrayList excelFieldNames; + + private ArrayList columnWriters; + + private Iterator rowIterator; + + private RowSetLoader rowWriter; + + private int totalColumnCount; + + private int lineCount; + + private boolean firstLine; + + private FileSplit split; + + private ResultSetLoader loader; + + private int recordCount; + + public static class ExcelReaderConfig { +protected final ExcelFormatPlugin plugin; + +protected final int headerRow; + +protected final int lastRow; + +protected final int firstColumn; + +protected final int lastColumn; + +protected final boolean readAllFieldsAsVarChar; + +protected String sheetName; + +public ExcelReaderConfig(ExcelFormatPlugin plugin) { + this.plugin = plugin; + headerRow = plugin.getConfig().getHeaderRow(); + lastRow = plugin.getConfig().getLastRow(); + firstColumn = plugin.getConfig().getFirstColumn(); + lastColumn = plugin.getConfig().getLastColumn(); + readAllFieldsAsVarChar = plugin.getConfig().getReadAllFieldsAsVarChar(); + sheetName = plugin.getConfig().getSheetName(); +} + } + + public ExcelBatchReader(ExcelReaderConfig readerConfig) { +this.readerConfig = readerConfig; +firstLine = true; + } + + @Override + public boolean open(FileSchemaNegotiator negotiator) { +verifyConfigOptions(); +split = negotiator.split(); +loader = negotiator.build(); +rowWriter = loa
[jira] [Commented] (DRILL-4303) ESRI Shapefile (shp) format plugin
[ https://issues.apache.org/jira/browse/DRILL-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955877#comment-16955877 ] ASF GitHub Bot commented on DRILL-4303: --- arina-ielchiieva commented on pull request #1858: DRILL-4303: ESRI Shapefile (shp) Format Plugin URL: https://github.com/apache/drill/pull/1858#discussion_r336899443 ## File path: contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java ## @@ -0,0 +1,334 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.store.esri; + +import com.esri.core.geometry.Geometry; +import com.esri.core.geometry.GeometryCursor; +import com.esri.core.geometry.ShapefileReader; +import com.esri.core.geometry.SpatialReference; +import com.esri.core.geometry.ogc.OGCGeometry; +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.common.types.TypeProtos; +import org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.FileSchemaNegotiator; +import org.apache.drill.exec.physical.impl.scan.framework.ManagedReader; +import org.apache.drill.exec.physical.resultSet.ResultSetLoader; +import org.apache.drill.exec.physical.resultSet.RowSetLoader; +import org.apache.drill.exec.record.metadata.ColumnMetadata; +import org.apache.drill.exec.record.metadata.MetadataUtils; +import org.apache.drill.exec.record.metadata.SchemaBuilder; +import org.apache.drill.exec.vector.accessor.ScalarWriter; +import org.apache.drill.exec.vector.accessor.TupleWriter; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.mapred.FileSplit; +import org.jamel.dbf.DbfReader; +import org.jamel.dbf.structure.DbfField; +import org.joda.time.Instant; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.BufferedReader; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.charset.Charset; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +public class ShpBatchReader implements ManagedReader { + + private FileSplit split; + private BufferedReader reader; + private ResultSetLoader loader; + private ShpReaderConfig readerConfig; + private Path hadoopShp; + private Path hadoopDbf; + private Path hadoopPrj; + private FSDataInputStream fileReaderShp = null; + private FSDataInputStream fileReaderDbf = null; + private FSDataInputStream fileReaderPrj = null; + private GeometryCursor geomCursor = null; + private DbfReader dbfReader = null; + private 
ScalarWriter gidWriter; + private ScalarWriter sridWriter; + private ScalarWriter shapeTypeWriter; + private ScalarWriter geomWriter; + private RowSetLoader rowWriter; + + + private int srid; + private SpatialReference spatialReference; + private static final Logger logger = LoggerFactory.getLogger(ShpBatchReader.class); + + public static class ShpReaderConfig { +protected final ShpFormatPlugin plugin; + +public ShpReaderConfig(ShpFormatPlugin plugin) { + this.plugin = plugin; +} + } + + public ShpBatchReader(ShpReaderConfig readerConfig) { +this.readerConfig = readerConfig; + } + + @Override + public boolean open(FileSchemaNegotiator negotiator) { +this.split = negotiator.split(); +this.hadoopShp = split.getPath(); +this.hadoopDbf = new Path(split.getPath().toString().replace("shp", "dbf")); +this.hadoopPrj = new Path(split.getPath().toString().replace("shp", "prj")); + +openFile(negotiator); +SchemaBuilder builder = new SchemaBuilder(); +builder.addNullable("gid", TypeProtos.MinorType.INT); +builder.addNullable("srid", TypeProtos.MinorType.INT); +builder.addNullable("shapeType", TypeProtos.MinorType.VARCHAR); +builder.addNullable("geom", TypeProtos.MinorType.VARBINARY); + +negotiator.setTableSchema(builder.buildSchema(), false); +loader = negotiator.build(); + +rowWriter = loader.writer(); +gidWriter = rowWriter.scalar("gid"); +sridWriter = rowWriter.scalar("srid"); +shapeTypeWriter = rowWriter.scalar("shapeType"); +geomWriter = rowWriter.scalar("geom"); + +return true; + }
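One detail worth noting in open() above: the sibling .dbf and .prj paths are derived with String.replace("shp", ...), which rewrites every occurrence of "shp" in the path, not just the extension. A sketch of extension-only replacement (the helper name is hypothetical, not code from the PR):

```python
# Sketch: derive a sibling file path by swapping only the final
# extension, avoiding the pitfall of a blanket substring replace.

def sibling(path: str, new_ext: str) -> str:
    base, dot, _ = path.rpartition(".")
    return f"{base}.{new_ext}" if dot else path

print(sibling("/data/shp/roads.shp", "dbf"))            # /data/shp/roads.dbf
print("/data/shp/roads.shp".replace("shp", "dbf"))      # /data/dbf/roads.dbf (wrong directory)
```

The second line shows how the blanket replace also rewrites a directory component named "shp", pointing the reader at a nonexistent file.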