[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121418#comment-16121418
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132406973
  
--- Diff: exec/java-exec/src/test/java/org/apache/drill/BaseTestQuery.java ---
@@ -639,14 +644,18 @@ protected static void copyDirectoryIntoTempSpace(String resourcesDir, String des
    * @param srcFileOnClassPath the source path of metadata cache file, which should be replaced
    * @param destFolderInTmp  the parent folder name of the metadata cache file
    * @param metaFileName the name of metadata cache file depending on the type of the metadata
+   * @param customStringReplacement custom string to replace the "CUSTOM_REPLACED" target string in metadata file
    * @throws IOException if a create or write errors occur
    */
-  protected static void copyMetaDataCacheToTempReplacingInternalPaths(String srcFileOnClassPath, String destFolderInTmp,
-      String metaFileName) throws IOException {
+  protected static void copyMetaDataCacheToTempWithReplacements(String srcFileOnClassPath,
+      String destFolderInTmp, String metaFileName, String customStringReplacement) throws IOException {
     String metadataFileContents = getFile(srcFileOnClassPath);
     Path rootMeta = new Path(dfsTestTmpSchemaLocation, destFolderInTmp);
     Path newMetaCache = new Path(rootMeta, metaFileName);
     FSDataOutputStream outSteam = fs.create(newMetaCache);
+    if (customStringReplacement != null) {
+      metadataFileContents = metadataFileContents.replace("CUSTOM_STRING_REPLACEMENT", customStringReplacement);
+    }
     outSteam.writeBytes(metadataFileContents.replace("REPLACED_IN_TEST", dfsTestTmpSchemaLocation));
     outSteam.close();
--- End diff --

Please use try-with-resources to ensure that the stream is closed in any case.
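
A minimal sketch of the suggested change, assuming the surrounding fields from BaseTestQuery (fs, dfsTestTmpSchemaLocation) and the local variables of the method above:

    // try-with-resources closes the stream even if a write throws
    try (FSDataOutputStream outStream = fs.create(newMetaCache)) {
      if (customStringReplacement != null) {
        metadataFileContents = metadataFileContents.replace("CUSTOM_STRING_REPLACEMENT", customStringReplacement);
      }
      outStream.writeBytes(metadataFileContents.replace("REPLACED_IN_TEST", dfsTestTmpSchemaLocation));
    }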


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> ----------------------------------------------------------------------------
>
>                 Key: DRILL-5660
>                 URL: https://issues.apache.org/jira/browse/DRILL-5660
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.11.0
>            Reporter: Paul Rogers
>            Assignee: Vitalii Diravka
>              Labels: doc-impacting
>             Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store relative paths. All Drill servers after that PR create files with relative paths, but the version number of the file is unchanged, so older Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and right. Drill will resolve the paths, but does so relative to the user's HDFS home directory, which is wrong.
> We should have bumped the parquet metadata file version number so that older Drillbits can't read the file. This ticket requests that we do that.
> Now, one could argue that the lack of a version number change is fine: once a user upgrades Drill, they won't use an old Drillbit. But things are not that simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in which metadata files have been created by a post-DRILL-3867 build. (This has already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll back to Drill 1.10. Doing so will cause queries to fail due to seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on others. Once a 1.11 server is installed, the metadata is updated ("corrupted" from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * bump the file version number when the file format changes, and
> * refuse to read files with a version newer than the software was designed for.
> Of course, it is highly desirable that newer servers read old files, but that is not the issue here.
> *Main technical points of how parquet metadata caching now works.*
> Only the process of reading the parquet metadata is changed (the writing process is unchanged):
> +1. Metadata files are valid:+
> Metadata objects are created by deserializing the parquet metadata files while the ParquetGroupScan physical operator is created.
> All supported versions are stored in the "MetadataVersion.Constants" class and in the Jackson annotations of the Metadata.ParquetTableMetadataBase class.
> +2. The metadata file version isn't supported (the file was created by a newer Drill version). The Drill table has at least one metadata file of an unsupported version:+
> A JsonMappingException is caught and swallowed without creating a metadata object, and an error message is logged. The state is stored in the MetadataContext, so the metadata file will not be deserialized again while the current query executes. The physical plan is created without using parquet metadata caching. A warning message is logged on every further "is metadata corrupted" check.
> +3. The Drill table has at least one corrupted metadata file, which can't be deserialized:+
> A JsonParseException is caught; then the same behaviour as for the unsupported-version files.
> +4. The metadata file was removed by another process:+
> A FileNotFoundException is caught; then the same behaviour as for the unsupported-version files.
> New metadata versions should be added in the following manner:
> 1. Increase the metadata major version if the metadata structure is changed.
> 2. Increase the metadata minor version if only the metadata content is changed but the metadata structure is the same.
> For the first case a new metadata structure (class) should be created (possible an improvem…
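
A condensed sketch of the read-path behaviour described above. The exception types and the context flag follow the description; the method shape, the names (mapper, is, path, logger, metaContext), and the setter are assumptions for illustration, not Drill's actual code:

    try {
      // deserialize the metadata cache file with Jackson
      parquetTableMetadata = mapper.readValue(is, ParquetTableMetadataBase.class);
    } catch (JsonMappingException e) {
      // case 2: metadata file of an unsupported (newer) version
      logger.error("Unsupported metadata version in {}", path, e);
      metaContext.setMetadataFilesMissingOrCorrupted(true); // no re-read during this query
    } catch (JsonParseException e) {
      // case 3: corrupted metadata file that can't be deserialized
      logger.error("Corrupted metadata file {}", path, e);
      metaContext.setMetadataFilesMissingOrCorrupted(true);
    } catch (FileNotFoundException e) {
      // case 4: metadata file removed by another process
      logger.error("Metadata file {} is missing", path, e);
      metaContext.setMetadataFilesMissingOrCorrupted(true);
    }
    // with the flag set, the physical plan is built without the metadata cache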

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121407#comment-16121407
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132406353
  
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetMetadataCache.java ---
@@ -452,33 +458,202 @@ public void testMoveCache() throws Exception {

   @Test
   public void testMetadataCacheAbsolutePaths() throws Exception {
+    final String absolutePathsMetadata = "absolute_paths_metadata";
     try {
       test("use dfs_test.tmp");
-      final String relative_path_metadata_t1 = RELATIVE_PATHS_METADATA + "/t1";
-      final String relative_path_metadata_t2 = RELATIVE_PATHS_METADATA + "/t2";
-      test("create table `%s` as select * from cp.`tpch/nation.parquet`", relative_path_metadata_t1);
-      test("create table `%s` as select * from cp.`tpch/nation.parquet`", relative_path_metadata_t2);
+      // creating two inner directories to leverage METADATA_DIRECTORIES_FILENAME metadata file as well
+      final String absolutePathsMetadataT1 = absolutePathsMetadata + "/t1";
+      final String absolutePathsMetadataT2 = absolutePathsMetadata + "/t2";
+      test("create table `%s` as select * from cp.`tpch/nation.parquet`", absolutePathsMetadataT1);
+      test("create table `%s` as select * from cp.`tpch/nation.parquet`", absolutePathsMetadataT2);
       copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/" +
-          "metadata_directories_with_absolute_paths.requires_replace.txt", RELATIVE_PATHS_METADATA, Metadata.METADATA_DIRECTORIES_FILENAME);
+          "metadata_directories_with_absolute_paths.requires_replace.txt", absolutePathsMetadata, Metadata.METADATA_DIRECTORIES_FILENAME);
       copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/" +
-          "metadata_table_with_absolute_paths.requires_replace.txt", RELATIVE_PATHS_METADATA, Metadata.METADATA_FILENAME);
+          "metadata_table_with_absolute_paths.requires_replace.txt", absolutePathsMetadata, Metadata.METADATA_FILENAME);
       copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/" +
-          "metadata_table_with_absolute_paths_t1.requires_replace.txt", relative_path_metadata_t1, Metadata.METADATA_FILENAME);
+          "metadata_table_with_absolute_paths_t1.requires_replace.txt", absolutePathsMetadataT1, Metadata.METADATA_FILENAME);
       copyMetaDataCacheToTempReplacingInternalPaths("parquet/metadata_with_absolute_path/" +
-          "metadata_table_with_absolute_paths_t2.requires_replace.txt", relative_path_metadata_t2, Metadata.METADATA_FILENAME);
+          "metadata_table_with_absolute_paths_t2.requires_replace.txt", absolutePathsMetadataT2, Metadata.METADATA_FILENAME);
+      String query = String.format("select * from %s", absolutePathsMetadata);
+      int expectedRowCount = 50;
+      int expectedNumFiles = 1; // point to selectionRoot since no pruning is done in this query
+      int actualRowCount = testSql(query);
+      assertEquals("An incorrect result was obtained while querying a table with metadata cache files",
+          expectedRowCount, actualRowCount);
+      String numFilesPattern = "numFiles=" + expectedNumFiles;
+      String usedMetaPattern = "usedMetadataFile=true";
+      String cacheFileRootPattern = String.format("cacheFileRoot=%s/%s", getDfsTestTmpSchemaLocation(), absolutePathsMetadata);
+      PlanTestBase.testPlanMatchingPatterns(query, new String[]{numFilesPattern, usedMetaPattern, cacheFileRootPattern},
+          new String[] {"Filter"});
+    } finally {
+      test("drop table if exists %s", absolutePathsMetadata);
+    }
+  }

-      int rowCount = testSql(String.format("select * from %s", RELATIVE_PATHS_METADATA));
-      assertEquals("An incorrect result was obtained while querying a table with metadata cache files", 50, rowCount);
+  @Test
+  public void testSpacesInMetadataCachePath() throws Exception {
+    final String pathWithSpaces = "path with spaces";
+    try {
+      test("use dfs_test.tmp");
+      // creating multilevel table to store path with spaces in both metadata files (METADATA and METADATA_DIRECTORIES)
+      test("create table `%s` as select * from cp.`tpch/nation.parquet`", pathWithSpaces);
+      test("create table `%1$s/%1$s` as select * from cp.`tpch/nation.parquet`", pathWithSpaces);
+      test("refresh table metadata `%s`", pathWithSpaces);
+      checkForMetadataFile(pathWithSpaces);

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121409#comment-16121409
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132414323
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -209,9 +211,10 @@ public ParquetGroupScan( //
     this.entries = Lists.newArrayList();
     if (fileSelection.getMetaContext() != null &&
         (fileSelection.getMetaContext().getPruneStatus() == PruneStatus.NOT_STARTED ||
-          fileSelection.getMetaContext().getPruneStatus() == PruneStatus.NOT_PRUNED)) {
-      // if pruning was not applicable or was attempted and nothing was pruned, initialize the
-      // entries with just the selection root instead of the fully expanded list to reduce overhead.
+          fileSelection.getMetaContext().getPruneStatus() == PruneStatus.NOT_PRUNED ||
+          fileSelection.getMetaContext().isMetadataFilesMissingOrCorrupted)) {
--- End diff --

There are 4 checks in the `if` statement; maybe we can factor them out into a separate method?
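
A sketch of such a helper; the method name is hypothetical, the accessors follow the diff:

  // hypothetical helper so the constructor reads:
  // if (shouldInitWithSelectionRootOnly(fileSelection)) { ... }
  private static boolean shouldInitWithSelectionRootOnly(FileSelection fileSelection) {
    MetadataContext metaContext = fileSelection.getMetaContext();
    return metaContext != null
        && (metaContext.getPruneStatus() == PruneStatus.NOT_STARTED
            || metaContext.getPruneStatus() == PruneStatus.NOT_PRUNED
            || metaContext.isMetadataFilesMissingOrCorrupted);
  }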



[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121415#comment-16121415
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132408506
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
@@ -132,25 +140,62 @@ public static ParquetTableMetadata_v3 getParquetTableMetadata(FileSystem fs,
   }

   /**
-   * Get the parquet metadata for a directory by reading the metadata file
+   * Get the parquet metadata for the table by reading the metadata file
    *
-   * @param fs
+   * @param fs current file system
    * @param path The path to the metadata file, located in the directory that contains the parquet files
-   * @return
-   * @throws IOException
+   * @param metaContext metadata context
+   * @param formatConfig parquet format plugin configs
+   * @return parquet table metadata
+   * @throws IOException if metadata file can't be read or updated
    */
-  public static ParquetTableMetadataBase readBlockMeta(FileSystem fs, String path, MetadataContext metaContext, ParquetFormatConfig formatConfig) throws IOException {
+  public static @Nullable ParquetTableMetadataBase readBlockMeta(FileSystem fs, Path path, MetadataContext metaContext,
+      ParquetFormatConfig formatConfig) {
+    if (ignoreReadingMetadata(metaContext, path)) {
+      return null;
+    }
     Metadata metadata = new Metadata(fs, formatConfig);
     metadata.readBlockMeta(path, false, metaContext);
     return metadata.parquetTableMetadata;
   }

-  public static ParquetTableMetadataDirs readMetadataDirs(FileSystem fs, String path, MetadataContext metaContext, ParquetFormatConfig formatConfig) throws IOException {
+  /**
+   * Get the parquet metadata for all subdirectories by reading the metadata file
+   *
--- End diff --

Please add a note explaining the case when we would return a null value.
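
One possible wording for the requested note (a sketch, not the author's text):

   * @return parquet table metadata, or {@code null} if a metadata file is missing,
   *         corrupted, or written in an unsupported version; the state is recorded
   *         in the {@link MetadataContext} so the file isn't re-read during the query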



[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121410#comment-16121410
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132415968
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -1123,4 +1163,22 @@ public GroupScan applyFilter(LogicalExpression filterExpr, UdfUtilities udfUtili
     }
   }

+  /**
+   * Ignore expanding selection, if metadata is corrupted
--- End diff --

Please add a comment noting that this method not only checks whether we should ignore expanding the selection, but also sets `parquetTableMetadata` and `fileSet` and calls `selection.revertDirStatuses()`.
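
A javadoc sketch covering the side effects listed above (the wording is an assumption, not the author's):

  /**
   * Checks whether expanding the selection from the metadata cache should be skipped
   * because the cache files are missing or corrupted. As a side effect it also sets
   * {@code parquetTableMetadata} and {@code fileSet} and calls
   * {@code selection.revertDirStatuses()} so that the scan falls back to the
   * non-cached path.
   */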



[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121414#comment-16121414
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132408285
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java ---
@@ -188,6 +188,10 @@ public void setExpandedFully() {
 this.dirStatus = StatusType.EXPANDED_FULLY;
   }
 
+  public void revertDirStatuses() {
--- End diff --

Can you please add javadoc explaining the cases when we would revert directory statuses?
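
One possible javadoc, inferred from the corrupted-metadata handling in this PR (an assumption, not the author's wording):

  /**
   * Reverts the directory statuses to their initial (not expanded) state.
   * Called when metadata cache files turn out to be missing or corrupted,
   * so that the selection can be expanded again without the cache.
   */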



[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121419#comment-16121419
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132411564
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -633,13 +645,22 @@ public long getRowCount() {
    * @throws UserException when the updated selection is empty, this happens if the user selects an empty folder.
    */
   private FileSelection
-  initFromMetadataCache(FileSelection selection, Path metaFilePath) throws IOException {
+  expandSelectionFromMetadataCache(FileSelection selection, Path metaFilePath) throws IOException {
     // get the metadata for the root directory by reading the metadata file
     // parquetTableMetadata contains the metadata for all files in the selection root folder, but we need to make sure
     // we only select the files that are part of selection (by setting fileSet appropriately)

+    MetadataContext metaContext = selection.getMetaContext();
+    // make sure that a metadata context is created since we are going to use metadata caching
+    if (metaContext == null) {
+      metaContext = new MetadataContext();
+      selection.setMetaContext(metaContext);
+    }
     // get (and set internal field) the metadata for the directory by reading the metadata file
-    this.parquetTableMetadata = Metadata.readBlockMeta(fs, metaFilePath.toString(), selection.getMetaContext(), formatConfig);
+    parquetTableMetadata = Metadata.readBlockMeta(fs, metaFilePath, metaContext, formatConfig);
+    if (ignoreExpandingSelection(parquetTableMetadata, selection)) {
+      return selection;
+    }
     if (formatConfig.autoCorrectCorruptDates) {
--- End diff --

Please add / use getter.



[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121422#comment-16121422
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132410745
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -974,8 +1005,17 @@ private int updateRowGroupInfo(long maxRecords) {
   public ParquetGroupScan clone(FileSelection selection) throws IOException {
     ParquetGroupScan newScan = new ParquetGroupScan(this);
     newScan.modifyFileSelection(selection);
-    newScan.setCacheFileRoot(selection.cacheFileRoot);
-    newScan.init(selection.getMetaContext());
+    MetadataContext metaContext = selection.getMetaContext();
+    if (metaContext == null) {
+      metaContext = new MetadataContext();
+    }
+    if (newScan.usedMetadataCache) {
+      newScan.setCacheFileRoot(selection.cacheFileRoot);
--- End diff --

Please use getter.



[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121412#comment-16121412
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132407801
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java ---
@@ -222,15 +222,17 @@ public DrillTable isReadable(DrillFileSystem fs, FileSelection selection,
       // create a metadata context that will be used for the duration of the query for this table
       MetadataContext metaContext = new MetadataContext();

-      ParquetTableMetadataDirs mDirs = Metadata.readMetadataDirs(fs, dirMetaPath.toString(), metaContext, formatConfig);
-      if (mDirs.getDirectories().size() > 0) {
-        FileSelection dirSelection = FileSelection.createFromDirectories(mDirs.getDirectories(), selection,
-            selection.getSelectionRoot() /* cacheFileRoot initially points to selectionRoot */);
-        dirSelection.setExpandedPartial();
-        dirSelection.setMetaContext(metaContext);
-
-        return new DynamicDrillTable(fsPlugin, storageEngineName, userName,
-            new FormatSelection(plugin.getConfig(), dirSelection));
+      ParquetTableMetadataDirs mDirs = Metadata.readMetadataDirs(fs, dirMetaPath, metaContext, formatConfig);
+      if (mDirs != null) {
+        if (mDirs.getDirectories().size() > 0) {
--- End diff --

Can we combine the two `if` statements into one?
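
A sketch of the combined check (valid as long as neither `if` carries its own `else` branch):

      // null signals that the directories metadata file was missing or corrupted
      if (mDirs != null && mDirs.getDirectories().size() > 0) {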



[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121420#comment-16121420
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132405399
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java ---
@@ -268,11 +276,15 @@ private FileSelection expandIfNecessary(FileSelection selection) throws IOExcept
     // use the cacheFileRoot if provided (e.g after partition pruning)
     Path metaFilePath = new Path(cacheFileRoot != null ? cacheFileRoot : selectionRoot, Metadata.METADATA_FILENAME);
     if (!fs.exists(metaFilePath)) { // no metadata cache
+      MetadataContext metaContext = selection.getMetaContext();
+      if (metaContext != null) {
+        // some metadata files are absent, but some are present (since metadata context was created)
+        metaContext.isMetadataFilesMissingOrCorrupted = true;
--- End diff --

Please use setter & getter instead of direct assignment / access.
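
A minimal sketch of the suggested encapsulation on MetadataContext (the accessor names are assumptions):

  private boolean metadataFilesMissingOrCorrupted;

  public void setMetadataFilesMissingOrCorrupted(boolean missingOrCorrupted) {
    this.metadataFilesMissingOrCorrupted = missingOrCorrupted;
  }

  public boolean isMetadataFilesMissingOrCorrupted() {
    return metadataFilesMissingOrCorrupted;
  }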



[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121416#comment-16121416
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132414700
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java ---
@@ -519,17 +564,16 @@ private void writeFile(ParquetTableMetadataDirs parquetTableMetadataDirs, Path p
   /**
    * Read the parquet metadata from a file
    *
-   * @param path
-   * @return
-   * @throws IOException
+   * @param path to metadata file
+   * @param dirsOnly true for {@link Metadata#METADATA_DIRECTORIES_FILENAME}
+   *                 or false for {@link Metadata#METADATA_FILENAME} files reading
+   * @param metaContext current metadata context
+   * @throws IOException if metadata file can't be read or updated
    */
-  private void readBlockMeta(String path,
-      boolean dirsOnly,
-      MetadataContext metaContext) throws IOException {
+  private void readBlockMeta(Path path, boolean dirsOnly, MetadataContext metaContext) {
     Stopwatch timer = Stopwatch.createStarted();
-    Path p = new Path(path);
-    Path parentDir = Path.getPathWithoutSchemeAndAuthority(p.getParent()); // parent directory of the metadata file
-    String parentDirString = parentDir.toUri().toString(); // string representation for parent directory of the metadata file
+    Path parentDir = Path.getPathWithoutSchemeAndAuthority(path.getParent()); // parent directory of the metadata file
+    String parentDirString = parentDir.toUri().getPath(); // string representation for parent directory of the metadata file
--- End diff --

`parentDirPath`



[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121413#comment-16121413
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132410703
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -974,8 +1005,17 @@ private int updateRowGroupInfo(long maxRecords) {
   public ParquetGroupScan clone(FileSelection selection) throws 
IOException {
 ParquetGroupScan newScan = new ParquetGroupScan(this);
 newScan.modifyFileSelection(selection);
-newScan.setCacheFileRoot(selection.cacheFileRoot);
-newScan.init(selection.getMetaContext());
+MetadataContext metaContext = selection.getMetaContext();
+if (metaContext == null) {
+  metaContext = new MetadataContext();
+}
+if (newScan.usedMetadataCache) {
--- End diff --

Please add / use getter.



[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121421#comment-16121421
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132405168
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/MetadataVersion.java
 ---
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ComparisonChain;
+import com.google.common.collect.ImmutableSortedSet;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+
+import java.util.SortedSet;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+
+public class MetadataVersion implements Comparable<MetadataVersion> {
+
+  private static final String FORMAT = "v?((?!0)\\d+)(\\.(\\d+))?";
+  private static final Pattern PATTERN = Pattern.compile(FORMAT);
+
+  private final int major;
+  private final int minor;
+
+  public MetadataVersion(int major, int minor) {
+    this.major = major;
+    this.minor = minor;
+  }
+
+  public MetadataVersion(String metadataVersion) {
+    Matcher matcher = PATTERN.matcher(metadataVersion);
+    if (!matcher.matches()) {
+      DrillRuntimeException.format("Could not parse metadata version '%s' using format '%s'", metadataVersion, FORMAT);
+    }
+    this.major = Integer.parseInt(matcher.group(1));
+    this.minor = matcher.group(3) != null ? Integer.parseInt(matcher.group(3)) : 0;
+  }
+
+  public int getMajor() {
+    return major;
+  }
+
+  public int getMinor() {
+    return minor;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+    if (this == o) {
+      return true;
+    }
+    if (!(o instanceof MetadataVersion)) {
+      return false;
+    }
+    MetadataVersion that = (MetadataVersion) o;
+    return this.major == that.major
+        && this.minor == that.minor;
+  }
+
+  @Override
+  public int hashCode() {
+    int result = major;
+    result = 31 * result + minor;
+    return result;
+  }
+
+  /**
+   * @return string representation of the metadata file version, for example: "v1", "v10", "v4.13"
+   *
+   * String metadata version consists of the following characters:
+   * optional "v" letter,
--- End diff --

In our case the `toString()` method won't return a result with the "v" prefix, so I guess we 
can remove this sentence.
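
For illustration only, a `toString()` consistent with that observation might look like this (a sketch under the stated assumption that the prefix is dropped, not necessarily the PR's exact code):

{code:java}
// Sketch (fragment of MetadataVersion): renders "3" or "4.13",
// without the optional "v" prefix that the parser still accepts.
@Override
public String toString() {
  return minor == 0 ? Integer.toString(major) : major + "." + minor;
}

// Parsing accepts both forms thanks to the leading "v?" in FORMAT:
// new MetadataVersion("v3")   -> major 3, minor 0
// new MetadataVersion("4.13") -> major 4, minor 13
{code}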


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read 

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121408#comment-16121408
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132413275
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -268,11 +276,15 @@ private FileSelection expandIfNecessary(FileSelection selection) throws IOException
     // use the cacheFileRoot if provided (e.g after partition pruning)
     Path metaFilePath = new Path(cacheFileRoot != null ? cacheFileRoot : selectionRoot, Metadata.METADATA_FILENAME);
     if (!fs.exists(metaFilePath)) { // no metadata cache
+      MetadataContext metaContext = selection.getMetaContext();
--- End diff --

It seems you set the metadata context here, but can there still be cases where it is 
null? Also, can you please explain why there are so many null checks for the meta 
context in the code; can we make sure it always exists?
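
One way to guarantee that the context always exists, in the spirit of this question, is a null-object style initialization (a hypothetical sketch, not the actual fix in the PR):

{code:java}
// Hypothetical sketch: FileSelection always carries a MetadataContext,
// so callers never need a null check (fragment for illustration).
public class FileSelection {
  private final MetadataContext metaContext = new MetadataContext();

  public MetadataContext getMetaContext() {
    return metaContext; // never null by construction
  }
}
{code}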


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same beh

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121411#comment-16121411
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132409360
  
--- Diff: 
exec/java-exec/src/test/resources/parquet/unsupported_metadata/corrupted_metadata.requires_replace.txt
 ---
@@ -0,0 +1,41 @@
+{
+  "metadata_version" : "v3",
+  "columnTypeInfo"
+
+  CORRUPTED DATA THAT LEADS TO JsonParseException
+
+"path" : "0_0_0.parquet",
+"length" : 2424,
+"rowGroups" : [ {
+  "start" : 4,
+  "length" : 1802,
+  "rowCount" : 25,
+  "hostAffinity" : {
+"localhost" : 1.0
+  },
+  "columns" : [ {
+"name" : [ "n_nationkey" ],
+"minValue" : 0,
+"maxValue" : 24,
+"nulls" : 0
+  }, {
+"name" : [ "n_name" ],
+"minValue" : "ALGERIA",
+"maxValue" : "VIETNAM",
+"nulls" : 0
+  }, {
+"name" : [ "n_regionkey" ],
+"minValue" : 0,
+"maxValue" : 4,
+"nulls" : 0
+  }, {
+"name" : [ "n_comment" ],
+"minValue" : " haggle. carefully final deposits detect slyly agai",
+"maxValue" : "y final packages. slow foxes cajole quickly. quickly 
silent platelets breach ironic accounts. unusual pinto be",
+"nulls" : 0
+  } ]
+} ]
+  } ],
+  "directories" : [ ],
+  "drillVersion" : "1.11.0-SNAPSHOT"
--- End diff --

Maybe use a Drill version without the snapshot suffix here and below?


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed with

[jira] [Commented] (DRILL-5660) Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121417#comment-16121417
 ] 

ASF GitHub Bot commented on DRILL-5660:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/877#discussion_r132408041
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/MetadataContext.java
 ---
@@ -41,6 +42,9 @@
 
   private PruneStatus pruneStatus = PruneStatus.NOT_STARTED;
 
+  // Setting this value as true allows to avoid double reading of corrupted, unsupported or missing metadata files
+  public boolean isMetadataFilesMissingOrCorrupted;
--- End diff --

Please make private and add setter and getter.
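
A minimal sketch of the requested encapsulation (the accessor names are assumptions for illustration):

{code:java}
// Sketch (fragment of MetadataContext): private flag plus accessors
// instead of a public mutable field.
private boolean metadataFilesMissingOrCorrupted;

public boolean isMetadataFilesMissingOrCorrupted() {
  return metadataFilesMissingOrCorrupted;
}

public void setMetadataFilesMissingOrCorrupted(boolean missingOrCorrupted) {
  this.metadataFilesMissingOrCorrupted = missingOrCorrupted;
}
{code}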


> Drill 1.10 queries fail due to Parquet Metadata "corruption" from DRILL-3867
> 
>
> Key: DRILL-5660
> URL: https://issues.apache.org/jira/browse/DRILL-5660
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Paul Rogers
>Assignee: Vitalii Diravka
>  Labels: doc-impacting
> Fix For: 1.12.0
>
>
> Drill recently accepted a PR for the following bug:
> DRILL-3867: Store relative paths in metadata file
> This PR turned out to have a flaw which affects version compatibility.
> The DRILL-3867 PR changed the format of Parquet metadata files to store 
> relative paths. All Drill servers after that PR create files with relative 
> paths. But, the version number of the file is unchanged, so that older 
> Drillbits don't know that they can't understand the file.
> Instead, if an older Drill tries to read the file, queries fail left and 
> right. Drill will resolve the paths, but does so relative to the user's HDFS 
> home directory, which is wrong.
> What should have happened is that we should have bumped the parquet metadata 
> file version number so that older Drillbits can’t read the file. This ticket 
> requests that we do that.
> Now, one could argue that the lack of version number change is fine. Once a 
> user upgrades Drill, they won't use an old Drillbit. But, things are not that 
> simple:
> * A developer tests a branch based on a pre-DRILL-3867 build on a cluster in 
> which metadata files have been created by a post-DRILL-3867 build. (This has 
> already occurred multiple times in our shop.)
> * A user tries to upgrade to Drill 1.11, finds an issue, and needs to roll 
> back to Drill 1.10. Doing so will cause queries to fail due to 
> seemingly-corrupt metadata files.
> * A user tries to do a rolling upgrade: running 1.11 on some servers, 1.10 on 
> others. Once a 1.11 server is installed, the metadata is updated ("corrupted" 
> from the perspective of 1.10) and queries fail.
> Standard practice in this scenario is to:
> * Bump the file version number when the file format changes, and
> * Software refuses to read files with a version newer than the software was 
> designed for.
> Of course, it is highly desirable that newer servers read old files, but that 
> is not the issue here.
> *Main technical points of working of parquet metadata caching for now.*
> Only process of reading the parquet metadata is changed (the process of 
> writing isn't changed):
> +1. Metadata files are valid:+
> Metadata objects are created by deserialization of parquet metadata files in 
> the process of creating ParquetGroupScan physical operator. 
> All supported versions are stored in the "MetadataVersion.Constants" class 
> and in the Jackson annotations for Metadata.ParquetTableMetadataBase class.
> +2. Metadata files version isn't supported (created by newer Drill version). 
> Drill table has at least one metadata file of unsupported version:+
> JsonMappingException is obtained and swallowed without creating metadata 
> object. Error message is logged. The state is stored in MetadataContext, 
> therefore further there will be no attempt to deserialize metadata file again 
> in context of performing current query. The physical plan will be created 
> without using parquet metadata caching. Warning message is logged for every 
> further check "is metadata corrupted".
> +3. Drill table has at least one corrupted metadata file, which can't be 
> deserialized:+
> JsonParseException is obtained. Then the same behaviour as for the 
> unsupported version files.
> +4. The metadata file was removed by other process:+
> FileNotFound is obtained. Then the same behaviour as for the unsupported 
> version files.
> The new versions of metadata should be added in such manner:
> 1. Increasing of the metadata major version if metadata structure is changed.
> 2. Increasing of the metadata minor version if only metadata content is 
> changed, but metadata structure is the same.
> For the first case a

[jira] [Commented] (DRILL-1162) 25 way join ended up with OOM

2017-08-10 Thread Volodymyr Vysotskyi (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121442#comment-16121442
 ] 

Volodymyr Vysotskyi commented on DRILL-1162:


I was not able to reproduce the out-of-heap-memory failure for the query, so I could not 
determine its cause.
As for the direct memory, as I wrote in the earlier comments, the main reason is wrong 
row count estimation when joining by PR.
Let's consider the memory usage of the hash join operator. The information about direct 
memory usage below was taken from the query profiles.
For example, for queries similar to this one:
{code:sql}
select count(*) from k
inner join l on k.l_partkey=l.l_partkey
inner join m on m.l_partkey=l.l_partkey
inner join n on m.l_partkey=n.l_partkey
inner join o on n.l_partkey=o.l_partkey
inner join p on p.l_partkey=o.l_partkey;
{code}
The hash join operator uses the following amounts of memory:
||join||build side row count||Max Peak Memory for HASH_JOIN||
|k and l|1|1MB|
|m and (k and l)|109,920|2MB|
|n and (m and (k and l))|1,867,062|16MB|
|o and (n and (m and (k and l)))|43,037,076|338MB|
|o and (n and (m1 and (k and l)))|49,446,827|388MB|
|o and (n1 and (m1 and (k and l)))|54,949,346|431MB|
|q and (p and (m and (k and l)))|61,771,300|484MB|
|p and (o and (n and (k and m)))|99,483,263|778MB|
|s and (q and (p and (k and o)))|330,512,646|2.52GB|
The m1, n1, q and s tables are created by the following queries:
{code:sql}
create table n1 as select * from `lineitem1.parquet` limit 45020;
create table m1 as select * from `lineitem1.parquet` limit 35010;
create table q as (select l_partkey from `lineitem1.parquet` union all select 
l_partkey from `lineitem1.parquet` limit 70050);
create table s as (select l_partkey from `lineitem1.parquet` union all select 
l_partkey from `lineitem1.parquet` limit 90050);
{code}
As you can see from the table, the hash join operator uses direct memory in 
proportion to the row count of the build side.
For the query similar to the query from the Jira description, the ratio of memory 
used by the hash join operator to the build-side row count is almost the same for the 
simplified query (which executes without OOM)
{code:sql}
select count(*) from `lineitem1.parquet` a
inner join `part.parquet` j on a.l_partkey = j.p_partkey 
inner join `orders.parquet` k on a.l_orderkey = k.o_orderkey 
inner join `supplier.parquet` l on a.l_suppkey = l.s_suppkey 
inner join `partsupp.parquet` m on j.p_partkey = m.ps_partkey and l.s_suppkey = 
m.ps_suppkey 
inner join `customer.parquet` n on k.o_custkey = n.c_custkey 
inner join `lineitem2.parquet` b on a.l_orderkey = b.l_orderkey 
inner join `lineitem2.parquet` d on a.l_suppkey = d.l_suppkey 
inner join `lineitem2.parquet` e on a.l_extendedprice = e.l_extendedprice;
{code}
and for the query that fails:
{code:sql}
select count(*) from `lineitem1.parquet` a
inner join `part.parquet` j on a.l_partkey = j.p_partkey 
inner join `orders.parquet` k on a.l_orderkey = k.o_orderkey 
inner join `supplier.parquet` l on a.l_suppkey = l.s_suppkey 
inner join `partsupp.parquet` m on j.p_partkey = m.ps_partkey and l.s_suppkey = 
m.ps_suppkey 
inner join `customer.parquet` n on k.o_custkey = n.c_custkey 
inner join `lineitem2.parquet` c on a.l_partkey = c.l_partkey 
inner join `lineitem2.parquet` d on a.l_suppkey = d.l_suppkey 
inner join `lineitem2.parquet` e on a.l_extendedprice = e.l_extendedprice;
{code}
2.08GB / 181,675,567 and 9.02GB / 787,992,000 respectively, which is about 12 bytes per build-side row in both cases.

So the amount of memory used by the hash join operator is proportional to the 
row count of the build side. That's why the actual issue is the row count 
estimation, which produces a plan in which the right (build) input of the hash 
join operator actually has a larger row count than the probe side.

Creating or modifying existing planning rules may change the plan more 
significantly, so I don't think we should do this for the current bug. But 
swapping the join inputs only for the hash join, after the best physical plan 
has been chosen, should not cause regressions. Besides, the swapping I am 
proposing will happen very rarely, since all the conditions I described in my 
previous comment must be satisfied.

There is not enough information to make a strict decision about swapping, so we 
only check that the worst case may happen. I was looking for a better solution, 
but taking into account the shortage of information about row counts I chose 
this fix.
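
As a rough illustration of the proposed swap (purely a sketch; the metadata query object and the `swapInputs()` method are assumptions, not the actual Drill planner code):

{code:java}
// Hypothetical sketch: once the best physical plan has been chosen, swap
// the hash join inputs when the estimated build-side (right) row count
// exceeds the probe-side (left) estimate, since build-side memory grows
// with its row count.
double probeRows = mq.getRowCount(hashJoin.getLeft());   // probe (left) estimate
double buildRows = mq.getRowCount(hashJoin.getRight());  // build (right) estimate
if (buildRows > probeRows) {
  hashJoin = hashJoin.swapInputs(); // keep the smaller input on the build side
}
{code}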

> 25 way join ended up with OOM
> -
>
> Key: DRILL-1162
> URL: https://issues.apache.org/jira/browse/DRILL-1162
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Flow, Query Planning & Optimization
>Reporter: Rahul Challapalli
>Assignee: Volodymyr Vysotskyi
>Priority: Critical
> Fix For: Future
>
> Attachments: error.log, oom_error.log
>
>
> git.commit.id.abbrev=e5c2da0
> The below query res

[jira] [Reopened] (DRILL-5699) Drill Web UI Page Source Has Links To External Sites

2017-08-10 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva reopened DRILL-5699:
-

> Drill Web UI Page Source Has Links To External Sites
> 
>
> Key: DRILL-5699
> URL: https://issues.apache.org/jira/browse/DRILL-5699
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Client - HTTP
>Reporter: Sindhuri Ramanarayan Rayavaram
>Assignee: Sindhuri Ramanarayan Rayavaram
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.12.0
>
>
> Drill uses external CDN for javascript and css files in the result page. When 
> there is no internet connection this page fails to load. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (DRILL-5699) Drill Web UI Page Source Has Links To External Sites

2017-08-10 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva reassigned DRILL-5699:
---

Assignee: Sindhuri Ramanarayan Rayavaram  (was: Arina Ielchiieva)

> Drill Web UI Page Source Has Links To External Sites
> 
>
> Key: DRILL-5699
> URL: https://issues.apache.org/jira/browse/DRILL-5699
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Client - HTTP
>Reporter: Sindhuri Ramanarayan Rayavaram
>Assignee: Sindhuri Ramanarayan Rayavaram
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.12.0
>
>
> Drill uses external CDN for javascript and css files in the result page. When 
> there is no internet connection this page fails to load. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (DRILL-5699) Drill Web UI Page Source Has Links To External Sites

2017-08-10 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva reassigned DRILL-5699:
---

Assignee: Arina Ielchiieva  (was: Sindhuri Ramanarayan Rayavaram)

> Drill Web UI Page Source Has Links To External Sites
> 
>
> Key: DRILL-5699
> URL: https://issues.apache.org/jira/browse/DRILL-5699
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Client - HTTP
>Reporter: Sindhuri Ramanarayan Rayavaram
>Assignee: Arina Ielchiieva
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.12.0
>
>
> Drill uses external CDN for javascript and css files in the result page. When 
> there is no internet connection this page fails to load. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5699) Drill Web UI Page Source Has Links To External Sites

2017-08-10 Thread Arina Ielchiieva (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121496#comment-16121496
 ] 

Arina Ielchiieva commented on DRILL-5699:
-

Updated Jira status. To get into weekly batch commits Jira status should be in 
Reviewable and ready-to-commit label present.
After changes are merged into master, Jira status should be changed to 
Resolved, comment with commit id added and PR is closed. 
Usually all this should be done by weekly batch committer if not it can be done 
by the developer.

> Drill Web UI Page Source Has Links To External Sites
> 
>
> Key: DRILL-5699
> URL: https://issues.apache.org/jira/browse/DRILL-5699
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Client - HTTP
>Reporter: Sindhuri Ramanarayan Rayavaram
>Assignee: Sindhuri Ramanarayan Rayavaram
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.12.0
>
>
> Drill uses external CDN for javascript and css files in the result page. When 
> there is no internet connection this page fails to load. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (DRILL-5699) Drill Web UI Page Source Has Links To External Sites

2017-08-10 Thread Arina Ielchiieva (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121496#comment-16121496
 ] 

Arina Ielchiieva edited comment on DRILL-5699 at 8/10/17 11:28 AM:
---

Updated Jira status. To get into weekly batch commits Jira status should be 
Reviewable and ready-to-commit label present.
After changes are merged into master, Jira status should be changed to 
Resolved, comment with commit id added and PR is closed. 
Usually all this should be done by weekly batch committer if not it can be done 
by the developer.


was (Author: arina):
Updated Jira status. To get into weekly batch commits Jira status should be in 
Reviewable and ready-to-commit label present.
After changes are merged into master, Jira status should be changed to 
Resolved, comment with commit id added and PR is closed. 
Usually all this should be done by weekly batch committer if not it can be done 
by the developer.

> Drill Web UI Page Source Has Links To External Sites
> 
>
> Key: DRILL-5699
> URL: https://issues.apache.org/jira/browse/DRILL-5699
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Client - HTTP
>Reporter: Sindhuri Ramanarayan Rayavaram
>Assignee: Sindhuri Ramanarayan Rayavaram
>Priority: Minor
>  Labels: ready-to-commit
> Fix For: 1.12.0
>
>
> Drill uses external CDN for javascript and css files in the result page. When 
> there is no internet connection this page fails to load. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (DRILL-5714) Fix NPE when mapr-db plugin is used in table function

2017-08-10 Thread Arina Ielchiieva (JIRA)
Arina Ielchiieva created DRILL-5714:
---

 Summary: Fix NPE when mapr-db plugin is used in table function
 Key: DRILL-5714
 URL: https://issues.apache.org/jira/browse/DRILL-5714
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - MapRDB
Affects Versions: 1.11.0
Reporter: Arina Ielchiieva
Assignee: Arina Ielchiieva
 Fix For: 1.12.0


Query:
{noformat}
select * from table(mapr.`/tmp/test`(type=>'maprdb', allTextMode=>true))
{noformat}

Error:
{noformat}
2017-08-03 13:13:56,527 [267cde6b-6327-11b4-a4f3-f59a8d82ae17:frag:0:0] ERROR 
o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: NullPointerException

Fragment 0:0

[Error Id: 07937ad2-90ce-43e3-b9c1-2540e1df05fa on master:31010]
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
NullPointerException

Fragment 0:0

[Error Id: 07937ad2-90ce-43e3-b9c1-2540e1df05fa on master:31010]
at 
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:550)
 ~[drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:295)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:264)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) 
[drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
[na:1.7.0_95]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
[na:1.7.0_95]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_95]
Caused by: org.apache.drill.common.exceptions.ExecutionSetupException: 
java.lang.NullPointerException
at 
org.apache.drill.exec.store.mapr.db.MapRDBScanBatchCreator.getBatch(MapRDBScanBatchCreator.java:52)
 ~[drill-format-mapr-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.store.mapr.db.MapRDBScanBatchCreator.getBatch(MapRDBScanBatchCreator.java:36)
 ~[drill-format-mapr-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:156)
 ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:179)
 ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:136)
 ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:179)
 ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.ImplCreator.getRootExec(ImplCreator.java:109)
 ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:87) 
~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:207)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
... 4 common frames omitted
Caused by: java.lang.NullPointerException: null
at 
org.apache.drill.exec.store.mapr.db.MapRDBScanBatchCreator.getBatch(MapRDBScanBatchCreator.java:46)
 ~[drill-format-mapr-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
... 12 common frames omitted
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5714) Fix NPE when mapr-db plugin is used in table function

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121538#comment-16121538
 ] 

ASF GitHub Bot commented on DRILL-5714:
---

GitHub user arina-ielchiieva opened a pull request:

https://github.com/apache/drill/pull/902

DRILL-5714: Fix NPE when mapr-db plugin is used in table function



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/arina-ielchiieva/drill DRILL-5714

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/902.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #902


commit bdfb4ab29ae735b125f58a02c6302fbe43da31a7
Author: Arina Ielchiieva 
Date:   2017-08-10T11:34:13Z

DRILL-5714: Fix NPE when mapr-db plugin is used in table function




> Fix NPE when mapr-db plugin is used in table function
> -
>
> Key: DRILL-5714
> URL: https://issues.apache.org/jira/browse/DRILL-5714
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - MapRDB
>Affects Versions: 1.11.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
> Fix For: 1.12.0
>
>
> Query:
> {noformat}
> select * from table(mapr.`/tmp/test`(type=>'maprdb', allTextMode=>true))
> {noformat}
> Error:
> {noformat}
> 2017-08-03 13:13:56,527 [267cde6b-6327-11b4-a4f3-f59a8d82ae17:frag:0:0] ERROR 
> o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: NullPointerException
> Fragment 0:0
> [Error Id: 07937ad2-90ce-43e3-b9c1-2540e1df05fa on master:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> NullPointerException
> Fragment 0:0
> [Error Id: 07937ad2-90ce-43e3-b9c1-2540e1df05fa on master:31010]
>   at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:550)
>  ~[drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:295)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:264)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_95]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_95]
>   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_95]
> Caused by: org.apache.drill.common.exceptions.ExecutionSetupException: 
> java.lang.NullPointerException
>   at 
> org.apache.drill.exec.store.mapr.db.MapRDBScanBatchCreator.getBatch(MapRDBScanBatchCreator.java:52)
>  ~[drill-format-mapr-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.store.mapr.db.MapRDBScanBatchCreator.getBatch(MapRDBScanBatchCreator.java:36)
>  ~[drill-format-mapr-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:156)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:179)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:136)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:179)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getRootExec(ImplCreator.java:109)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:87) 
> ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:207)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   ... 4 common frames omitted
> Caused by: java.lang.NullPointerException: null
>   at 
> org.apache.drill.exec.store.mapr.db.MapRDBScanBatchCreator.getBatch(MapRDBScanBatchCreator.java:46)
>  ~[drill-format-mapr-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   ... 12 common frames omitted
> {noformat}



--
This message was sent by Atlassian JIRA
(v6

[jira] [Commented] (DRILL-5691) multiple count distinct query planning error at physical phase

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121815#comment-16121815
 ] 

ASF GitHub Bot commented on DRILL-5691:
---

Github user weijietong commented on the issue:

https://github.com/apache/drill/pull/889
  
@arina-ielchiieva  I have run all the unit tests under the java-exec bundle and 
got some errors which are surely not associated with this change; maybe my branch 
was behind master.
>
  TestConstantFolding.testConstantFolding_allTypes:163 »  
org.apache.drill.commo...
  TestCustomUserAuthenticator.positiveUserAuth » UserRemote SYSTEM ERROR: 
URISyn...
  TestCustomUserAuthenticator.positiveUserAuthAfterNegativeUserAuth » 
UserRemote
  TestInfoSchema.selectFromAllTables » UserRemote SYSTEM ERROR: 
URISyntaxExcepti...
  
TestViewSupport.infoSchemaWithView:355->BaseTestQuery.testRunAndReturn:331 » Rpc
  TestInfoSchemaFilterPushDown.testFilterPushdown_NonEqual » UserRemote 
SYSTEM E...
  TestParquetScan.testSuccessFile:64->BaseTestQuery.testRunAndReturn:331 » 
Rpc o...
  TestTpchDistributedConcurrent.testConcurrentQueries:190 »  test timed out 
afte...

I also hit an error when running `mvn test`, the same error as in JIRA 
DRILL-4104. By running `mvn test -pl exec/java-exec` instead, I got the unit 
test results. I wonder how the devs run the tests.





> multiple count distinct query planning error at physical phase 
> ---
>
> Key: DRILL-5691
> URL: https://issues.apache.org/jira/browse/DRILL-5691
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.9.0, 1.10.0
>Reporter: weijie.tong
>
> I materialized the count distinct query result in a cache and added a plugin 
> rule to translate (Aggregate, Aggregate, Project, Scan) or 
> (Aggregate, Aggregate, Scan) to (Project, Scan) at the PARTITION_PRUNING phase. 
> Then, once a user issues count distinct queries, they are translated to query 
> the cache for the result.
> eg1: "select count(*), sum(a), count(distinct b) from t where dt=xx"
> eg2: "select count(*), sum(a), count(distinct b), count(distinct c) from t 
> where dt=xxx"
> eg3: "select count(distinct b), count(distinct c) from t where dt=xxx"
> eg1 works and returns the result I expected, but eg2 fails at the physical 
> phase. The error info is here: 
> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269. 
> eg3 fails with a similar error.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5691) multiple count distinct query planning error at physical phase

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121824#comment-16121824
 ] 

ASF GitHub Bot commented on DRILL-5691:
---

Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/889
  
On my environment all tests pass. I used to have troubles, but that was a long 
time ago; I had to fix issues with timezones and memory usage (`MAVEN_OPTS="-Xmx2048m 
-XX:MaxPermSize=1024m"`). Also, I am pretty sure that some tests failed 
because the others did, usually the ones with an Rpc exception. Maybe the `UserRemote 
SYSTEM ERROR: URISyntaxException` is connected with your environment; you should check 
that. Also please rebase on the latest master (you can do `push -f` again), 
and if `TestTpchDistributedConcurrent.testConcurrentQueries` fails again, 
try to disable it but make sure it passes separately.


> multiple count distinct query planning error at physical phase 
> ---
>
> Key: DRILL-5691
> URL: https://issues.apache.org/jira/browse/DRILL-5691
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.9.0, 1.10.0
>Reporter: weijie.tong
>
> I materialized the count distinct query result in a cache and added a plugin 
> rule to translate (Aggregate, Aggregate, Project, Scan) or 
> (Aggregate, Aggregate, Scan) to (Project, Scan) at the PARTITION_PRUNING phase. 
> Then, once a user issues count distinct queries, they are translated to query 
> the cache for the result.
> eg1: "select count(*), sum(a), count(distinct b) from t where dt=xx"
> eg2: "select count(*), sum(a), count(distinct b), count(distinct c) from t 
> where dt=xxx"
> eg3: "select count(distinct b), count(distinct c) from t where dt=xxx"
> eg1 works and returns the result I expected, but eg2 fails at the physical 
> phase. The error info is here: 
> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269. 
> eg3 fails with a similar error.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5714) Fix NPE when mapr-db plugin is used in table function

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121903#comment-16121903
 ] 

ASF GitHub Bot commented on DRILL-5714:
---

Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/902
  
@paul-rogers, thanks for the code review. 

Here is a short description, hope it is helpful:
The main change is in the `MapRDBSubScan` class. The problem was that the mapr-db 
plugin was trying to get the format plugin from the default list of format plugins 
(which is created at Drill startup), but since a table function uses a custom 
format, it was getting a null value. For this case we need to get the format plugin 
from the `StoragePluginRegistry`, which will create the format plugin if it is absent 
from the list of default ones. For example, the same approach is used in the 
`EasySubScan` class, which is used for the text plugin.
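
A hedged sketch of that lookup pattern (simplified; the field name and the exact constructor signature are assumptions, the actual PR may differ):

{code:java}
// Sketch: resolve the format plugin through the StoragePluginRegistry so
// that a custom format passed via a table function is created on demand
// instead of being looked up (and missed) in the default plugin list.
@JsonCreator
public MapRDBSubScan(@JacksonInject StoragePluginRegistry registry,
    @JsonProperty("storageConfig") StoragePluginConfig storageConfig,
    @JsonProperty("formatPluginConfig") FormatPluginConfig formatPluginConfig)
    throws ExecutionSetupException {
  this.formatPlugin = (MapRDBFormatPlugin) registry.getFormatPlugin(storageConfig, formatPluginConfig);
}
{code}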



> Fix NPE when mapr-db plugin is used in table function
> -
>
> Key: DRILL-5714
> URL: https://issues.apache.org/jira/browse/DRILL-5714
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - MapRDB
>Affects Versions: 1.11.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
> Fix For: 1.12.0
>
>
> Query:
> {noformat}
> select * from table(mapr.`/tmp/test`(type=>'maprdb', allTextMode=>true))
> {noformat}
> Error:
> {noformat}
> 2017-08-03 13:13:56,527 [267cde6b-6327-11b4-a4f3-f59a8d82ae17:frag:0:0] ERROR 
> o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: NullPointerException
> Fragment 0:0
> [Error Id: 07937ad2-90ce-43e3-b9c1-2540e1df05fa on master:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> NullPointerException
> Fragment 0:0
> [Error Id: 07937ad2-90ce-43e3-b9c1-2540e1df05fa on master:31010]
>   at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:550)
>  ~[drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:295)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:264)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_95]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_95]
>   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_95]
> Caused by: org.apache.drill.common.exceptions.ExecutionSetupException: 
> java.lang.NullPointerException
>   at 
> org.apache.drill.exec.store.mapr.db.MapRDBScanBatchCreator.getBatch(MapRDBScanBatchCreator.java:52)
>  ~[drill-format-mapr-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.store.mapr.db.MapRDBScanBatchCreator.getBatch(MapRDBScanBatchCreator.java:36)
>  ~[drill-format-mapr-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:156)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:179)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:136)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:179)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getRootExec(ImplCreator.java:109)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:87) 
> ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:207)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   ... 4 common frames omitted
> Caused by: java.lang.NullPointerException: null
>   at 
> org.apache.drill.exec.store.mapr.db.MapRDBScanBatchCreator.getBatch(MapRDBScanBatchCreator.java:46)
>  ~[drill-format-mapr-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   ... 12 common frames omitted
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DRILL-5714) Fix NPE when mapr-db plugin is used in table function

2017-08-10 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-5714:

Labels: ready-to-commit  (was: )

> Fix NPE when mapr-db plugin is used in table function
> -
>
> Key: DRILL-5714
> URL: https://issues.apache.org/jira/browse/DRILL-5714
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - MapRDB
>Affects Versions: 1.11.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>  Labels: ready-to-commit
> Fix For: 1.12.0
>
>
> Query:
> {noformat}
> select * from table(mapr.`/tmp/test`(type=>'maprdb', allTextMode=>true))
> {noformat}
> Error:
> {noformat}
> 2017-08-03 13:13:56,527 [267cde6b-6327-11b4-a4f3-f59a8d82ae17:frag:0:0] ERROR 
> o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: NullPointerException
> Fragment 0:0
> [Error Id: 07937ad2-90ce-43e3-b9c1-2540e1df05fa on master:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: 
> NullPointerException
> Fragment 0:0
> [Error Id: 07937ad2-90ce-43e3-b9c1-2540e1df05fa on master:31010]
>   at 
> org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:550)
>  ~[drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:295)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:264)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>  [drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_95]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_95]
>   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_95]
> Caused by: org.apache.drill.common.exceptions.ExecutionSetupException: 
> java.lang.NullPointerException
>   at 
> org.apache.drill.exec.store.mapr.db.MapRDBScanBatchCreator.getBatch(MapRDBScanBatchCreator.java:52)
>  ~[drill-format-mapr-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.store.mapr.db.MapRDBScanBatchCreator.getBatch(MapRDBScanBatchCreator.java:36)
>  ~[drill-format-mapr-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:156)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:179)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:136)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:179)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getRootExec(ImplCreator.java:109)
>  ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:87) 
> ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:207)
>  [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   ... 4 common frames omitted
> Caused by: java.lang.NullPointerException: null
>   at 
> org.apache.drill.exec.store.mapr.db.MapRDBScanBatchCreator.getBatch(MapRDBScanBatchCreator.java:46)
>  ~[drill-format-mapr-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>   ... 12 common frames omitted
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5663) Drillbit fails to start when only keystore path is provided without keystore password.

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121989#comment-16121989
 ] 

ASF GitHub Bot commented on DRILL-5663:
---

Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/874
  
LGTM. 
Are you going to put the fix from DRILL-5712 into a separate PR?


> Drillbit fails to start when only keystore path is provided without keystore 
> password.
> --
>
> Key: DRILL-5663
> URL: https://issues.apache.org/jira/browse/DRILL-5663
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Sorabh Hamirwasia
>Assignee: Sindhuri Ramanarayan Rayavaram
>
> When we configure a keystore path without a keystore password inside 
> drill-override.conf for the WebServer, the Drillbit fails to start. We should 
> explicitly check that either both are present or both are absent. If only one 
> of them is present, then throw a startup exception for Drill.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5712) Update the pom files with dependency exclusions for commons-codec

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122204#comment-16122204
 ] 

ASF GitHub Bot commented on DRILL-5712:
---

GitHub user sindhurirayavaram opened a pull request:

https://github.com/apache/drill/pull/903

DRILL-5712: Update the pom files with dependency exclusions for 
commons-codec

In java-exec, we are adding a dependency on commons-codec version 1.10. 
Other dependencies like hadoop-common, parquet-column etc. pull in different 
versions of commons-codec, so exclusions should be added for commons-codec in 
these dependencies.
This change adds the exclusions in all the pom.xml files.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sindhurirayavaram/drill DRILL-5712

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/903.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #903


commit 88389f83e7c786c47c907b73a086d677a224778a
Author: Sindhuri Rayavaram 
Date:   2017-08-10T19:50:58Z

DRILL-5712: Update the pom files with dependency exclusions for 
commons-codec




> Update the pom files with dependency exclusions for commons-codec
> -
>
> Key: DRILL-5712
> URL: https://issues.apache.org/jira/browse/DRILL-5712
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Sindhuri Ramanarayan Rayavaram
>Assignee: Sindhuri Ramanarayan Rayavaram
>
> In java-exec, we are adding a dependency on commons-codec version 1.10. 
> Other dependencies like hadoop-common, parquet-column etc. pull in different 
> versions of commons-codec, so exclusions should be added for commons-codec in 
> these dependencies.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5663) Drillbit fails to start when only keystore path is provided without keystore password.

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122208#comment-16122208
 ] 

ASF GitHub Bot commented on DRILL-5663:
---

Github user sindhurirayavaram commented on the issue:

https://github.com/apache/drill/pull/874
  
Created a separate pull request for DRILL-5712.


> Drillbit fails to start when only keystore path is provided without keystore 
> password.
> --
>
> Key: DRILL-5663
> URL: https://issues.apache.org/jira/browse/DRILL-5663
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Sorabh Hamirwasia
>Assignee: Sindhuri Ramanarayan Rayavaram
>
> When we configure a keystore path without a keystore password inside 
> drill-override.conf for the WebServer, the Drillbit fails to start. We should 
> explicitly check that either both are present or both are absent. If only one 
> of them is present, then throw a startup exception for Drill.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (DRILL-5715) Performance of refactored HashAgg operator regressed

2017-08-10 Thread Kunal Khatua (JIRA)
Kunal Khatua created DRILL-5715:
---

 Summary: Performance of refactored HashAgg operator regressed
 Key: DRILL-5715
 URL: https://issues.apache.org/jira/browse/DRILL-5715
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Codegen
Affects Versions: 1.11.0
 Environment: 10-node RHEL 6.4 (32 Core, 256GB RAM)
Reporter: Kunal Khatua
Assignee: Boaz Ben-Zvi
 Fix For: 1.12.0


When running the following simple HashAgg-based query on the TPC-H Lineitem 
table (6 billion rows) on a 10-node setup (with a single partition to disable 
any possible spilling to disk),

{code:sql}
select count(*) 
from (
  select l_quantity
, count(l_orderkey) 
  from lineitem 
  group by l_quantity 
)  {code}

the runtime increased from {{7.378 sec}} to {{11.323 sec}} (as reported by the 
JDBC client).

To disable spill-to-disk in Drill-1.11.0, the {{drill-override.conf}} was 
modified to 
{code}drill.exec.hashagg.num_partitions : 1{code}

Attached are two profiles:
Drill 1.10.0 : 2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill 
Drill 1.11.0 : 2675de42-3789-47b8-29e8-c5077af136db.sys.drill 

A separate run was done for both scenarios with 
{{planner.width.max_per_node=10}} and profiled with YourKit.

Image snippets are attached, indicating the hotspots in both builds:
Drill 1.10.0 : 
 Profile: 26736242-d084-6604-aac9-927e729da755.sys.drill
 HotSpot: drill-1.10.0_hotspot.png 
Drill 1.11.0 : 
 Profile: 26736615-9e86-dac9-ad77-b022fd791f67.sys.drill
 HotSpot: drill-1.11.0_hotspot.png 




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DRILL-5715) Performance of refactored HashAgg operator regressed

2017-08-10 Thread Kunal Khatua (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua updated DRILL-5715:

Attachment: 26736242-d084-6604-aac9-927e729da755.sys.drill
26736615-9e86-dac9-ad77-b022fd791f67.sys.drill
2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill
2675de42-3789-47b8-29e8-c5077af136db.sys.drill
drill-1.11.0_callTree.png
drill-1.11.0_hotspot.png
drill-1.10.0_callTree.png
drill-1.10.0_hotspot.png

> Performance of refactored HashAgg operator regressed
> 
>
> Key: DRILL-5715
> URL: https://issues.apache.org/jira/browse/DRILL-5715
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Codegen
>Affects Versions: 1.11.0
> Environment: 10-node RHEL 6.4 (32 Core, 256GB RAM)
>Reporter: Kunal Khatua
>Assignee: Boaz Ben-Zvi
>  Labels: performance, regression
> Fix For: 1.12.0
>
> Attachments: 26736242-d084-6604-aac9-927e729da755.sys.drill, 
> 26736615-9e86-dac9-ad77-b022fd791f67.sys.drill, 
> 2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill, 
> 2675de42-3789-47b8-29e8-c5077af136db.sys.drill, drill-1.10.0_callTree.png, 
> drill-1.10.0_hotspot.png, drill-1.11.0_callTree.png, drill-1.11.0_hotspot.png
>
>
> When running the following simple HashAgg-based query on a TPCH-table - 
> Lineitem with 6Billion rows on a 10 node setup (with a single partition to 
> disable any possible spilling to disk)
> {code:sql}
> select count(*) 
> from (
>   select l_quantity
> , count(l_orderkey) 
>   from lineitem 
>   group by l_quantity 
> )  {code}
> the runtime increased from {{7.378 sec}} to {{11.323 sec}} [reported by the 
> JDBC client].
> To disable spill-to-disk in Drill-1.11.0, the {{drill-override.conf}} was 
> modified to 
> {code}drill.exec.hashagg.num_partitions : 1{code}
> Attached are two profiles
> Drill 1.10.0 : 2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill 
> Drill 1.11.0 : 2675de42-3789-47b8-29e8-c5077af136db.sys.drill 
> A separate run was done for both scenarios with the 
> {{planner.width.max_per_node=10}} and profiled with YourKit.
> Image snippets are attached, indicating the hotspots in both builds:
> Drill 1.10.0 : 
>  Profile: 26736242-d084-6604-aac9-927e729da755.sys.drill
>  HotSpot: drill-1.10.0_hotspot.jpg 
> Drill 1.11.0 : 
>  Profile: 26736615-9e86-dac9-ad77-b022fd791f67.sys.drill
>  HotSpot: drill-1.11.0_hotspot.jpg 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DRILL-5715) Performance of refactored HashAgg operator regressed

2017-08-10 Thread Kunal Khatua (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua updated DRILL-5715:

Description: 
When running the following simple HashAgg-based query on a TPCH-table - 
Lineitem with 6Billion rows on a 10 node setup (with a single partition to 
disable any possible spilling to disk)

{code:sql}
select count(*) 
from (
  select l_quantity
, count(l_orderkey) 
  from lineitem 
  group by l_quantity 
)  {code}

the runtime increased from {{7.378 sec}} to {{11.323 sec}} [reported by the 
JDBC client].

To disable spill-to-disk in Drill-1.11.0, the {{drill-override.conf}} was 
modified to 
{code}drill.exec.hashagg.num_partitions : 1{code}

Attached are two profiles
Drill 1.10.0 : [^2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill] 
Drill 1.11.0 : [^2675de42-3789-47b8-29e8-c5077af136db.sys.drill]

A separate run was done for both scenarios with the 
{{planner.width.max_per_node=10}} and profiled with YourKit.

Image snippets are attached, indicating the hotspots in both builds:
Drill 1.10.0 : 
 Profile: [^26736242-d084-6604-aac9-927e729da755.sys.drill]
 HotSpot: drill-1.10.0_hotspot.jpg
Drill 1.11.0 : 
 Profile: [^26736615-9e86-dac9-ad77-b022fd791f67.sys.drill]
 HotSpot: [^drill-1.11.0_hotspot.jpg] 


  was:
When running the following simple HashAgg-based query on a TPCH-table - 
Lineitem with 6Billion rows on a 10 node setup (with a single partition to 
disable any possible spilling to disk)

{code:sql}
select count(*) 
from (
  select l_quantity
, count(l_orderkey) 
  from lineitem 
  group by l_quantity 
)  {code}

the runtime increased from {{7.378 sec}} to {{11.323 sec}} [reported by the 
JDBC client].

To disable spill-to-disk in Drill-1.11.0, the {{drill-override.conf}} was 
modified to 
{code}drill.exec.hashagg.num_partitions : 1{code}

Attached are two profiles
Drill 1.10.0 : 2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill 
Drill 1.11.0 : 2675de42-3789-47b8-29e8-c5077af136db.sys.drill 

A separate run was done for both scenarios with the 
{{planner.width.max_per_node=10}} and profiled with YourKit.

Image snippets are attached, indicating the hotspots in both builds:
Drill 1.10.0 : 
 Profile: 26736242-d084-6604-aac9-927e729da755.sys.drill
 HotSpot: drill-1.10.0_hotspot.jpg 
Drill 1.11.0 : 
 Profile: 26736615-9e86-dac9-ad77-b022fd791f67.sys.drill
 HotSpot: drill-1.11.0_hotspot.jpg 



> Performance of refactored HashAgg operator regressed
> 
>
> Key: DRILL-5715
> URL: https://issues.apache.org/jira/browse/DRILL-5715
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Codegen
>Affects Versions: 1.11.0
> Environment: 10-node RHEL 6.4 (32 Core, 256GB RAM)
>Reporter: Kunal Khatua
>Assignee: Boaz Ben-Zvi
>  Labels: performance, regression
> Fix For: 1.12.0
>
> Attachments: 26736242-d084-6604-aac9-927e729da755.sys.drill, 
> 26736615-9e86-dac9-ad77-b022fd791f67.sys.drill, 
> 2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill, 
> 2675de42-3789-47b8-29e8-c5077af136db.sys.drill, drill-1.10.0_callTree.png, 
> drill-1.10.0_hotspot.png, drill-1.11.0_callTree.png, drill-1.11.0_hotspot.png
>
>
> When running the following simple HashAgg-based query on a TPCH-table - 
> Lineitem with 6Billion rows on a 10 node setup (with a single partition to 
> disable any possible spilling to disk)
> {code:sql}
> select count(*) 
> from (
>   select l_quantity
> , count(l_orderkey) 
>   from lineitem 
>   group by l_quantity 
> )  {code}
> the runtime increased from {{7.378 sec}} to {{11.323 sec}} [reported by the 
> JDBC client].
> To disable spill-to-disk in Drill-1.11.0, the {{drill-override.conf}} was 
> modified to 
> {code}drill.exec.hashagg.num_partitions : 1{code}
> Attached are two profiles
> Drill 1.10.0 : [^2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill] 
> Drill 1.11.0 : [^2675de42-3789-47b8-29e8-c5077af136db.sys.drill]
> A separate run was done for both scenarios with the 
> {{planner.width.max_per_node=10}} and profiled with YourKit.
> Image snippets are attached, indicating the hotspots in both builds:
> Drill 1.10.0 : 
>  Profile: [^26736242-d084-6604-aac9-927e729da755.sys.drill]
>  HotSpot: drill-1.10.0_hotspot.jpg
> Drill 1.11.0 : 
>  Profile: [^26736615-9e86-dac9-ad77-b022fd791f67.sys.drill]
>  HotSpot: [^drill-1.11.0_hotspot.jpg] 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DRILL-5715) Performance of refactored HashAgg operator regressed

2017-08-10 Thread Kunal Khatua (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua updated DRILL-5715:

Description: 
When running the following simple HashAgg-based query on a TPCH-table - 
Lineitem with 6Billion rows on a 10 node setup (with a single partition to 
disable any possible spilling to disk)

{code:sql}
select count(*) 
from (
  select l_quantity
, count(l_orderkey) 
  from lineitem 
  group by l_quantity 
)  {code}

the runtime increased from {{7.378 sec}} to {{11.323 sec}} [reported by the 
JDBC client].

To disable spill-to-disk in Drill-1.11.0, the {{drill-override.conf}} was 
modified to 
{code}drill.exec.hashagg.num_partitions : 1{code}

Attached are two profiles
Drill 1.10.0 : [^2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill] 
Drill 1.11.0 : [^2675de42-3789-47b8-29e8-c5077af136db.sys.drill]

A separate run was done for both scenarios with the 
{{planner.width.max_per_node=10}} and profiled with YourKit.

Image snippets are attached, indicating the hotspots in both builds:
Drill 1.10.0 : 
 Profile: [^26736242-d084-6604-aac9-927e729da755.sys.drill]
 CallTree: [^drill-1.10.0_callTree.png]
 HotSpot: [^drill-1.10.0_hotspot.png]
Drill 1.11.0 : 
 Profile: [^26736615-9e86-dac9-ad77-b022fd791f67.sys.drill]
 CallTree: [^drill-1.11.0_callTree.png]
 HotSpot: [^drill-1.11.0_hotspot.png] 


  was:
When running the following simple HashAgg-based query on a TPCH-table - 
Lineitem with 6Billion rows on a 10 node setup (with a single partition to 
disable any possible spilling to disk)

{code:sql}
select count(*) 
from (
  select l_quantity
, count(l_orderkey) 
  from lineitem 
  group by l_quantity 
)  {code}

the runtime increased from {{7.378 sec}} to {{11.323 sec}} [reported by the 
JDBC client].

To disable spill-to-disk in Drill-1.11.0, the {{drill-override.conf}} was 
modified to 
{code}drill.exec.hashagg.num_partitions : 1{code}

Attached are two profiles
Drill 1.10.0 : [^2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill] 
Drill 1.11.0 : [^2675de42-3789-47b8-29e8-c5077af136db.sys.drill]

A separate run was done for both scenarios with the 
{{planner.width.max_per_node=10}} and profiled with YourKit.

Image snippets are attached, indicating the hotspots in both builds:
Drill 1.10.0 : 
 Profile: [^26736242-d084-6604-aac9-927e729da755.sys.drill]
 HotSpot: drill-1.10.0_hotspot.jpg
Drill 1.11.0 : 
 Profile: [^26736615-9e86-dac9-ad77-b022fd791f67.sys.drill]
 HotSpot: [^drill-1.11.0_hotspot.jpg] 



> Performance of refactored HashAgg operator regressed
> 
>
> Key: DRILL-5715
> URL: https://issues.apache.org/jira/browse/DRILL-5715
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Codegen
>Affects Versions: 1.11.0
> Environment: 10-node RHEL 6.4 (32 Core, 256GB RAM)
>Reporter: Kunal Khatua
>Assignee: Boaz Ben-Zvi
>  Labels: performance, regression
> Fix For: 1.12.0
>
> Attachments: 26736242-d084-6604-aac9-927e729da755.sys.drill, 
> 26736615-9e86-dac9-ad77-b022fd791f67.sys.drill, 
> 2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill, 
> 2675de42-3789-47b8-29e8-c5077af136db.sys.drill, drill-1.10.0_callTree.png, 
> drill-1.10.0_hotspot.png, drill-1.11.0_callTree.png, drill-1.11.0_hotspot.png
>
>
> When running the following simple HashAgg-based query on a TPCH-table - 
> Lineitem with 6Billion rows on a 10 node setup (with a single partition to 
> disable any possible spilling to disk)
> {code:sql}
> select count(*) 
> from (
>   select l_quantity
> , count(l_orderkey) 
>   from lineitem 
>   group by l_quantity 
> )  {code}
> the runtime increased from {{7.378 sec}} to {{11.323 sec}} [reported by the 
> JDBC client].
> To disable spill-to-disk in Drill-1.11.0, the {{drill-override.conf}} was 
> modified to 
> {code}drill.exec.hashagg.num_partitions : 1{code}
> Attached are two profiles
> Drill 1.10.0 : [^2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill] 
> Drill 1.11.0 : [^2675de42-3789-47b8-29e8-c5077af136db.sys.drill]
> A separate run was done for both scenarios with the 
> {{planner.width.max_per_node=10}} and profiled with YourKit.
> Image snippets are attached, indicating the hotspots in both builds:
> Drill 1.10.0 : 
>  Profile: [^26736242-d084-6604-aac9-927e729da755.sys.drill]
>  CallTree: [^drill-1.10.0_callTree.png]
>  HotSpot: [^drill-1.10.0_hotspot.png]
> Drill 1.11.0 : 
>  Profile: [^26736615-9e86-dac9-ad77-b022fd791f67.sys.drill]
>  CallTree: [^drill-1.11.0_callTree.png]
>  HotSpot: [^drill-1.11.0_hotspot.png] 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DRILL-5715) Performance of refactored HashAgg operator regressed

2017-08-10 Thread Kunal Khatua (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua updated DRILL-5715:

Description: 
When running the following simple HashAgg-based query on a TPCH-table - 
Lineitem with 6Billion rows on a 10 node setup (with a single partition to 
disable any possible spilling to disk)

{code:sql}
select count(*) 
from (
  select l_quantity
, count(l_orderkey) 
  from lineitem 
  group by l_quantity 
)  {code}

the runtime increased from {{7.378 sec}} to {{11.323 sec}} [reported by the 
JDBC client].

To disable spill-to-disk in Drill-1.11.0, the {{drill-override.conf}} was 
modified to 
{code}drill.exec.hashagg.num_partitions : 1{code}

Attached are two profiles
Drill 1.10.0 : [^2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill] 
Drill 1.11.0 : [^2675de42-3789-47b8-29e8-c5077af136db.sys.drill]

A separate run was done for both scenarios with the 
{{planner.width.max_per_node=10}} and profiled with YourKit.

Image snippets are attached, indicating the hotspots in both builds:
*Drill 1.10.0* : 
 Profile: [^26736242-d084-6604-aac9-927e729da755.sys.drill]
 CallTree: [^drill-1.10.0_callTree.png]
 HotSpot: [^drill-1.10.0_hotspot.png]
!drill-1.10.0_hotspot.png|drill-1.10.0_hotspot!

*Drill 1.11.0* : 
 Profile: [^26736615-9e86-dac9-ad77-b022fd791f67.sys.drill]
 CallTree: [^drill-1.11.0_callTree.png]
 HotSpot: [^drill-1.11.0_hotspot.png] 
!drill-1.11.0_hotspot.png|drill-1.11.0_hotspot!


  was:
When running the following simple HashAgg-based query on a TPCH-table - 
Lineitem with 6Billion rows on a 10 node setup (with a single partition to 
disable any possible spilling to disk)

{code:sql}
select count(*) 
from (
  select l_quantity
, count(l_orderkey) 
  from lineitem 
  group by l_quantity 
)  {code}

the runtime increased from {{7.378 sec}} to {{11.323 sec}} [reported by the 
JDBC client].

To disable spill-to-disk in Drill-1.11.0, the {{drill-override.conf}} was 
modified to 
{code}drill.exec.hashagg.num_partitions : 1{code}

Attached are two profiles
Drill 1.10.0 : [^2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill] 
Drill 1.11.0 : [^2675de42-3789-47b8-29e8-c5077af136db.sys.drill]

A separate run was done for both scenarios with the 
{{planner.width.max_per_node=10}} and profiled with YourKit.

Image snippets are attached, indicating the hotspots in both builds:
Drill 1.10.0 : 
 Profile: [^26736242-d084-6604-aac9-927e729da755.sys.drill]
 CallTree: [^drill-1.10.0_callTree.png]
 HotSpot: [^drill-1.10.0_hotspot.png]
Drill 1.11.0 : 
 Profile: [^26736615-9e86-dac9-ad77-b022fd791f67.sys.drill]
 CallTree: [^drill-1.11.0_callTree.png]
 HotSpot: [^drill-1.11.0_hotspot.png] 



> Performance of refactored HashAgg operator regressed
> 
>
> Key: DRILL-5715
> URL: https://issues.apache.org/jira/browse/DRILL-5715
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Codegen
>Affects Versions: 1.11.0
> Environment: 10-node RHEL 6.4 (32 Core, 256GB RAM)
>Reporter: Kunal Khatua
>Assignee: Boaz Ben-Zvi
>  Labels: performance, regression
> Fix For: 1.12.0
>
> Attachments: 26736242-d084-6604-aac9-927e729da755.sys.drill, 
> 26736615-9e86-dac9-ad77-b022fd791f67.sys.drill, 
> 2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill, 
> 2675de42-3789-47b8-29e8-c5077af136db.sys.drill, drill-1.10.0_callTree.png, 
> drill-1.10.0_hotspot.png, drill-1.11.0_callTree.png, drill-1.11.0_hotspot.png
>
>
> When running the following simple HashAgg-based query on a TPCH-table - 
> Lineitem with 6Billion rows on a 10 node setup (with a single partition to 
> disable any possible spilling to disk)
> {code:sql}
> select count(*) 
> from (
>   select l_quantity
> , count(l_orderkey) 
>   from lineitem 
>   group by l_quantity 
> )  {code}
> the runtime increased from {{7.378 sec}} to {{11.323 sec}} [reported by the 
> JDBC client].
> To disable spill-to-disk in Drill-1.11.0, the {{drill-override.conf}} was 
> modified to 
> {code}drill.exec.hashagg.num_partitions : 1{code}
> Attached are two profiles
> Drill 1.10.0 : [^2675cc73-9481-16e0-7d21-5f1338611e5f.sys.drill] 
> Drill 1.11.0 : [^2675de42-3789-47b8-29e8-c5077af136db.sys.drill]
> A separate run was done for both scenarios with the 
> {{planner.width.max_per_node=10}} and profiled with YourKit.
> Image snippets are attached, indicating the hotspots in both builds:
> *Drill 1.10.0* : 
>  Profile: [^26736242-d084-6604-aac9-927e729da755.sys.drill]
>  CallTree: [^drill-1.10.0_callTree.png]
>  HotSpot: [^drill-1.10.0_hotspot.png]
> !drill-1.10.0_hotspot.png|drill-1.10.0_hotspot!
> *Drill 1.11.0* : 
>  Profile: [^26736615-9e86-dac9-ad77-b022fd791f67.sys.drill]
>  CallTree: [^drill-1.11.0_callTree.png]
>  HotSpot: [^drill-1.11.0_hotspot.png] 
> !drill-1.11.0_hotspot.png|drill-1.11.0_hotspot!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy commented on DRILL-5697:
-

I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a CharSequence wrapper for the VarChar UTF-8 input. If the input is 
all ASCII, we read each byte directly as a character via PlatformDependent. 
Otherwise, we decode the UTF-8 bytes, copy them into a CharBuffer, and read 
characters from that. 
2. Regex matching is done on this CharSequenceWrapper, which provides the 
charAt functionality explained above (a sketch of these two read paths follows 
below).
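
As a rough illustration of the two read paths (a simplified standalone sketch, not Drill's actual CharSequenceWrapper, which reads from a DrillBuf with per-value offsets):

{code:java}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class Utf8Views {
  // Returns a CharSequence view over UTF-8 bytes, mirroring the two paths above.
  static CharSequence wrap(final byte[] utf8, boolean allAscii) {
    if (allAscii) {
      // ASCII fast path: each byte is one character, so no decoding is needed.
      return new CharSequence() {
        @Override public int length() { return utf8.length; }
        @Override public char charAt(int i) { return (char) (utf8[i] & 0x7f); }
        @Override public CharSequence subSequence(int s, int e) {
          return new String(utf8, s, e - s, StandardCharsets.US_ASCII);
        }
      };
    }
    // Non-ASCII path: decode the bytes once into a CharBuffer and read from it.
    return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(utf8));
  }
}
{code}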

All the numbers below are the processing time of the filter operation.

Baseline:
select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec

For the all-ASCII case, since getByte does a bounds check on every call, I 
wanted to see if getting all the bytes in one shot is better. That did not help 
much with performance; in fact, it made things worse for the 'a%' type of match.

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)

Using find() instead of matcher.matches(): the numbers are better, but not by much.
select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)

Next, I tried always building the CharBuffer (even when the input is all 
ASCII) and using the String functions startsWith, endsWith, and contains. 
The numbers are better, but not by much.
select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)

I tried the Google RE2 library and got much worse numbers than with the Java 
regex library.

Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers (see the sketch after 
these results).

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)
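
For reference, here is a minimal sketch of what such special-case matchers can look like (illustrative only, not the actual Drill code; {{p}} is the literal text around the {{%}} wildcards, so no regex machinery is involved):

{code:java}
// Pattern 'abc%': compare the leading characters only.
static boolean likeStartsWith(CharSequence s, String p) {
  if (s.length() < p.length()) return false;
  for (int i = 0; i < p.length(); i++) {
    if (s.charAt(i) != p.charAt(i)) return false;
  }
  return true;
}

// Pattern '%abc': compare the trailing characters only.
static boolean likeEndsWith(CharSequence s, String p) {
  int off = s.length() - p.length();
  if (off < 0) return false;
  for (int i = 0; i < p.length(); i++) {
    if (s.charAt(off + i) != p.charAt(i)) return false;
  }
  return true;
}

// Pattern '%abc%': naive sliding-window search, good enough for short literals.
static boolean likeContains(CharSequence s, String p) {
  outer:
  for (int start = 0; start + p.length() <= s.length(); start++) {
    for (int i = 0; i < p.length(); i++) {
      if (s.charAt(start + i) != p.charAt(i)) continue outer;
    }
    return true;
  }
  return false;
}
{code}

Patterns that do not fit one of these shapes would still fall back to java.util.regex.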


> Improve performance of filter operator for pattern matching
> ---
>
> Key: DRILL-5697
> URL: https://issues.apache.org/jira/browse/DRILL-5697
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Execution - Flow
>Affects Versions: 1.11.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
>
> Queries using filter with sql like operator use Java regex library for 
> pattern matching. However, for cases like %abc (ends with abc), abc% (starts 
> with abc), %abc% (contains abc), it is observed that implementing these cases 
> with simple code instead of using regex library provides good performance 
> boost (4-6x). Idea is to use special case code for simple, common cases and 
> fall back to Java regex library for complicated ones. That will provide good 
> performance benefit for most common cases.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 9:52 PM:
--

I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

Baseline:
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec

For all ASCII, since getByte is doing a bounds check every time we call it, I 
want to see if getting the bytes  in one shot is better. That did not help much 
with performance. In fact, it made it worse for 'a%' type of  match.

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)

Use find instead of matcher.matches(). The numbers are better, but not by much.
select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)

Next, I tried building charBuffer always (even if it is all ASCII) and use 
String functions startsWith, endsWith and contains.
Numbers are better. But, not by much.
select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)

I tried Google RE2 library. Got much worse numbers than what we are getting 
with Java Regex Library.

Finally, I implemented simple character by character comparison functions for 
each of the special cases 
and got pretty good numbers.

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)














was (Author: ppenumarthy):
I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

Baseline:
select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec

For all ASCII, since getByte is doing a bounds check every time we call it, I 
want to see if getting the bytes  in one shot is better. That did not help much 
with performance. In fact, it made it worse for 'a%' type of  match.

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)

Use find instead of matcher.matches(). The numbers are better, but not by much.
select count(*) from `/Users/ppenumarthy/MA

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:01 PM:
---

I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec


For all ASCII, since getByte is doing a bounds check every time we call it, I 
want to see if getting the bytes  in one shot is better. That did not help much 
with performance. In fact, it made it worse for 'a%' type of  match.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)


Use find instead of matcher.matches(). The numbers are better, but not by much.
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


Next, I tried building charBuffer always (even if it is all ASCII) and use 
String functions startsWith, endsWith and contains.
Numbers are better. But, not by much.
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


I tried Google RE2 library. Got much worse numbers than what we are getting 
with Java Regex Library.


Finally, I implemented simple character by character comparison functions for 
each of the special cases 
and got pretty good numbers.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)














was (Author: ppenumarthy):
I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

Baseline:
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec

For all ASCII, since getByte is doing a bounds check every time we call it, I 
want to see if getting the bytes  in one shot is better. That did not help much 
with performance. In fact, it made it worse for 'a%' type of  match.

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)

Use find instead of matcher.matches(). The numbers are better, but not by much.
select count(*) fro

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:02 PM:
---

I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

Baseline:
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec




For all ASCII, since getByte is doing a bounds check every time we call it, I 
want to see if getting the bytes  in one shot is better. That did not help much 
with performance. In fact, it made it worse for 'a%' type of  match.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)


Use find instead of matcher.matches(). The numbers are better, but not by much.
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


Next, I tried building charBuffer always (even if it is all ASCII) and use 
String functions startsWith, endsWith and contains.
Numbers are better. But, not by much.
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


I tried Google RE2 library. Got much worse numbers than what we are getting 
with Java Regex Library.


Finally, I implemented simple character by character comparison functions for 
each of the special cases 
and got pretty good numbers.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)














was (Author: ppenumarthy):
I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec


For all ASCII, since getByte is doing a bounds check every time we call it, I 
want to see if getting the bytes  in one shot is better. That did not help much 
with performance. In fact, it made it worse for 'a%' type of  match.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)


Use find instead of matcher.matches(). The numbers are better, but not by much.
select cou

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:02 PM:
---

I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

Baseline:
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec


For all ASCII, since getByte is doing a bounds check every time we call it, I 
want to see if getting the bytes  in one shot is better. That did not help much 
with performance. In fact, it made it worse for 'a%' type of  match.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)


Use find instead of matcher.matches(). The numbers are better, but not by much.
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


Next, I tried building charBuffer always (even if it is all ASCII) and use 
String functions startsWith, endsWith and contains.
Numbers are better. But, not by much.
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


I tried Google RE2 library. Got much worse numbers than what we are getting 
with Java Regex Library.


Finally, I implemented simple character by character comparison functions for 
each of the special cases 
and got pretty good numbers.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)














was (Author: ppenumarthy):
I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

Baseline:
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec




For all ASCII, since getByte is doing a bounds check every time we call it, I 
want to see if getting the bytes  in one shot is better. That did not help much 
with performance. In fact, it made it worse for 'a%' type of  match.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)


Use find instead of matcher.matches(). The numbers are better, but not by much.
select count

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:03 PM:
---

I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec


*For all ASCII, since getByte is doing a bounds check every time we call it, I 
want to see if getting the bytes  in one shot is better. That did not help much 
with performance. In fact, it made it worse for 'a%' type of  match.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)


*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried building charBuffer always (even if it is all ASCII) and use 
String functions startsWith, endsWith and contains.
Numbers are better. But, not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried Google RE2 library. Got much worse numbers than what we are getting 
with Java Regex Library.


Finally, I implemented simple character by character comparison functions for 
each of the special cases 
and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)














was (Author: ppenumarthy):
I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

Baseline:
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec


For all ASCII, since getByte is doing a bounds check every time we call it, I 
want to see if getting the bytes  in one shot is better. That did not help much 
with performance. In fact, it made it worse for 'a%' type of  match.

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)


Use find instead of matcher.matches(). The numbers are better, but not by much.
sel

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:05 PM:
---

I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


**Next, I tried building charBuffer always (even if it is all ASCII) and use 
String functions startsWith, endsWith and contains.*
Numbers are better. But, not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried Google RE2 library. Got much worse numbers than what we are getting 
with Java Regex Library.


Finally, I implemented simple character by character comparison functions for 
each of the special cases 
and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)














was (Author: ppenumarthy):
I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like ‘%a%’;
32 sec (vs 1min 6s baseline)


*Next, I tried building charBuffer always (even if it is all ASCII) and use 
String functions startsWith, endsWith and contains.
Numbers are better. But, not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like ‘%a%’
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like ‘a%’
46  (vs 9.794s baseline)


*I tried Google RE2 library. Got much worse numbers than what we are getting 
with Java Regex Library.


Finally, I implemented simple character by character comparison functions for 
each of the special cases 
and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:06 PM:
---

I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried building charBuffer always (even if it is all ASCII) and use 
String functions startsWith, endsWith and contains. Numbers are better. But, 
not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried Google RE2 library. Got much worse numbers than what we are getting 
with Java Regex Library.


Finally, I implemented simple character by character comparison functions for 
each of the special cases 
and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)














was (Author: ppenumarthy):
I did bunch of experiments to figure out what should be the best approach.

Basically, here is what we do for "like" operation :
1. Build a charSequence wrapper for varChar UTF8 input.  If input is all ASCII, 
we directly read the byte as character from PlatformDependent. Else, we decode 
UTF-8 bytes, copy them to charBuffer and read characters from that. 
2. regex matching is done on this charSequenceWrapper, which provides charAt 
functionality as explained above.

All the numbers below are processing time of filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*

*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTEC

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:07 PM:
---

I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*

*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)


was (Author: ppenumarthy):
I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*

*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTE

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:06 PM:
---

I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*


*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)


was (Author: ppenumarthy):
I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*

*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTE

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:07 PM:
---

I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*

*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)


was (Author: ppenumarthy):
I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*


*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRT

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:08 PM:
---

I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*

*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)


was (Author: ppenumarthy):
I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*

*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTEC

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:05 PM:
---

I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*

*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)


was (Author: ppenumarthy):
I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec


*For all ASCII, since getByte does a bounds check every time we call it, I 
wanted to see if getting the bytes in one shot is better. That did not help much 
with performance. In fact, it made it worse for the 'a%' type of match.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s (vs 1m 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s (vs 9.7 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec (vs 1min 6 sec baseline)
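
For illustration, this is roughly the difference between per-byte checked reads and a single bulk copy, sketched here with java.nio.ByteBuffer standing in for the direct-memory buffer:

{code:java}
import java.nio.ByteBuffer;

public class BulkVsPerByte {
  // Per-byte path: every get(i) performs its own bounds check.
  static int countPerByte(ByteBuffer buf, byte target) {
    int count = 0;
    for (int i = 0; i < buf.limit(); i++) {
      if (buf.get(i) == target) {
        count++;
      }
    }
    return count;
  }

  // One-shot path: a single bulk transfer, then plain array indexing.
  static int countBulk(ByteBuffer buf, byte target) {
    byte[] bytes = new byte[buf.limit()];
    buf.duplicate().get(bytes); // one bulk copy instead of N checked gets
    int count = 0;
    for (byte b : bytes) {
      if (b == target) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.wrap("banana".getBytes());
    System.out.println(countPerByte(buf, (byte) 'a')); // 3
    System.out.println(countBulk(buf, (byte) 'a'));    // 3
  }
}
{code}

The extra allocation and copy per value can easily cost more than the bounds checks it saves, which would explain the regression for the 'a%' case, where a match usually needs only the first few bytes anyway.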


*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/t

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:18 PM:
---

I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec

*To avoid the bounds check every time we try to get a byte, skip trying to 
figure out if it is all ASCII and just convert to the charBuffer always.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec

*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*

*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)


was (Author: ppenumarthy):
I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec

*To avoid the bounds check every time we try to get a byte, skip trying to 
figure out if it is all ASCII and just convert to the charBuffer always.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec

*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testd

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:18 PM:
---

I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec

*To avoid the bounds check every time we try to get a byte, skip trying to 
figure out if it is all ASCII and just convert to the charBuffer always.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec

*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*

*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)


was (Author: ppenumarthy):
I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec



*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:19 PM:
---

I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec

*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*

*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)


was (Author: ppenumarthy):
I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec

*To avoid the bounds check every time we try to get a byte, skip trying to 
figure out if it is all ASCII and just convert to the charBuffer always.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
1m 2s

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
16.688s

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
55 sec

*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like ‘

[jira] [Comment Edited] (DRILL-5697) Improve performance of filter operator for pattern matching

2017-08-10 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122402#comment-16122402
 ] 

Padma Penumarthy edited comment on DRILL-5697 at 8/10/17 10:20 PM:
---

I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec

*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Build the charBuffer always (even if it is all ASCII) and use the String functions 
startsWith, endsWith and contains. The numbers are better, but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*

*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
6.576 sec (vs. 1m 10s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
6.190s (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
11.34s (vs. 1m 6s baseline)


was (Author: ppenumarthy):
I did a bunch of experiments to figure out the best approach.

Basically, here is what we do for the "like" operation:
1. Build a charSequence wrapper for the varChar UTF-8 input. If the input is all 
ASCII, we read the byte directly as a character from PlatformDependent. Otherwise, 
we decode the UTF-8 bytes, copy them into a charBuffer, and read characters from that.
2. Regex matching is done on this charSequenceWrapper, which provides the charAt 
functionality explained above.

All the numbers below are the processing time of the filter operation.

*Baseline:*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a' 
1m 10 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
9.7 sec

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
1m 6 sec

*Use find instead of matcher.matches(). The numbers are better, but not by 
much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a';
30 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%';
14 sec (vs 9.794s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%';
32 sec (vs 1min 6s baseline)


*Next, I tried always building the charBuffer (even if it is all ASCII) and using 
the String functions startsWith, endsWith and contains. The numbers are better, 
but not by much.*
select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a'
45 sec (vs 1min 10 sec baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like '%a%'
34 sec (vs 1min 6s baseline)

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where 
l_comment like 'a%'
46 sec (vs 9.794s baseline)


*I tried the Google RE2 library. Got much worse numbers than what we are getting 
with the Java Regex library.*

*Finally, I implemented simple character-by-character comparison functions for 
each of the special cases and got pretty good numbers.*

select count(\*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where

[jira] [Commented] (DRILL-5431) Support SSL

2017-08-10 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122483#comment-16122483
 ] 

Parth Chandra commented on DRILL-5431:
--

Design doc: 
https://docs.google.com/document/d/1ghUnblzttRj94mdAaq7HzlIe0EdQc59qSeCoAr-vRZw/edit#
[~sudheeshkatkam], [~laurentgo], [~shamirwasia], [~robertw]   please review.


> Support SSL
> ---
>
> Key: DRILL-5431
> URL: https://issues.apache.org/jira/browse/DRILL-5431
> Project: Apache Drill
>  Issue Type: New Feature
>  Components: Client - Java, Client - ODBC
>Reporter: Sudheesh Katkam
>Assignee: Sudheesh Katkam
>
> Support SSL between Drillbit and JDBC/ODBC drivers. Drill already supports 
> HTTPS for web traffic.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-5713) Doing joins on tables that share column names in a JDBC store returns incorrect results

2017-08-10 Thread Timothy Farkas (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-5713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122513#comment-16122513
 ] 

Timothy Farkas commented on DRILL-5713:
---

Discussed with Paul Rogers and Juinfeng. This will actually be difficult to do, 
since we would have to change Drill's column name references to hold the table 
and schema information and propagate that information to the operators. We should 
probably fix this correctly at some point, but it is not urgent for now, so I 
will put this ticket on the back burner.

> Doing joins on tables that share column names in a JDBC store returns 
> incorrect results
> ---
>
> Key: DRILL-5713
> URL: https://issues.apache.org/jira/browse/DRILL-5713
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.11.0
> Environment: My mac running the latest drill in embedded mode.
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>  Labels: newbie
>
> If there are two tables in Postgres that share column names, incorrect 
> results are returned when a join is done between the two tables.
> For example if we have two tables: categories and categories2 with the 
> following contents:

> +---------------+---------------------+---------------+
> | categoryguid  | categoryparentguid  | categoryname  |
> +---------------+---------------------+---------------+
> | id1           | null                | restaurants   |
> | null          | id1                 | food&Dining   |
> | id2           | null                | Coffee Shops  |
> | null          | id2                 | food&Dining   |
> +---------------+---------------------+---------------+
> Then the following join query returns incorrectly named columns and incorrect 
> null values:
>
> select cat.categoryname, cat2.categoryname from 
> postgres.public.categories cat join postgres.public.categories2 cat2 on 
> (cat.categoryguid = cat2.categoryguid) where cat.categoryguid IS NOT NULL;
> +---------------+----------------+
> | categoryname  | categoryname0  |
> +---------------+----------------+
> | restaurants   | null           |
> | Coffee Shops  | null           |
> +---------------+----------------+



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DRILL-4211) Column aliases not pushed down to JDBC stores in some cases when Drill expects aliased columns to be returned.

2017-08-10 Thread Timothy Farkas (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122525#comment-16122525
 ] 

Timothy Farkas commented on DRILL-4211:
---

After discussing with Paul and Juinfeng, this issue can be fixed by pushing the 
whole select portion of the query down into the JDBC operator and requiring the 
select portion of the query to alias potentially conflicting columns.

> Column aliases not pushed down to JDBC stores in some cases when Drill 
> expects aliased columns to be returned.
> --
>
> Key: DRILL-4211
> URL: https://issues.apache.org/jira/browse/DRILL-4211
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.3.0, 1.11.0
> Environment: Postgres db stroage
>Reporter: Robert Hamilton-Smith
>Assignee: Timothy Farkas
>  Labels: newbie
>
> When making an SQL statement that incorporates a join to a table and then a 
> self join to that table to get a parent value, Drill brings back 
> inconsistent results. 
> Here is the sql in postgres with correct output:
> {code:sql}
> select trx.categoryguid,
> cat.categoryname, w1.categoryname as parentcat
> from transactions trx
> join categories cat on (cat.CATEGORYGUID = trx.CATEGORYGUID)
> join categories w1 on (cat.categoryparentguid = w1.categoryguid)
> where cat.categoryparentguid IS NOT NULL;
> {code}
> Output:
> ||categoryid||categoryname||parentcategory||
> |id1|restaurants|food&Dining|
> |id1|restaurants|food&Dining|
> |id2|Coffee Shops|food&Dining|
> |id2|Coffee Shops|food&Dining|
> When run in Drill with correct storage prefix:
> {code:sql}
> select trx.categoryguid,
> cat.categoryname, w1.categoryname as parentcat
> from db.schema.transactions trx
> join db.schema.categories cat on (cat.CATEGORYGUID = trx.CATEGORYGUID)
> join db.schema.wpfm_categories w1 on (cat.categoryparentguid = 
> w1.categoryguid)
> where cat.categoryparentguid IS NOT NULL
> {code}
> Results are:
> ||categoryid||categoryname||parentcategory||
> |id1|restaurants|null|
> |id1|restaurants|null|
> |id2|Coffee Shops|null|
> |id2|Coffee Shops|null|
> Physical plan is:
> {code:sql}
> 00-00Screen : rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) 
> categoryname, VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = 
> {110.0 rows, 110.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64293
> 00-01  Project(categoryguid=[$0], categoryname=[$1], parentcat=[$2]) : 
> rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) categoryname, 
> VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = {100.0 rows, 
> 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64292
> 00-02Project(categoryguid=[$9], categoryname=[$41], parentcat=[$47]) 
> : rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) categoryname, 
> VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = {100.0 rows, 
> 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64291
> 00-03  Jdbc(sql=[SELECT *
> FROM "public"."transactions"
> INNER JOIN (SELECT *
> FROM "public"."categories"
> WHERE "categoryparentguid" IS NOT NULL) AS "t" ON 
> "transactions"."categoryguid" = "t"."categoryguid"
> INNER JOIN "public"."categories" AS "categories0" ON "t"."categoryparentguid" 
> = "categories0"."categoryguid"]) : rowType = RecordType(VARCHAR(255) 
> transactionguid, VARCHAR(255) relatedtransactionguid, VARCHAR(255) 
> transactioncode, DECIMAL(1, 0) transactionpending, VARCHAR(50) 
> transactionrefobjecttype, VARCHAR(255) transactionrefobjectguid, 
> VARCHAR(1024) transactionrefobjectvalue, TIMESTAMP(6) transactiondate, 
> VARCHAR(256) transactiondescription, VARCHAR(50) categoryguid, VARCHAR(3) 
> transactioncurrency, DECIMAL(15, 3) transactionoldbalance, DECIMAL(13, 3) 
> transactionamount, DECIMAL(15, 3) transactionnewbalance, VARCHAR(512) 
> transactionnotes, DECIMAL(2, 0) transactioninstrumenttype, VARCHAR(20) 
> transactioninstrumentsubtype, VARCHAR(20) transactioninstrumentcode, 
> VARCHAR(50) transactionorigpartyguid, VARCHAR(255) 
> transactionorigaccountguid, VARCHAR(50) transactionrecpartyguid, VARCHAR(255) 
> transactionrecaccountguid, VARCHAR(256) transactionstatementdesc, DECIMAL(1, 
> 0) transactionsplit, DECIMAL(1, 0) transactionduplicated, DECIMAL(1, 0) 
> transactionrecategorized, TIMESTAMP(6) transactioncreatedat, TIMESTAMP(6) 
> transactionupdatedat, VARCHAR(50) transactionmatrulerefobjtype, VARCHAR(50) 
> transactionmatrulerefobjguid, VARCHAR(50) transactionmatrulerefobjvalue, 
> VARCHAR(50) transactionuserruleguid, DECIMAL(2, 0) transactionsplitorder, 
> TIMESTAMP(6) transactionprocessedat, TIMESTAMP(6) 
> transactioncategoryassignat, VARCHAR(50) transactionsystemcategoryguid, 
> VARCHAR

[jira] [Created] (DRILL-5716) Queue-based memory assignment for buffering operators

2017-08-10 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5716:
--

 Summary: Queue-based memory assignment for buffering operators
 Key: DRILL-5716
 URL: https://issues.apache.org/jira/browse/DRILL-5716
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.12.0
Reporter: Paul Rogers
Assignee: Paul Rogers


Apache Drill already has a queueing feature based on ZK semaphores. We did a 
bit of testing to show that the feature does, in fact, work. We propose to 
enhance the feature with some light revisions to make it work with the "managed" 
external sort and the newly added spilling feature for the hash agg operator. 
The key requirement is to build on what we have for now; we may want to tackle 
a larger project to create a more complete solution later.

Existing functionality:

* Two ZK-based queues called the “small” and “large” query queues.
* A threshold, call it T, given as a query cost, to determine the queue into 
which a query will go.
* Admit levels for the two queues: call them Qs and Ql.

Basically, when a query comes in:

* Plan the query as usual.
* Obtain the final query cost from the planner, call this C.
* If C
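
Based only on the pieces stated above — a cost threshold T routing each query into the small or large ZK queue, with admit levels Qs and Ql — here is a minimal sketch of the admission rule, assuming the truncated last step reads "if C is below T, use the small queue, otherwise the large one", and with java.util.concurrent.Semaphore standing in for the ZK semaphores:

{code:java}
import java.util.concurrent.Semaphore;

public class QueryQueues {
  private final Semaphore smallQueue; // admit level Qs
  private final Semaphore largeQueue; // admit level Ql
  private final double threshold;     // T: cost boundary between the queues

  QueryQueues(int qs, int ql, double threshold) {
    this.smallQueue = new Semaphore(qs);
    this.largeQueue = new Semaphore(ql);
    this.threshold = threshold;
  }

  // Route a planned query by its cost C and block until it is admitted.
  void admit(double cost) throws InterruptedException {
    Semaphore queue = cost < threshold ? smallQueue : largeQueue;
    queue.acquire(); // a ZK semaphore acquire in the real feature
  }

  void release(double cost) {
    (cost < threshold ? smallQueue : largeQueue).release();
  }
}
{code}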

[jira] [Updated] (DRILL-4211) Column aliases not pushed down to JDBC stores in some cases when Drill expects aliased columns to be returned.

2017-08-10 Thread Paul Rogers (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-4211:
---
Environment: Postgres db storage  (was: Postgres db stroage)

> Column aliases not pushed down to JDBC stores in some cases when Drill 
> expects aliased columns to be returned.
> --
>
> Key: DRILL-4211
> URL: https://issues.apache.org/jira/browse/DRILL-4211
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.3.0, 1.11.0
> Environment: Postgres db storage
>Reporter: Robert Hamilton-Smith
>Assignee: Timothy Farkas
>  Labels: newbie
>
> When making an SQL statement that incorporates a join to a table and then a 
> self join to that table to get a parent value, Drill brings back 
> inconsistent results. 
> Here is the sql in postgres with correct output:
> {code:sql}
> select trx.categoryguid,
> cat.categoryname, w1.categoryname as parentcat
> from transactions trx
> join categories cat on (cat.CATEGORYGUID = trx.CATEGORYGUID)
> join categories w1 on (cat.categoryparentguid = w1.categoryguid)
> where cat.categoryparentguid IS NOT NULL;
> {code}
> Output:
> ||categoryid||categoryname||parentcategory||
> |id1|restaurants|food&Dining|
> |id1|restaurants|food&Dining|
> |id2|Coffee Shops|food&Dining|
> |id2|Coffee Shops|food&Dining|
> When run in Drill with correct storage prefix:
> {code:sql}
> select trx.categoryguid,
> cat.categoryname, w1.categoryname as parentcat
> from db.schema.transactions trx
> join db.schema.categories cat on (cat.CATEGORYGUID = trx.CATEGORYGUID)
> join db.schema.wpfm_categories w1 on (cat.categoryparentguid = 
> w1.categoryguid)
> where cat.categoryparentguid IS NOT NULL
> {code}
> Results are:
> ||categoryid||categoryname||parentcategory||
> |id1|restaurants|null|
> |id1|restaurants|null|
> |id2|Coffee Shops|null|
> |id2|Coffee Shops|null|
> Physical plan is:
> {code:sql}
> 00-00  Screen : rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) 
> categoryname, VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = 
> {110.0 rows, 110.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64293
> 00-01  Project(categoryguid=[$0], categoryname=[$1], parentcat=[$2]) : 
> rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) categoryname, 
> VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = {100.0 rows, 
> 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64292
> 00-02  Project(categoryguid=[$9], categoryname=[$41], parentcat=[$47]) 
> : rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) categoryname, 
> VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = {100.0 rows, 
> 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64291
> 00-03  Jdbc(sql=[SELECT *
> FROM "public"."transactions"
> INNER JOIN (SELECT *
> FROM "public"."categories"
> WHERE "categoryparentguid" IS NOT NULL) AS "t" ON 
> "transactions"."categoryguid" = "t"."categoryguid"
> INNER JOIN "public"."categories" AS "categories0" ON "t"."categoryparentguid" 
> = "categories0"."categoryguid"]) : rowType = RecordType(VARCHAR(255) 
> transactionguid, VARCHAR(255) relatedtransactionguid, VARCHAR(255) 
> transactioncode, DECIMAL(1, 0) transactionpending, VARCHAR(50) 
> transactionrefobjecttype, VARCHAR(255) transactionrefobjectguid, 
> VARCHAR(1024) transactionrefobjectvalue, TIMESTAMP(6) transactiondate, 
> VARCHAR(256) transactiondescription, VARCHAR(50) categoryguid, VARCHAR(3) 
> transactioncurrency, DECIMAL(15, 3) transactionoldbalance, DECIMAL(13, 3) 
> transactionamount, DECIMAL(15, 3) transactionnewbalance, VARCHAR(512) 
> transactionnotes, DECIMAL(2, 0) transactioninstrumenttype, VARCHAR(20) 
> transactioninstrumentsubtype, VARCHAR(20) transactioninstrumentcode, 
> VARCHAR(50) transactionorigpartyguid, VARCHAR(255) 
> transactionorigaccountguid, VARCHAR(50) transactionrecpartyguid, VARCHAR(255) 
> transactionrecaccountguid, VARCHAR(256) transactionstatementdesc, DECIMAL(1, 
> 0) transactionsplit, DECIMAL(1, 0) transactionduplicated, DECIMAL(1, 0) 
> transactionrecategorized, TIMESTAMP(6) transactioncreatedat, TIMESTAMP(6) 
> transactionupdatedat, VARCHAR(50) transactionmatrulerefobjtype, VARCHAR(50) 
> transactionmatrulerefobjguid, VARCHAR(50) transactionmatrulerefobjvalue, 
> VARCHAR(50) transactionuserruleguid, DECIMAL(2, 0) transactionsplitorder, 
> TIMESTAMP(6) transactionprocessedat, TIMESTAMP(6) 
> transactioncategoryassignat, VARCHAR(50) transactionsystemcategoryguid, 
> VARCHAR(50) transactionorigmandateid, VARCHAR(100) fingerprint, VARCHAR(50) 
> categoryguid0, VARCHAR(50) categoryparentguid, DECIMAL(3, 0) categorytype, 
> VARCHAR(50) categoryname, VARCHAR(50) categorydescription, VARCHAR(50) 
> partygu

[jira] [Commented] (DRILL-4211) Column aliases not pushed down to JDBC stores in some cases when Drill expects aliased columns to be returned.

2017-08-10 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122661#comment-16122661
 ] 

Paul Rogers commented on DRILL-4211:


As we also discussed, Drill has a gap when handling ambiguous columns. At run 
time, Drill retains only the column name with none of the additional 
information used at planning time. That information includes:

_storage_plugin_ . _schema_ . _table_ . _column_

Consider the following queries:

{code}
SELECT `dfs.myDir`.`customers.csv`.col, `dfs.yourDir`.`customers.csv`.col
FROM `dfs.myDir`.`customers.csv`, `dfs.yourDir`.`customers.csv`
WHERE `dfs.myDir`.`customers.csv`.col = `dfs.yourDir`.`customers.csv`.col

SELECT `customers1.csv`.col, `customers2.csv`.col
FROM `customers1.csv`, `customers2.csv`
WHERE `customers1.csv`.col = `customers2.csv`.col

SELECT `dfs.myDir`.`customers.csv`.col, *
FROM `dfs.myDir`.`customers.csv`
WHERE col = 10
{code}

(This assumes that Drill allows fully-qualified names in the WHERE clause...)

In each case, the columns are unambiguous to a human. In the first, the fully 
qualified path is unique. In the second, the workspace is implied, but the 
table.column name is unique. In the third, "col" will appear twice, but it is 
clear that the "col" in the WHERE clause refers to the explicit "col", even if 
the wildcard expands to "col, foo, bar".

At run time, however, Drill uses only the tail column name. In fact, another 
JIRA asks that dots in column names be treated as part of the name. (This 
allows a JSON field name of the form "customers.col"...)

The most general solution would be to modify Drill's runtime naming system to 
provide fully qualified names:

* The full plugin, workspace, table, column name, represented as a 
{{SchemaPath}}.
* A "hint" in a name that says that it came from a wildcard expansion rather 
than explicit.
* When names are ambiguous (and so we want to add a "0" suffix to the name), 
prefer to add the suffix to names that come from wildcard expansion.

There are probably other rules needed to ensure that all semantically valid 
references in a query are translated into unique references at run time.

Note that, because Drill is schema-on-read, some of the name resolution 
normally done by the planner (which considers fully qualified names) must be 
pushed into the execution framework in Drill.
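
As a rough sketch of that idea (hypothetical types throughout; this is not 
Drill's actual SchemaPath API), a runtime column reference might carry the 
full path plus the wildcard hint:

{code:java}
// Hypothetical runtime column reference carrying full qualification plus
// a wildcard-origin hint; illustrative only.
public final class QualifiedColumn {
  final String plugin;        // e.g. "dfs"
  final String workspace;     // e.g. "myDir"
  final String table;         // e.g. "customers.csv"
  final String column;        // e.g. "col"
  final boolean fromWildcard; // true if produced by "*" expansion

  QualifiedColumn(String plugin, String workspace, String table,
                  String column, boolean fromWildcard) {
    this.plugin = plugin;
    this.workspace = workspace;
    this.table = table;
    this.column = column;
    this.fromWildcard = fromWildcard;
  }

  // Two references collide only when every path segment matches; the bare
  // column name alone is not enough to declare a conflict.
  boolean conflictsWith(QualifiedColumn other) {
    return plugin.equals(other.plugin)
        && workspace.equals(other.workspace)
        && table.equals(other.table)
        && column.equals(other.column);
  }

  // When a "0" suffix must be added to break a tie, prefer renaming the
  // reference that came from wildcard expansion.
  boolean preferRenameOver(QualifiedColumn other) {
    return fromWildcard && !other.fromWildcard;
  }
}
{code}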

> Column aliases not pushed down to JDBC stores in some cases when Drill 
> expects aliased columns to be returned.
> --
>
> Key: DRILL-4211
> URL: https://issues.apache.org/jira/browse/DRILL-4211
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Relational Operators
>Affects Versions: 1.3.0, 1.11.0
> Environment: Postgres db storage
>Reporter: Robert Hamilton-Smith
>Assignee: Timothy Farkas
>  Labels: newbie
>
> When making an SQL statement that incorporates a join to a table and then a 
> self join to that table to get a parent value, Drill brings back 
> inconsistent results. 
> Here is the sql in postgres with correct output:
> {code:sql}
> select trx.categoryguid,
> cat.categoryname, w1.categoryname as parentcat
> from transactions trx
> join categories cat on (cat.CATEGORYGUID = trx.CATEGORYGUID)
> join categories w1 on (cat.categoryparentguid = w1.categoryguid)
> where cat.categoryparentguid IS NOT NULL;
> {code}
> Output:
> ||categoryid||categoryname||parentcategory||
> |id1|restaurants|food&Dining|
> |id1|restaurants|food&Dining|
> |id2|Coffee Shops|food&Dining|
> |id2|Coffee Shops|food&Dining|
> When run in Drill with correct storage prefix:
> {code:sql}
> select trx.categoryguid,
> cat.categoryname, w1.categoryname as parentcat
> from db.schema.transactions trx
> join db.schema.categories cat on (cat.CATEGORYGUID = trx.CATEGORYGUID)
> join db.schema.wpfm_categories w1 on (cat.categoryparentguid = 
> w1.categoryguid)
> where cat.categoryparentguid IS NOT NULL
> {code}
> Results are:
> ||categoryid||categoryname||parentcategory||
> |id1|restaurants|null|
> |id1|restaurants|null|
> |id2|Coffee Shops|null|
> |id2|Coffee Shops|null|
> Physical plan is:
> {code:sql}
> 00-00  Screen : rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) 
> categoryname, VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = 
> {110.0 rows, 110.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64293
> 00-01  Project(categoryguid=[$0], categoryname=[$1], parentcat=[$2]) : 
> rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) categoryname, 
> VARCHAR(50) parentcat): rowcount = 100.0, cumulative cost = {100.0 rows, 
> 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 64292
> 00-02  Project(categoryguid=[$9], categoryname=[$41], parentcat=[$47]) 
> : rowType = RecordType(VARCHAR(50) categoryguid, VARCHAR(50) category