Re: [VOTE] Release Apache Drill 1.5.0 RC3

2016-02-09 Thread Jason Altekruse
For anyone who jumped on testing out the release candidate early, I'm going
to have to ask you to re-download the artifacts you verified. I had
prepared an earlier version of this candidate (but didn't get a chance to
start the vote) before another regression was identified and fixed today, and
I forgot to update the source and binary artifacts on my Apache web
space with the new ones.

I just uploaded the corrected versions of the artifacts after verifying
their git.properties files to ensure they were the correct versions.
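
For anyone repeating that check, here is a minimal sketch of comparing the
commit recorded in a git.properties file against the commit named in the vote
mail. The property key "git.commit.id" and the class and method names are
assumptions made for illustration. Whether the published artifacts record
exactly the hash from the vote mail is also an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sketch only: compares the commit recorded in git.properties text with an
// expected commit hash. The key name "git.commit.id" is an assumption.
class GitPropertiesCheck {

    /** Returns true when the properties text records exactly the expected commit. */
    static boolean matchesCommit(String propertiesText, String expectedCommit) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrown unchecked for simplicity.
            throw new UncheckedIOException(e);
        }
        String actual = props.getProperty("git.commit.id"); // assumed key
        return actual != null && actual.trim().equalsIgnoreCase(expectedCommit.trim());
    }

    public static void main(String[] args) {
        // Stand-in for the contents of a git.properties file taken from an artifact.
        String sample = "git.commit.id=3f228d34782741457a14e28b0d1fdbc35a4fd958\n";
        String expected = "3f228d34782741457a14e28b0d1fdbc35a4fd958";
        System.out.println(matchesCommit(sample, expected) ? "commit matches" : "commit MISMATCH");
    }
}
```

In practice the properties text would be read from the file inside the
downloaded tarball or jar rather than from a string literal.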

The last modified dates on the correct versions are between: 10-Feb-2016
03:30 and 10-Feb-2016 03:34

(the copy took a few minutes as my home upload speed isn't great :P)

Sorry about the mistake, everything should be good to go now.

Thanks,
Jason



On Tue, Feb 9, 2016 at 7:08 PM, Jason Altekruse 
wrote:

> Hello all,
>
> I'd like to propose the fourth release candidate (rc3) of Apache Drill,
> version
> 1.5.0. It covers a total of 60 resolved JIRAs [1]. Thanks to everyone who
> contributed to this release. This release candidate includes fixes for
> DRILL-4235 and DRILL-4380, both regressions found since the last release
> candidate.
>
> I also pulled in two bug fixes (4230, 4349) that had been merged into
> master since making the release branch; they looked useful to include and
> both had little risk of introducing regressions.
>
> The tarball artifacts are hosted at [2] and the maven artifacts are hosted
> at
> [3]. This release candidate is based on commit
> 3f228d34782741457a14e28b0d1fdbc35a4fd958 located at [4].
>
> The vote will be open for the next 72 hours ending at 7 PM Pacific,
> February 12th, 2016.
>
> [ ] +1
> [ ] +0
> [ ] -1
>
> Here's my vote: +1
>
> Thanks,
> Jason
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313820&version=12332948
>
> [2] http://people.apache.org/~json/apache-drill-1.5.0.rc3/
> [3] https://repository.apache.org/content/repositories/orgapachedrill-1028
> [4] https://github.com/jaltekruse/incubator-drill/tree/drill-1.5.0-rc3
>


[VOTE] Release Apache Drill 1.5.0 RC3

2016-02-09 Thread Jason Altekruse
Hello all,

I'd like to propose the fourth release candidate (rc3) of Apache Drill,
version
1.5.0. It covers a total of 60 resolved JIRAs [1]. Thanks to everyone who
contributed to this release. This release candidate includes fixes for
DRILL-4235 and DRILL-4380, both regressions found since the last release
candidate.

I also pulled in two bug fixes (4230, 4349) that had been merged into
master since making the release branch; they looked useful to include and
both had little risk of introducing regressions.

The tarball artifacts are hosted at [2] and the maven artifacts are hosted
at
[3]. This release candidate is based on commit
3f228d34782741457a14e28b0d1fdbc35a4fd958 located at [4].

The vote will be open for the next 72 hours ending at 7 PM Pacific,
February 12th, 2016.

[ ] +1
[ ] +0
[ ] -1

Here's my vote: +1

Thanks,
Jason

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12313820&version=12332948

[2] http://people.apache.org/~json/apache-drill-1.5.0.rc3/
[3] https://repository.apache.org/content/repositories/orgapachedrill-1028
[4] https://github.com/jaltekruse/incubator-drill/tree/drill-1.5.0-rc3


[jira] [Resolved] (DRILL-4230) NullReferenceException when SELECTing from empty mongo collection

2016-02-09 Thread Jason Altekruse (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Altekruse resolved DRILL-4230.

   Resolution: Fixed
Fix Version/s: 1.5.0

Fixed in ed2f1ca8ed3c0ebac7e33494db6749851fc2c970

This was applied separately to the 1.5 release branch, so the commit there has 
identical content and the same commit message, but will have a different hash.

> NullReferenceException when SELECTing from empty mongo collection
> -
>
> Key: DRILL-4230
> URL: https://issues.apache.org/jira/browse/DRILL-4230
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - MongoDB
>Affects Versions: 1.3.0
>Reporter: Brick Shitting Bird Jr.
>Assignee: Jason Altekruse
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] drill pull request: DRILL-4184: support variable length decimal fi...

2016-02-09 Thread daveoshinsky
GitHub user daveoshinsky opened a pull request:

https://github.com/apache/drill/pull/372

DRILL-4184: support variable length decimal fields in parquet

Support decimal fields in parquet that are stored as variable length 
BINARY.  Parquet files that store decimal values this way are often 
significantly smaller than ones storing decimal values as 
FIXED_LEN_BYTE_ARRAY's (full precision).
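
To illustrate why the variable-length form tends to be smaller, here is a
rough sketch. It assumes the usual Parquet convention that a DECIMAL is stored
as the big-endian two's-complement bytes of its unscaled value. The class and
method names are hypothetical, and the 16-byte width stands in for a
full-precision (precision 38) fixed-length column.

```java
import java.math.BigDecimal;
import java.util.Arrays;

// Sketch only: contrasts minimal (variable-length) and padded (fixed-length)
// byte encodings of a decimal's unscaled value. Not Drill or Parquet code.
class DecimalEncodingSketch {

    /** Minimal big-endian two's-complement bytes of the unscaled value. */
    static byte[] variableLength(BigDecimal value) {
        return value.unscaledValue().toByteArray();
    }

    /** The same value sign-extended to a fixed width. */
    static byte[] fixedLength(BigDecimal value, int width) {
        byte[] minimal = variableLength(value);
        if (minimal.length > width) {
            throw new IllegalArgumentException(
                "value needs " + minimal.length + " bytes, width is " + width);
        }
        byte[] padded = new byte[width];
        // Pad with 0xFF for negative values and 0x00 otherwise so the sign survives.
        Arrays.fill(padded, value.signum() < 0 ? (byte) 0xFF : (byte) 0x00);
        System.arraycopy(minimal, 0, padded, width - minimal.length, minimal.length);
        return padded;
    }

    public static void main(String[] args) {
        BigDecimal price = new BigDecimal("123.45"); // unscaled value 12345, scale 2
        System.out.println("variable: " + variableLength(price).length + " bytes");
        System.out.println("fixed(16): " + fixedLength(price, 16).length + " bytes");
    }
}
```

For a small value such as 123.45 the variable-length form needs 2 bytes where
the fixed-length form always spends the full width, which is where the size
difference comes from.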

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/daveoshinsky/drill master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/372.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #372


commit 9a47ca52125139d88adf39b5d894a02f870f37d9
Author: U-COMMVAULT-NJ\doshinsky 
Date:   2016-02-09T22:37:47Z

DRILL-4184: support variable length decimal fields in parquet

commit dec00a808c99554f008e23fd21b944b858aa9ae0
Author: daveoshinsky 
Date:   2016-02-09T22:56:28Z

DRILL-4184: changes to support variable length decimal fields in parquet




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request: DRILL-4380: Fix performance regression: in cre...

2016-02-09 Thread hnfgns
Github user hnfgns commented on a diff in the pull request:

https://github.com/apache/drill/pull/369#discussion_r52383230
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java ---
@@ -233,7 +233,7 @@ private FileSelection expandSelection(DrillFileSystem fs, FileSelection selectio
     // /a/b/c.parquet and the format of the selection root must match that of the file names
     // otherwise downstream operations such as partition pruning can break.
     final Path metaRootPath = Path.getPathWithoutSchemeAndAuthority(metaRootDir.getPath());
-    final FileSelection newSelection = FileSelection.create(null, fileNames, metaRootPath.toString());
+    final FileSelection newSelection = new FileSelection(selection.getStatuses(fs), fileNames, metaRootPath.toString());
--- End diff --

Filed DRILL-4381. Thanks.




[jira] [Resolved] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.

2016-02-09 Thread Parth Chandra (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Chandra resolved DRILL-4380.
--
Resolution: Fixed

Fixed in 7bfcb40a0ffa49a1ed27e1ff1f57378aa1136bbd. Also see DRILL-4381

> Fix performance regression: in creation of FileSelection in 
> ParquetFormatPlugin to not set files if metadata cache is available.
> 
>
> Key: DRILL-4380
> URL: https://issues.apache.org/jira/browse/DRILL-4380
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Parth Chandra
>
> The regression has been caused by the changes in 
> 367d74a65ce2871a1452361cbd13bbd5f4a6cc95 (DRILL-2618: handle queries over 
> empty folders consistently so that they report table not found rather than 
> failing.)
> In ParquetFormatPlugin, the original code created a FileSelection object in 
> the following code:
> {code}
> return new FileSelection(fileNames, metaRootPath.toString(), metadata, 
> selection.getFileStatusList(fs));
> {code}
> The selection.getFileStatusList call made an inexpensive call to 
> FileSelection.init(). The call was inexpensive because the 
> FileSelection.files member was not set and the code does not need to make an 
> expensive call to get the file statuses corresponding to the files in the 
> FileSelection.files member.
> In the new code, this is replaced by 
> {code}
>   final FileSelection newSelection = FileSelection.create(null, fileNames, 
> metaRootPath.toString());
> return ParquetFileSelection.create(newSelection, metadata);
> {code}
> This sets the FileSelection.files member but not the FileSelection.statuses 
> member. A subsequent call to FileSelection.getStatuses ( in 
> ParquetGroupScan() ) now makes an expensive call to get all the statuses.
> It appears that there was an implicit assumption that the 
> FileSelection.statuses member should be set before the FileSelection.files 
> member is set. This assumption is no longer true.





[GitHub] drill pull request: DRILL-4380: Fix performance regression: in cre...

2016-02-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/369




[jira] [Created] (DRILL-4381) Replace direct uses of FileSelection c'tor with create()

2016-02-09 Thread Hanifi Gunes (JIRA)
Hanifi Gunes created DRILL-4381:
---

 Summary: Replace direct uses of FileSelection c'tor with create()
 Key: DRILL-4381
 URL: https://issues.apache.org/jira/browse/DRILL-4381
 Project: Apache Drill
  Issue Type: Bug
Reporter: Hanifi Gunes
Assignee: Hanifi Gunes


We should avoid direct creation of FileSelection. This patch proposes either a 
re-design or removing instances where FileSelection c'tor is used directly. We 
also need more documentation around FileSelection abstraction.





[GitHub] drill pull request: DRILL-4363: Row count based pruning for parque...

2016-02-09 Thread jinfengni
GitHub user jinfengni opened a pull request:

https://github.com/apache/drill/pull/371

DRILL-4363: Row count based pruning for parquet table used in Limit n…

… query.

Modify two existing unit test cases:
1) TestPartitionFilter.testMainQueryFalseCondition(): rowCount pruning 
applied after false condition is transformed into LIMIT 0
2) TestLimitWithExchanges.testPushLimitPastUnionExchange(): modify the 
testcase to use Json source, so that it does not mix with PushLimitIntoScanRule.
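
The PR text does not spell out the pruning logic, so the following is only a
sketch of the general idea behind row-count-based pruning for a LIMIT n query:
if per-file row counts are already known (for example from Parquet metadata),
files can be dropped once the files kept so far cover n rows. The class name,
method name, and file names are hypothetical and this is not the code in the
patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: keeps files in order until their known row counts cover the limit.
class RowCountPruningSketch {

    /**
     * Returns the leading files whose row counts add up to at least `limit`.
     * For limit 0 this returns an empty list; a real planner would likely
     * still need a file for schema, which this sketch does not model.
     */
    static List<String> filesNeededForLimit(Map<String, Long> rowCountsByFile, long limit) {
        List<String> kept = new ArrayList<>();
        long rowsSoFar = 0;
        for (Map.Entry<String, Long> entry : rowCountsByFile.entrySet()) {
            if (rowsSoFar >= limit) {
                break; // enough rows are already covered; remaining files are pruned
            }
            kept.add(entry.getKey());
            rowsSoFar += entry.getValue();
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>(); // insertion order is preserved
        counts.put("a.parquet", 100L);
        counts.put("b.parquet", 50L);
        counts.put("c.parquet", 200L);
        System.out.println(filesNeededForLimit(counts, 120));
    }
}
```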

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jinfengni/incubator-drill DRILL-4363

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/371.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #371


commit a84d61fe2b820fe8395e73347dfb0e2986ed9dd0
Author: Jinfeng Ni 
Date:   2016-02-02T23:31:47Z

DRILL-4363: Row count based pruning for parquet table used in Limit n query.

Modify two existing unit test cases:
1) TestPartitionFilter.testMainQueryFalseCondition(): rowCount pruning 
applied after false condition is transformed into LIMIT 0
2) TestLimitWithExchanges.testPushLimitPastUnionExchange(): modify the 
testcase to use Json source, so that it does not mix with PushLimitIntoScanRule.






[GitHub] drill pull request: DRILL-4380: Fix performance regression: in cre...

2016-02-09 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/369#discussion_r52379092
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java ---
@@ -233,7 +233,7 @@ private FileSelection expandSelection(DrillFileSystem fs, FileSelection selectio
     // /a/b/c.parquet and the format of the selection root must match that of the file names
     // otherwise downstream operations such as partition pruning can break.
     final Path metaRootPath = Path.getPathWithoutSchemeAndAuthority(metaRootDir.getPath());
-    final FileSelection newSelection = FileSelection.create(null, fileNames, metaRootPath.toString());
+    final FileSelection newSelection = new FileSelection(selection.getStatuses(fs), fileNames, metaRootPath.toString());
--- End diff --

Agreed. Hanifi made the change to the api initially with exactly that in 
mind. This patch reverts that partially. He's logging a new Jira to fix the api 
and document usage.




[GitHub] drill pull request: DRILL-4380: Fix performance regression: in cre...

2016-02-09 Thread jacques-n
Github user jacques-n commented on a diff in the pull request:

https://github.com/apache/drill/pull/369#discussion_r52376249
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java ---
@@ -73,13 +75,18 @@ public String getSelectionRoot() {
   }
 
   public List getStatuses(final DrillFileSystem fs) throws IOException {
-    if (statuses == null) {
+    Stopwatch timer = Stopwatch.createStarted();
+
+    if (statuses == null)  {
       final List newStatuses = Lists.newArrayList();
       for (final String pathStr:files) {
         newStatuses.add(fs.getFileStatus(new Path(pathStr)));
       }
       statuses = newStatuses;
     }
+    logger.info("FileSelection.getStatuses() took {} ms, numFiles: {}",
--- End diff --

DEBUG?




[GitHub] drill pull request: DRILL-3623: Optimize limit 0 queries

2016-02-09 Thread sudheeshkatkam
Github user sudheeshkatkam commented on the pull request:

https://github.com/apache/drill/pull/364#issuecomment-182089660
  
@StevenMPhillips This is still WIP, right?

@hsuanyi and I plan to post an update soon.




[GitHub] drill pull request: DRILL-4380: Fix performance regression: in cre...

2016-02-09 Thread jacques-n
Github user jacques-n commented on a diff in the pull request:

https://github.com/apache/drill/pull/369#discussion_r52376432
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java ---
@@ -233,7 +233,7 @@ private FileSelection expandSelection(DrillFileSystem fs, FileSelection selectio
     // /a/b/c.parquet and the format of the selection root must match that of the file names
     // otherwise downstream operations such as partition pruning can break.
     final Path metaRootPath = Path.getPathWithoutSchemeAndAuthority(metaRootDir.getPath());
-    final FileSelection newSelection = FileSelection.create(null, fileNames, metaRootPath.toString());
+    final FileSelection newSelection = new FileSelection(selection.getStatuses(fs), fileNames, metaRootPath.toString());
--- End diff --

It seems like we keep having issues with misuse of this interface which 
causes planning regressions. Do you think it makes sense to either change the 
api or add additional comments to make sure people aren't doing the wrong thing?




[GitHub] drill pull request: DRILL-4380: Fix performance regression: in cre...

2016-02-09 Thread hnfgns
Github user hnfgns commented on a diff in the pull request:

https://github.com/apache/drill/pull/369#discussion_r52378537
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java ---
@@ -233,7 +233,7 @@ private FileSelection expandSelection(DrillFileSystem fs, FileSelection selectio
     // /a/b/c.parquet and the format of the selection root must match that of the file names
     // otherwise downstream operations such as partition pruning can break.
     final Path metaRootPath = Path.getPathWithoutSchemeAndAuthority(metaRootDir.getPath());
-    final FileSelection newSelection = FileSelection.create(null, fileNames, metaRootPath.toString());
+    final FileSelection newSelection = new FileSelection(selection.getStatuses(fs), fileNames, metaRootPath.toString());
--- End diff --

Whole point of making this c'tor non public was to centralize creation via 
FileSelection.create(...). Looks like we need more explicit comments over here. 
For this patch, a public c'tor seems not required as well. 
FileSelection.create(selections, null, root) should do the trick.
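
The idea of centralizing creation behind a factory can be sketched roughly as
below. This is a hypothetical stand-in, not Drill's FileSelection: the class
name, the use of strings in place of FileStatus objects, and the rule that an
empty selection yields null are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for the class under discussion; not Drill's actual code.
final class SelectionSketch {
    private final List<String> statuses; // stand-in for a list of file statuses
    private final List<String> files;
    private final String root;

    // Private: callers must go through create(), so all checks live in one place.
    private SelectionSketch(List<String> statuses, List<String> files, String root) {
        this.statuses = statuses;
        this.files = files;
        this.root = root;
    }

    /** Single entry point. Returns null when there is nothing to select (assumed rule). */
    static SelectionSketch create(List<String> statuses, List<String> files, String root) {
        boolean noStatuses = statuses == null || statuses.isEmpty();
        boolean noFiles = files == null || files.isEmpty();
        if (noStatuses && noFiles) {
            return null;
        }
        return new SelectionSketch(statuses, files, Objects.requireNonNull(root, "root"));
    }

    /** True when statuses were supplied up front, so no later lookup is needed. */
    boolean hasStatuses() {
        return statuses != null && !statuses.isEmpty();
    }

    String root() {
        return root;
    }

    public static void main(String[] args) {
        SelectionSketch fromStatuses = create(Arrays.asList("status-1"), null, "/a/b");
        SelectionSketch fromNames = create(null, Arrays.asList("/a/b/c.parquet"), "/a/b");
        System.out.println("from statuses, has statuses: " + fromStatuses.hasStatuses());
        System.out.println("from names, has statuses: " + fromNames.hasStatuses());
    }
}
```

With the constructor private, a call site cannot bypass whatever invariants
the factory enforces, which is the property the review comments are asking for.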




[GitHub] drill pull request: DRILL-4380: Fix performance regression: in cre...

2016-02-09 Thread jacques-n
Github user jacques-n commented on a diff in the pull request:

https://github.com/apache/drill/pull/369#discussion_r52376311
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java ---
@@ -118,7 +126,10 @@ public boolean apply(@Nullable FileStatus status) {
       }
     }));
 
-    return create(nonDirectories, null, selectionRoot);
+    final FileSelection fileSel = create(nonDirectories, null, selectionRoot);
+    logger.info("FileSelection.minusDirectories() took {} ms, numFiles: {}",
--- End diff --

same, DEBUG seems more appropriate.




[GitHub] drill pull request: Drill 4372

2016-02-09 Thread hsuanyi
GitHub user hsuanyi opened a pull request:

https://github.com/apache/drill/pull/370

Drill 4372

It's the prerequisite type exposure functionality, which can help speed up 
Limit-0 Queries. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hsuanyi/incubator-drill DRILL-4372

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/370.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #370


commit 734ebfae0180dfb3119b9d136b23d46e5f1c24b4
Author: Sudheesh Katkam 
Date:   2015-12-22T04:38:59Z

Validate Drill functions (argument and return types). WIP.

commit 22190c9451032680c24056f94b6c7dfcd7ae788d
Author: Hsuan-Yi Chu 
Date:   2015-12-30T22:21:10Z

DRILL-4372: Expose the functions return type to Drill






[GitHub] drill pull request: DRILL-4380: Fix performance regression: in cre...

2016-02-09 Thread adeneche
Github user adeneche commented on a diff in the pull request:

https://github.com/apache/drill/pull/369#discussion_r52377202
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java ---
@@ -118,7 +126,10 @@ public boolean apply(@Nullable FileStatus status) {
       }
     }));
 
-    return create(nonDirectories, null, selectionRoot);
+    final FileSelection fileSel = create(nonDirectories, null, selectionRoot);
+    logger.info("FileSelection.minusDirectories() took {} ms, numFiles: {}",
--- End diff --

minusDirectories() ?




[GitHub] drill pull request: DRILL-4380: Fix performance regression: in cre...

2016-02-09 Thread jacques-n
Github user jacques-n commented on a diff in the pull request:

https://github.com/apache/drill/pull/369#discussion_r52376878
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java ---
@@ -183,12 +194,16 @@ private static String buildPath(final String[] path, final int folderIndex) {
   }
 
   public static FileSelection create(final DrillFileSystem fs, final String parent, final String path) throws IOException {
+    Stopwatch timer = Stopwatch.createStarted();
     final Path combined = new Path(parent, removeLeadingSlash(path));
     final FileStatus[] statuses = fs.globStatus(combined);
     if (statuses == null) {
       return null;
     }
-    return create(Lists.newArrayList(statuses), null, combined.toUri().toString());
+    final FileSelection fileSel = create(Lists.newArrayList(statuses), null, combined.toUri().toString());
+    logger.info("FileSelection.create() took {} ms ", timer.elapsed(TimeUnit.MILLISECONDS));
--- End diff --

INFO => DEBUG
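
As a rough sketch of what the requested change amounts to: timing output of
this kind is logged at a debug level so it stays out of normal logs. To keep
the example self-contained it swaps in java.util.logging and System.nanoTime
for the slf4j logger and Guava Stopwatch used in the diff, with Level.FINE
playing the role of DEBUG. The class and method names are hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch only: times a piece of work and reports the duration at a debug-like level.
class TimedDebugLogSketch {
    private static final Logger logger = Logger.getLogger(TimedDebugLogSketch.class.getName());

    /** Runs the work, then logs the elapsed time at FINE (used here in place of DEBUG). */
    static <T> T timed(String label, Supplier<T> work) {
        long start = System.nanoTime();
        T result = work.get();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        // Supplier form: the message string is only built when FINE is enabled.
        logger.log(Level.FINE, () -> label + " took " + elapsedMs + " ms");
        return result;
    }

    public static void main(String[] args) {
        // With the default configuration FINE is not shown, so only the result prints.
        int size = timed("build list", () -> java.util.Arrays.asList(1, 2, 3).size());
        System.out.println("result: " + size);
    }
}
```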




[GitHub] drill pull request: DRILL-4380: Fix performance regression: in cre...

2016-02-09 Thread jacques-n
Github user jacques-n commented on the pull request:

https://github.com/apache/drill/pull/369#issuecomment-182081851
  
Other than INFO => DEBUG, +1




[GitHub] drill pull request: DRILL-4380: Fix performance regression: in cre...

2016-02-09 Thread parthchandra
GitHub user parthchandra opened a pull request:

https://github.com/apache/drill/pull/369

DRILL-4380: Fix performance regression: in creation of FileSelection …

…in ParquetFormatPlugin to not set files if metadata cache is available.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/parthchandra/incubator-drill DRILL-4380

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/369.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #369


commit be374c12992ef581a285b0a260bb9ad037d6df92
Author: Parth Chandra 
Date:   2015-12-18T00:30:42Z

DRILL-4380: Fix performance regression: in creation of FileSelection in 
ParquetFormatPlugin to not set files if metadata cache is available.






[jira] [Created] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.

2016-02-09 Thread Parth Chandra (JIRA)
Parth Chandra created DRILL-4380:


 Summary: Fix performance regression: in creation of FileSelection 
in ParquetFormatPlugin to not set files if metadata cache is available.
 Key: DRILL-4380
 URL: https://issues.apache.org/jira/browse/DRILL-4380
 Project: Apache Drill
  Issue Type: Bug
Reporter: Parth Chandra



The regression has been caused by the changes in 
367d74a65ce2871a1452361cbd13bbd5f4a6cc95 (DRILL-2618: handle queries over empty 
folders consistently so that they report table not found rather than failing.)

In ParquetFormatPlugin, the original code created a FileSelection object in the 
following code:
{code}
return new FileSelection(fileNames, metaRootPath.toString(), metadata, 
selection.getFileStatusList(fs));
{code}
The selection.getFileStatusList call made an inexpensive call to 
FileSelection.init(). The call was inexpensive because the FileSelection.files 
member was not set and the code does not need to make an expensive call to get 
the file statuses corresponding to the files in the FileSelection.files member.
In the new code, this is replaced by 
{code}
  final FileSelection newSelection = FileSelection.create(null, fileNames, 
metaRootPath.toString());
return ParquetFileSelection.create(newSelection, metadata);
{code}
This sets the FileSelection.files member but not the FileSelection.statuses 
member. A subsequent call to FileSelection.getStatuses ( in ParquetGroupScan() 
) now makes an expensive call to get all the statuses.

It appears that there was an implicit assumption that the 
FileSelection.statuses member should be set before the FileSelection.files 
member is set. This assumption is no longer true.
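
The cost difference described above can be modelled with a small sketch. It is
a simplified stand-in rather than Drill's FileSelection: strings replace
FileStatus objects, a counter replaces real file system calls, and the class
and field names are hypothetical. It only shows that a selection built from
file names pays one lookup per file on the first getStatuses() call, while a
selection built from already-known statuses pays none.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of the lazy status loading described above; not Drill's code.
class LazyStatusSketch {
    /** Counts simulated file system round trips made by this instance. */
    int lookups = 0;

    private List<String> statuses;    // stand-in for file statuses; null until known
    private final List<String> files; // file names; may be null when statuses are given

    LazyStatusSketch(List<String> statuses, List<String> files) {
        this.statuses = statuses;
        this.files = files;
    }

    /** Stand-in for a per-file status lookup, the expensive part. */
    private String lookupStatus(String path) {
        lookups++;
        return "status:" + path;
    }

    /** Fetches statuses only if they were never supplied, then caches them. */
    List<String> getStatuses() {
        if (statuses == null) {
            List<String> fetched = new ArrayList<>();
            for (String path : files) {
                fetched.add(lookupStatus(path));
            }
            statuses = fetched;
        }
        return statuses;
    }

    public static void main(String[] args) {
        LazyStatusSketch fromNames = new LazyStatusSketch(null, Arrays.asList("f1", "f2", "f3"));
        fromNames.getStatuses();
        LazyStatusSketch fromStatuses = new LazyStatusSketch(Arrays.asList("s1", "s2", "s3"), null);
        fromStatuses.getStatuses();
        System.out.println("lookups when built from names: " + fromNames.lookups);
        System.out.println("lookups when built from statuses: " + fromStatuses.lookups);
    }
}
```

With many files behind a metadata cache, the per-file lookups in the first
case are the planning-time cost this issue is about.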






Hangout starting now!

2016-02-09 Thread Jason Altekruse
Join us to hear the latest Drill news or to bring up any concerns you would
like to see addressed.

https://plus.google.com/hangouts/_/dremio.com/drillhangout?authuser=1


Re: [VOTE] Release Apache Drill 1.5.0 RC2

2016-02-09 Thread Jason Altekruse
Alright, that sinks the release, I'll have a new candidate up shortly.

On Tue, Feb 9, 2016 at 5:50 AM, Jacques Nadeau  wrote:

> It sounds like a blocker to me.
>
> I'm switching to -1
> On Feb 8, 2016 9:17 PM, "Sudheesh Katkam"  wrote:
>
> > I agree that there should be tests with queuing enabled (at least sanity
> > tests). I did not mean to delay the release, but this regression causes
> > all queries to fail with an illegal state transition exception (when
> > queueing is enabled).
> >
> > Thank you,
> > Sudheesh
> >
> > > On Feb 8, 2016, at 6:22 PM, Jason Altekruse wrote:
> > >
> > > The case that was reported in the JIRA was a failure on a very simple
> > > query:  select * from sys.options;
> > >
> > > I assume this means that any query will fail when queuing is enabled.
> > > That would make a strong case for inclusion in the release; I didn't look
> > > closely at the JIRA before. Hakim, you reviewed the patch, but it doesn't
> > > include any new tests. Did Hanifi mention if the change made there was
> > > necessary to pretty much fix any query when queuing was enabled?
> > >
> > > - Jason
> > >
> > > On Mon, Feb 8, 2016 at 4:57 PM, Abdel Hakim Deneche <adene...@maprtech.com>
> > > wrote:
> > >
> > >> Does it mean that any user who's been using queuing won't be able to use
> > >> 1.5.0?
> > >>
> > >> On Mon, Feb 8, 2016 at 4:40 PM, Jason Altekruse <altekruseja...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hey Sudheesh,
> > >>>
> > >>> I just pushed Venki's fix for the Web UI issue to the master branch.
> > >>>
> > >>> My fix for the build issue I ran into when trying to prepare the release
> > >>> is a fair point. The change only has a very limited impact on the build,
> > >>> and only changes the result when running a release itself. I should have
> > >>> been better at communicating the change that was made; I have posted an
> > >>> update on the JIRA I filed to do a follow-up investigation of the problem
> > >>> [1]. I didn't include it on my merge branch with Venki's change, but I
> > >>> will post it shortly associated with this new JIRA [2] for review and
> > >>> kick off the tests with the change rebased.
> > >>>
> > >>> As far as 4235 is concerned, I would like the release to be as stable as
> > >>> possible, but the release has taken quite a long time to get to vote.
> > >>> This issue was filed at the end of December, and was fixed just 4 days
> > >>> ago, with no comment on the previous release thread about including the
> > >>> fix in the release. I fully support making queuing a first-class feature
> > >>> of Drill, but we need to add automated tests for it if we want it to stay
> > >>> stable.
> > >>>
> > >>> I'm open to discussion on the topic, but I'm not sure we should delay the
> > >>> release further for it.
> > >>>
> > >>> - Jason
> > >>>
> > >>> [1] - https://issues.apache.org/jira/browse/DRILL-4336
> > >>> [2] - https://issues.apache.org/jira/browse/DRILL-4375
> > >>>
> > >>> On Mon, Feb 8, 2016 at 2:59 PM, Sudheesh Katkam <skat...@maprtech.com>
> > >>> wrote:
> > >>>
> >  Although my vote is non-binding <
> >  http://drill.apache.org/docs/project-bylaws/#actions>, I have two
> >  concerns:
> > 
> >  * DRILL-4187 caused a critical regression noted in DRILL-4235 <
> >  https://issues.apache.org/jira/browse/DRILL-4235>. There is a patch for
> >  DRILL-4235, which is not part of the release candidate. This can cause
> >  failures for users that are using the queuing feature.
> > 
> >  * There are commits made to the release branch <
> >  https://github.com/jaltekruse/incubator-drill/commits/1.5-release-rc2> in
> >  Jason's repo that are not checked in to master.
> > 
> >  Thanks,
> >  Sudheesh
> > 
> > > On Feb 8, 2016, at 2:30 PM, Jason Altekruse <altekruseja...@gmail.com>
> > > wrote:
> > >
> > > Thanks everyone who has voted so far. The vote closes tomorrow morning
> > > and right now we're only at the minimum number of binding votes for it to
> > > pass. Anyone who has some time available, please try out the release and
> > > cast a vote.
> > >
> > > On Mon, Feb 8, 2016 at 2:02 PM, Jacques Nadeau  >
> >  wrote:
> > >
> > >> Downloaded, built and ran unit tests.
> > >> Manually tried a few queries.
> > >>
> > >> Looks good
> > >>
> > >> +1 (binding)
> > >>
> > >>
> > >> --
> > >> Jacques Nadeau
> > >> CTO and Co-Founder, Dremio
> > >>
> > >> On Sun, Feb 7, 2016 at 10:03 AM, Aman Sinha  >
> >  wrote:
> > >>
> > >>> +1
> > >>> - Downloaded src and built, ran unit tests on my Mac
> > >>> - Manually ran a few queries against TPC-DS
> > >>> - Verified partition pruni

[jira] [Created] (DRILL-4379) Unexpected Table Behavior with only one subdirectory vs. Many

2016-02-09 Thread John Omernik (JIRA)
John Omernik created DRILL-4379:
---

 Summary: Unexpected Table Behavior with only one subdirectory vs. 
Many 
 Key: DRILL-4379
 URL: https://issues.apache.org/jira/browse/DRILL-4379
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning & Optimization
Affects Versions: 1.4.0
Reporter: John Omernik


A common practice is to use directories below a main directory as a 
partitioning device.  Say you have a table named "myawesomedata" and you get 
data into that table every day.  It would be valuable to create the main 
directory, then one subdirectory per day, to help optimize queries running 
against only certain days of data.

/myawesomedata/
/myawesomedata/2016-02-01
/myawesomedata/2016-02-02
/myawesomedata/2016-02-03
/myawesomedata/2016-02-04

I have identified a condition where, if there is ONLY one subdirectory, 
queries do not return the results a user would expect. 

Example:

In the above, if I run a query of 

select count(1) from `myawesomedata`;

I get accurate results of the count in all subdirectories

If I run:

select count(1) from `myawesomedata` where dir0 = '2016-02-01';

I get accurate results of the count of only the subdirectory 2016-02-01

However, if I delete subdirectories 2016-02-02, 2016-02-03, and 2016-02-04 and 
am left with:

/myawesomedata/
/myawesomedata/2016-02-01

Then if I run 

select count(1) from `myawesomedata`;

It returns the accurate count (which is just that of the 2016-02-01 directory). 

However, if I run

select count(1) from `myawesomedata` where dir0 = '2016-02-01';

It takes much longer (15 seconds vs instant on the other queries) and returns 
no results, even though this is the same query as above that worked with 2 or 
more subdirectories.  Basically, when there is only one subdirectory, a query 
asking for only that directory does not work in the same way as when there are 
more subdirectories.  This is an unexpected user experience and something I 
believe could cause user frustration and unexpected results from Drill usage on 
data. 
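
For anyone who wants to try reproducing this, the two directory layouts 
described above could be set up with something like the sketch below. It is 
only a sketch: the temporary base directory, the file name data.csv and the 
placeholder rows are assumptions made for illustration, not part of the 
original report, and the report does not say which file format was used.

```python
import os
import tempfile


def make_partitioned_table(root, days):
    """Create root/<day>/data.csv for each day and return the subdirectory names."""
    for day in days:
        day_dir = os.path.join(root, day)
        os.makedirs(day_dir, exist_ok=True)
        # Placeholder rows; the real data and format are not given in the report.
        with open(os.path.join(day_dir, "data.csv"), "w") as f:
            f.write("1,a\n2,b\n")
    return sorted(os.listdir(root))


base = tempfile.mkdtemp()

# Layout with several day subdirectories (the case that works as expected).
many = make_partitioned_table(
    os.path.join(base, "many", "myawesomedata"),
    ["2016-02-01", "2016-02-02", "2016-02-03", "2016-02-04"],
)

# Layout with ONLY one day subdirectory (the case reported as failing).
single = make_partitioned_table(
    os.path.join(base, "single", "myawesomedata"),
    ["2016-02-01"],
)

print(many)
print(single)
```

Pointing a dfs workspace at each layout and running the dir0 query above 
against both should show whether the difference is reproducible.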

 






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-4378) CONVERT_FROM in View results in table scan of MapR-DB and perhaps HBASE

2016-02-09 Thread John Omernik (JIRA)
John Omernik created DRILL-4378:
---

 Summary: CONVERT_FROM in View results in table scan of MapR-DB and 
perhaps HBASE
 Key: DRILL-4378
 URL: https://issues.apache.org/jira/browse/DRILL-4378
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning & Optimization, Storage - HBase
Affects Versions: 1.4.0
Reporter: John Omernik


 I created a view to avoid forcing users to write queries that always included 
the CONVERT_FROM statements. (I am a huge advocate of making things easy for 
the users and writing queries with CONVERT_FROM statements isn't easy). 

I ran a query the other day on one of these views and noticed that a query that 
took 30 seconds really shouldn't take 30 seconds.  What do I mean? Well, I 
wanted to get part of a record by looking up the MapR-DB row key (equiv. to 
HBASE row key).  That should be an instant lookup.  Sure enough, when I tried it 
in the hbase shell it returned instantly.  So why did Drill take 30 seconds?  
I shot an email to Ted and Jim at MapR to ask this very question. Ted suggested 
that I try the query without a view.  Sure enough, if I use the convert_from in 
a direct query, it's an instant (sub second) return.  Thus it appears something 
in the view is not allowing the query to short circuit the read.  

Ted suggests I post here  (I am curious if anyone who has HBASE setup is seeing 
this same issue with views) but also include the EXPLAIN plan.  Basically, 
using my very limited ability to read EXPLAIN plans (If someone has a pointer 
to a blog post or docs on how to read EXPLAIN I would love that!) it looks like 
in the view the startRow and stopRow in the hbaseScanSpec are not set, seeming 
to cause a scan.  Is there any way to assist the planner when running this 
through a view so that we can get the performance of the query without the view 
but with the ease of use/readability of using the view?

Thanks!!!

John

View Creation

CREATE VIEW view_testpaste as 
SELECT 
CONVERT_FROM(row_key, 'UTF8') AS pasteid,
CONVERT_FROM(pastes.pdata.lang, 'UTF8') AS lang,
CONVERT_FROM(pastes.raw.paste, 'UTF8') AS paste
FROM dfs.`pastes`.`/pastes` pastes;


Select from view takes 32 seconds (seems to be a scan)

> select paste from view_testpaste where pasteid = 'djHEHcPM'

1 row selected (32.302 seconds)


Just a direct select returns very fast (0.486 seconds)

> select CONVERT_FROM(pastes.raw.paste, 'UTF8') AS paste
FROM dfs.`pastes`.`/pastes` pastes where 
CONVERT_FROM(row_key, 'UTF8') = 'djHEHcPM';

1 row selected (0.486 seconds)
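
To make the suspected difference concrete, here is a toy sketch of why an 
unset startRow/stopRow matters. It is not Drill or HBase code: the sorted list 
of keys, the function names and the row counts are all made up for 
illustration. It only shows that a lookup bounded by a start and stop key 
reads a single row, while an unbounded scan with a filter reads every row.

```python
import bisect


def full_scan(rows, wanted):
    """No start/stop row: every row is read, then filtered on the key."""
    examined = 0
    hits = []
    for key, value in rows:
        examined += 1
        if key == wanted:
            hits.append(value)
    return hits, examined


def bounded_scan(rows, start_row, stop_row):
    """Start/stop row set: only rows with start_row <= key < stop_row are read."""
    keys = [k for k, _ in rows]  # stands in for the store's sorted key index
    lo = bisect.bisect_left(keys, start_row)
    hi = bisect.bisect_left(keys, stop_row)
    return [v for _, v in rows[lo:hi]], hi - lo


# Hypothetical table of 10000 rows, sorted by row key.
rows = sorted((f"key{i:05d}", f"paste {i}") for i in range(10000))
wanted = "key04242"

scan_hits, scan_examined = full_scan(rows, wanted)
# The smallest key greater than `wanted` serves as the exclusive stop row.
range_hits, range_examined = bounded_scan(rows, wanted, wanted + "\x00")

print(scan_hits, scan_examined)
print(range_hits, range_examined)
```

Both calls return the same row, but the first one reads all 10000 rows and 
the second reads one, which is the kind of gap seen between the view query 
and the direct query.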




EXPLAIN PLAN FOR select paste from view_testpaste where pasteid = 'djHEHcPM'

+--+--+
| text | json |
+--+--+
| 00-00Screen
00-01  UnionExchange
01-01Project(paste=[CONVERT_FROMUTF8($1)])
01-02  SelectionVectorRemover
01-03Filter(condition=[=(CONVERT_FROMUTF8($0), 'djHEHcPM')])
01-04  Project(row_key=[$1], ITEM=[ITEM($0, 'paste')])
01-05Scan(groupscan=[MapRDBGroupScan 
[HBaseScanSpec=HBaseScanSpec [tableName=maprfs:///data/pastebiner/pastes, 
startRow=null, stopRow=null, filter=null], columns=[`row_key`, `raw`.`paste`]]])
 | {
  "head" : {
"version" : 1,
"generator" : {
  "type" : "ExplainHandler",
  "info" : ""
},
"type" : "APACHE_DRILL_PHYSICAL",
"options" : [ ],
"queue" : 0,
"resultMode" : "EXEC"
  },
  "graph" : [ {
"pop" : "maprdb-scan",
"@id" : 65541,
"userName" : "darkness",
"hbaseScanSpec" : {
  "tableName" : "maprfs:///data/pastebiner/pastes",
  "startRow" : "",
  "stopRow" : "",
  "serializedFilter" : null
},
"storage" : {
  "type" : "file",
  "enabled" : true,
  "connection" : "maprfs:///",
  "workspaces" : {
"root" : {
  "location" : "/",
  "writable" : false,
  "defaultInputFormat" : null
},
 "pastes" : {
  "location" : "/data/pastebiner",
  "writable" : true,
  "defaultInputFormat" : null
},
"dev" : {
  "location" : "/data/dev",
  "writable" : true,
  "defaultInputFormat" : null
},
"hive" : {
  "location" : "/user/hive",
  "writable" : true,
  "defaultInputFormat" : null
},
"tmp" : {
  "location" : "/tmp",
  "writable" : true,
  "defaultInputFormat" : null
}
  },
  "formats" : {
"psv" : {
  "type" : "text",
  "extensions" : [ "tbl" ],
  "delimiter" : "|"
},
"csv" : {
  "type" : "text",
  "extensions" : [ "csv" ],
  "escape" : "`",
  "delimiter" : ","
},
"tsv" : {
  "type" : "text",
  "extensions" : [ "tsv" ],
  "delimiter" : "\t"
},
"parquet" : {
  "type" : "parquet"
},

Re: [VOTE] Release Apache Drill 1.5.0 RC2

2016-02-09 Thread Jacques Nadeau
It sounds like a blocker to me.

I'm switching to -1
On Feb 8, 2016 9:17 PM, "Sudheesh Katkam" wrote:

> I agree that there should be tests with queuing enabled (at least sanity
> tests). I did not mean to delay the release, but this regression causes all
> queries to fail with an illegal state transition exception (when queueing
> is enabled).
>
> Thank you,
> Sudheesh
>
> > On Feb 8, 2016, at 6:22 PM, Jason Altekruse 
> wrote:
> >
> > The case that was reported in the JIRA was a failure on a very simple
> > query:  select * from sys.options;
> >
> > I assume this means that any query will fail when queuing is enabled. That
> > would make a strong case for inclusion in the release, I didn't look
> > closely at the JIRA before. Hakim, you reviewed the patch, but it doesn't
> > include any new tests. Did Hanifi mention if the change made there was
> > necessary to pretty much fix any query when queuing was enabled?
> >
> > - Jason
> >
> > On Mon, Feb 8, 2016 at 4:57 PM, Abdel Hakim Deneche <
> adene...@maprtech.com>
> > wrote:
> >
> >> Does it mean that any user who's been using queuing won't be able to use
> >> 1.5.0 ?
> >>
> >> On Mon, Feb 8, 2016 at 4:40 PM, Jason Altekruse <
> altekruseja...@gmail.com>
> >> wrote:
> >>
> >>> Hey Sudheesh,
> >>>
> >>> I just pushed Venki's fix for the Web UI issue to the master branch.
> >>>
> >>> My fix for the build issue I ran into when trying to prepare the release
> >>> is a fair point. The change only has a very limited impact on the build,
> >>> and only changes the result when running a release itself. I should have
> >>> communicated the change that was made better; I have posted an update
> >>> on the JIRA I filed to do a follow-up investigation of the problem [1]. I
> >>> didn't include it on my merge branch with Venki's change, but I will post
> >>> it shortly associated with this new JIRA [2] for review and kick off the
> >>> tests with the change rebased.
> >>> As far as 4235 is concerned, I would like the release to be as stable
> >>> as possible, but the release has taken quite a long time to get to vote.
> >>> This issue was filed at the end of December, and was fixed just 4 days
> >>> ago, with no comment on the previous release thread about including the
> >>> fix in the release. I fully support making queuing a first-class feature
> >>> of Drill, but we need to add automated tests for it if we want it to
> >>> stay stable.
> >>>
> >>> I'm open to discussion on the topic, but I'm not sure we should delay
> >>> the release further for it.
> >>>
> >>> - Jason
> >>>
> >>> [1] - https://issues.apache.org/jira/browse/DRILL-4336
> >>> [2] - https://issues.apache.org/jira/browse/DRILL-4375
> >>>
> >>> On Mon, Feb 8, 2016 at 2:59 PM, Sudheesh Katkam 
> >>> wrote:
> >>>
>  Although my vote is non-binding <
>  http://drill.apache.org/docs/project-bylaws/#actions>, I have two
>  concerns:
> 
>  * DRILL-4187 
> >> caused a
>  critical regression noted in DRILL-4235 <
>  https://issues.apache.org/jira/browse/DRILL-4235>. There is a patch
> >> for
>  DRILL-4235, which is not part of the release candidate. This can cause
>  failures for users that are using the queuing feature.
> 
>  * There are commits made to the release branch <
>  https://github.com/jaltekruse/incubator-drill/commits/1.5-release-rc2
> >
> >>> in
>  Jason's repo that are not checked in to master.
> 
>  Thanks,
>  Sudheesh
> 
> > On Feb 8, 2016, at 2:30 PM, Jason Altekruse <
> >> altekruseja...@gmail.com>
>  wrote:
> >
> > Thanks everyone who has voted so far. The vote closes tomorrow
> >> morning
>  and
> > right now we're only at the minimum number of binding votes for it to
>  pass.
> > Anyone who has some time available, please try out the release and
> >>> cast a
> > vote.
> >
> > On Mon, Feb 8, 2016 at 2:02 PM, Jacques Nadeau 
>  wrote:
> >
> >> Downloaded, built and ran unit tests.
> >> Manually tried a few queries.
> >>
> >> Looks good
> >>
> >> +1 (binding)
> >>
> >>
> >> --
> >> Jacques Nadeau
> >> CTO and Co-Founder, Dremio
> >>
> >> On Sun, Feb 7, 2016 at 10:03 AM, Aman Sinha 
>  wrote:
> >>
> >>> +1
> >>> - Downloaded src and built, ran unit tests on my Mac
> >>> - Manually ran a few queries against TPC-DS
> >>> - Verified partition pruning, metadata caching was working as
> >>> expected
> >> for
> >>> these test queries
> >>> - Checked query profile in Web UI, checked query cancellation
> >>> - Found 1 performance issue with lots of small parquet files
> >> ...filed
> >>> DRILL-4365 but need confirmation whether it is reproducible for
> >> other
> >>> folks.  At this point, I am not considering it a blocker due to the
>  fact
> >> I
> >>> could n

Re: project build fails -> drill-jdbc-all-1.5.0-SNAPSHOT.jar is outside the expected size range

2016-02-09 Thread Jacques Nadeau
By the way, I believe you can skip the enforcer plugin execution with
 -Denforcer.skip=true
On Feb 8, 2016 8:27 PM, "Jacques Nadeau" wrote:

> I'm against removing the check as it is actually building a functionally
> invalid jdbc jar file.  Any tests using that jdbc jar are invalid since
> they include different code than the release jdbc driver. This caused major
> regressions in the 1.4 jdbc driver.
> On Feb 8, 2016 4:48 PM, "Jason Altekruse" 
> wrote:
>
>> Hey Sudheesh,
>>
>> Unfortunately it will not fix this issue, it is related specifically to how
>> the addition of the enforcer (for currently unknown reasons) caused the
>> release profile to fail in a new way. I hadn't run into issues with
>> enforcer itself actually failing with my version of Maven.
>>
>> I would be in favor of a flag to make it easier to disable this check, we
>> can even change the message to tell people about the flag (it could be
>> updated now to suggest upgrading maven), but I do think we should keep this
>> enforcer rule on by default as the 1.4 release had a pretty bloated JAR
>> because this wasn't being checked.
>>
>> - Jason
>>
>> On Mon, Feb 8, 2016 at 4:23 PM, Sudheesh Katkam 
>> wrote:
>>
>> > @Jason, does DRILL-4375 <
>> https://issues.apache.org/jira/browse/DRILL-4375>
>> > address this issue as well?
>> >
>> > > On Feb 8, 2016, at 4:19 PM, Sudheesh Katkam 
>> > wrote:
>> > >
>> > > On one of the Linux VMs, when I run mvn clean install -DskipTests
>> > -Pmapr, I get this error with 3.3.x (but not with 3.2.x). Weird.
>> > >
>> > > Should we disable the rule until we figure out the cause?
>> > >
>> > > - Sudheesh
>> > >
>> > >> On Feb 2, 2016, at 6:11 AM, Jacques Nadeau <jacq...@dremio.com> wrote:
>> > >>
>> > >> This is a bug in Maven; we haven't figured out yet how we're causing it.
>> > >> Upgrading to Maven 3.3.x fixes it.
>> > >> On Feb 2, 2016 2:18 AM, "Arina Yelchiyeva" <
>> arina.yelchiy...@gmail.com
>> > >
>> > >> wrote:
>> > >>
>> > >>> Hi all!
>> > >>>
>> > >>> Just pulled recent changes from master (revision number
>> > >>> 1b96174b1e5bafb13a873dd79f03467802d7c929) and mvn clean install
>> > -DskipTests
>> > >>> failed with the following error:
>> > >>>
>> > >>> *[ERROR] Failed to execute goal
>> > >>> org.apache.maven.plugins:maven-enforcer-plugin:1.3.1:enforce
>> > >>> (enforce-jdbc-jar-compactness) on project drill-jdbc-all: Some
>> Enforcer
>> > >>> rules have failed. Look above for specific messages explaining why
>> the
>> > rule
>> > >>> failed.*
>> > >>>
>> > >>> *[WARNING] Rule 0:
>> org.apache.maven.plugins.enforcer.RequireFilesSize
>> > >>> failed with message:*
>> > >>> *The file drill-jdbc-all-1.5.0-SNAPSHOT.jar is outside the expected
>> > size
>> > >>> range. *
>> > >>>
>> > >>> *This is likely due to you adding new dependencies to a java-exec
>> and
>> > not
>> > >>> updating the excludes in this module. This is important as it
>> > minimizes the
>> > >>> size of the dependency of Drill application users.*
>> > >>>
>> >
>> *F:\git_repo\drill\exec\jdbc-all\target\drill-jdbc-all-1.5.0-SNAPSHOT.jar
>> > >>> size (44664290) too large. Max. is 2000
>> > >>>
>> >
>> F:\git_repo\drill\exec\jdbc-all\target\drill-jdbc-all-1.5.0-SNAPSHOT.jar*
>> > >>>
>> > >>> Had to change 2000 ->
>> > 5000 in
>> > >>> jdbc-all pom.xml to build the project.
>> > >>>
>> > >>> Do we need to create jira for this or it's already being fixed?
>> > >>>
>> > >>> Kind regards
>> > >>> Arina
>> > >>>
>> > >
>> >
>> >
>>
>
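
For readers unfamiliar with the check being discussed: the enforcer rule named 
in the error above (RequireFilesSize) fails the build when a file falls 
outside a configured size range. The sketch below imitates that idea in 
Python. It is not the plugin's implementation; the function name, the message 
wording and the limits of 1000, 2000 and 5000 bytes are placeholders chosen 
for the example, not the values in the jdbc-all pom.xml.

```python
import os
import tempfile


def check_file_size(path, min_size, max_size):
    """Return an error message if the file is outside [min_size, max_size], else None."""
    size = os.path.getsize(path)
    if size < min_size:
        return f"{path} size ({size}) too small. Min. is {min_size}"
    if size > max_size:
        return f"{path} size ({size}) too large. Max. is {max_size}"
    return None


# Stand-in for the built jar: a temporary file of exactly 3000 bytes.
with tempfile.NamedTemporaryFile(suffix=".jar", delete=False) as tmp:
    tmp.write(b"\0" * 3000)
    jar_path = tmp.name

# Too large for the first (hypothetical) limit, fine once the limit is raised.
error = check_file_size(jar_path, min_size=1000, max_size=2000)
ok = check_file_size(jar_path, min_size=1000, max_size=5000)
os.remove(jar_path)

print(error)
print(ok)
```

Raising the maximum makes the check pass, which is what the pom.xml edit in 
the original report does; as noted earlier in the thread, that hides a jar 
that has grown rather than fixing it.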